# 16.2 - How SAS Match-Merges

The match-merge examples we looked at in the previous section started first with the simplest case and then increased in complexity. The truth is that match-merging can be even more complex than we've seen. For example, we haven't yet investigated the situation in which we merge by more than one variable at a time. It is because of this potential complexity that you will be best served by understanding how the DATA step performs match-merges. Only then will you be able to predict the results of, and therefore perform, match-merges correctly.

In this section, we'll revisit two examples from the previous section first Example 16.4 and then Example 16.5. Our focus this time won't be on how to perform the match-merge, but rather to see how SAS performs the match-merge.

**Example 16.6**. The following program is identical to the first program that appears in Example 16.4:

You might want to launch and re-run the program, so that you can review the output again to reacquaint yourself with the issues of this particular match-merge. Specifically, recall that, in addition to the linking variable *subj*, the *moredemog* and *morestatus* data sets share a variable called *v_date*. In this case, the variable *v_date* gets its values from the *morestatus* data set, since it is the last data set to appear in the MERGE statement.

Our focus this time is going to be to work our way through the program to see how SAS match-merges the *moredemog* and *morestatus* data sets to get the *morepatients* data set. As always, at the end of the compile phase, SAS will have created a program data vector containing the variables from the *moredemog* and *morestatus* data sets in the order in which they appear in the DATA step:

As always, SAS starts the execution phase having initialized the values of all of the data set variables in the program data vector to missing. Then, SAS looks at the first "**BY group**" in each data set to determine which BY group should appear first. In this case, the first BY group, that is, observations for which *subj* = 1000, is the same in both data sets. So, SAS reads the first observation from the *moredemog* data set:

and then SAS reads the first observation from the *morestatus* data set:

As you can see, the value of *v_date* from the *moredemog* data set (03/10/96) gets written over by the value of *v_date* from the *morestatus* data set (03/17/96). Having no more data sets from which to read, SAS writes the first observation to the *morepatients* data set, and **retains the values in the program data vector**. (If the program data vector contained variables created in the DATA step, SAS would set them to missing after writing their values to the new data set.)

**SAS then looks in each data set for a second observation** in the *subj* = 1000 BY group. **Neither data set has one**. Therefore,** SAS sets all of the values in the program data vector to missing and begins processing the next BY group**:

SAS reads the second observation from the *moredemog* data set:

and then SAS reads the second observation from the *morestatus* data set:

Again, the value of *v_date* from the *moredemog* data set (02/19/96) gets written over by the value of *v_date* from the *morestatus* data set (03/01/96). Having no more data sets from which to read, SAS writes the second observation to the *morepatients* data set, and retains the values in the program data vector.

SAS then looks in each data set for another observation in the *subj* = 1001 BY group. Because neither data set has one, SAS sets all of the values in the program data vector to missing and begins processing the next BY group. Shall we stop this seemingly endless cycle? You should have the basic idea now... SAS continues processing observations as just described until it exhausts all of the observations in both data sets.

**Example 16.7**. The following program is identical to the first program that appears in Example 16.5:

Again, you might want to launch and re-run the program, so that you can review the output again to reacquaint yourself with the issues of this particular match-merge. Specifically, recall that the observations in the first data set to be merged matches one or more observations in the second data set to be merged. For example, the *salesone* data set contains two observations for the 2004 year, while the *salestwo* data set contains three observations for the 2004 year.

Let's work our way through this match-merge. At the end of the compile phase, SAS will have created a program data vector containing the variables from the *salesone* and *salestwo *data sets in the order in which they appear in the DATA step:

Again, SAS starts the execution phase having initialized the values of all of the data set variables in the program data vector to missing. Then, SAS looks at the first "**BY group**" in each data set to determine which BY group should appear first. In this case, the first BY group, that is, observations for which *year* = 2004, is the same in both data sets. So, SAS reads the first observation from the *salesone* data set:

and then SAS reads the first observation from the *salestwo* data set:

You now shouldn't be surprised to see that the value of *sales* from the *salesone* data set (100) gets written over by the value of *sales* from the *salestwo* data set (700). Having no more data sets from which to read, SAS writes the first observation to the *allsales* data set, and retains the values in the program data vector. (Again, if the program data vector contained variables created in the DATA step, SAS would set them to missing after writing their values to the new data set.)

SAS then looks in each data set for a second observation in the *year* = 2004 BY group. Both data sets have one. Therefore, SAS reads the second observation from the *salesone* data set:

The value 7 stored in the *loc* variable is a remnant from the previous iteration of the DATA step. It doesn't get written over until SAS reads the second observation from the *salestwo* data set:

Again, the value of *sales* from the *salesone* data set (200) gets written over by the value of *sales* from the *salestwo* data set (800). Having no more data sets from which to read, SAS writes the second observation to the *allsales* data set, and retains the values in the program data vector.

SAS then looks in each data set for another observation in the *subj* = 2004 BY group. The *salesone* data set has no more, but the *salestwo* data set has one more. So, SAS retains the value of the *prd* variable from the previous iteration and reads the next observation from the *salestwo* data set:

Again, the value of *sales* that was retained from the last iteration (800) gets written over by the value of *sales* from the *salestwo* data set (900). Having no more data sets from which to read, SAS writes the third observation to the *allsales* data set, and retains the values in the program data vector.

SAS then looks in each data set for another observation in the *year* = 2004 BY group. Neither data set has one. Therefore,** **SAS sets all of the values in the program data vector to missing and begins processing the next BY group:

When is this circus going to end?! Soon... bear with us! Okay... so SAS determines that the *year* = 2005 is the next BY group and that *salesone* has an observation for that BY group, while *salestwo* does not. SAS reads the available observation from the *salesone* data set:

and having no more observations to read, SAS writes the contents of the program data vector to the fourth observation in the *allsales* data set.

SAS then looks in each data set for another observation in the *year* = 2005 BY group. Neither data set has one. Therefore,** **SAS sets all of the values in the program data vector to missing and begins processing the next BY group:

Okay, let's cry uncle! At this point, it really is a straightforward trek to the finish line provided you follow the general guidelines, which are probably worth summarizing once and for all:

- At the beginning of the execution phase, set the values in the program data vector to missing.
- Determine the first BY group, and read all of the observations from all of the data sets having the BY group.
- Write the contents of the program data vector to the output data set.
- If there are more observations in the BY group, retain the values in the program data vector.
- If there are no more observations in the BY group, set all of the values in the program data vector to missing and begin processing the next BY group.
- Continue processing observations as described until all of the observations in all of the input data sets are exhausted.

Using these guidelines, you should be able to predict the contents of just about any match-merge you encounter.