Published on *STAT 481* (https://onlinecourses.science.psu.edu/stat481)

In the last lesson, we learned different ways of combining SAS data sets — one-to-one reading, one-to-one merging, concatenating and interleaving. In this lesson, we'll finish up our work in this arena by investigating the process of ** match-merging, **in which we** ** combine two or more SAS data sets based on the values of one or more common variables using MERGE and BY statements. This method deserves its own lesson, because it is far and away the most commonly used method of combining SAS data sets. Once we've learned the how to match-merge two or more SAS data sets, we'll also spend some time exploring how to use DATA step options at the same time that match-merge.

Upon completing this lesson, you should be able to do the following:

- recognize that SAS data sets must be sorted properly in order to match-merge them
- write code to match-merge data sets that contain the same number of observations, when each observation in the first data set matches with exactly one observation in the second data set
- write code to match-merge data sets that are sorted by the BY variable in descending order
- write code to match-merge data sets that contain a different number of observations, when each observation in the first data set matches with no more than one observation in the second data set
- write code to match-merge data sets that share, aside from the variable that links the data sets together, at least one common variable name
- write code to match-merge data sets when the observations in the first data set to be merged matches one or more observations in the second data set to be merged
- understand how SAS completes a match-merge, and therefore be able to predict the outcome of a match-merge
- use the various DATA step options that are available when match-merging

In order to complete the lesson you should:

- Read the lesson pages that follows.
- Type up your answers to the homework problems in a Word file named
**homework16_yourPSUloginid**. If your PSU login id is*xyz123*, then name your file*homework16_xyz123*. - Upload the file to the
**Lesson #16 Homework Dropbox**. - Post any questions or comments you have concerning the lesson's material to the
**Lesson #16 General Discussion Board**. - Take the
**Lesson #16 Mastery Quiz**. Remember two things: i) You have 20 minutes to complete the quiz, and ii) as soon as you hit the "submit" button, your answers are submitted and graded, and the quiz becomes closed to you.

Match-merging is one of the most powerful methods of combining two or more SAS data sets. A match-merge combines observations across data sets based on the values of one or more common variables. For example, when data set *bird*:

and data set *bee*:

are match-merged by the variable *year*, we get data set *baby*:

As you can see, the observations in the *baby* data set are created by joining observations from the *bird* and *bee* data sets that share the same *year*. To match-merge, you simply specify the data sets you would like to merge in a MERGE statement, and indicate the variables on which you would like to merge in a BY statement. One thing to keep in mind, though, is **you can't match-merge SAS data sets unless they are sorted by the variables appearing in the BY statement**.

Throughout the rest of this section, we will look at examples of match-merges under various situations. In increasing order of complexity, the situations are:

- a match-merge in which the data sets to be merged contain the same number of observations, and each observation in the first data set matches with exactly one observation in the second data set
- a match-merge in which the data sets to be merged are sorted by the BY variable in descending order
- a match-merge in which the data sets to be merged contain a different number of observations, and each observation in the first data set matches with no more than one observation in the second data set
- a match-merge in which the data sets to be merged share, aside from the variable that links the data sets together, at least one common variable name
- a match-merge in which the observations in the first data set to be merged matches one or more observations in the second data set to be merged

Let's jump right in!

**Example 16.1. **The following program illustrates the simplest case of a match-merge, in which the data sets to be merged contain the same number of observations, and each observation in the first data set matches with exactly one observation in the second data set. Specifically, the *demog* and *status* data sets each contains five observations — one for each *subj* 1000, 1001, 1002, 1003, and 1004:

[1] | [2] |

Of course, the first two DATA steps just read in the respective *demog* and *status* data sets. Note that the two data sets are "**linked**" by the subject's ID number (*subj*). The meat of the match-merge takes place in the third (and last) DATA step, in which we see a MERGE statement and a BY statement. The DATA step tells SAS that we want to match-merge the data sets appearing in the MERGE statement (*demog* and *status*), by the variable appearing in the BY statement (*subj*), and to store the result in a new data set called *patients*. This is a perfectly valid request because both of the data sets (*demog* and *status*) are already sorted — in ascending order — by the *subj* variable.

Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the data set *patients* contains five observations — one observation for each of the five subjects appearing in both the *demog* and *status* data sets.

**Example 16.2. **The following program is another example of a simple match-merge, except this time the data sets to be merged — *descdemog* and *descstatus* — are sorted by the BY variable — *subj* — in descending order:

[3] | [4] |

The two SORT procedures tell SAS to sort the *demog* and *status* data sets by the variable *subj*, and to store the results, respectively, in data sets called *descdemog* and *descstatus*. The keyword DESCENDING that precedes the *subj* variable in each SORT procedure tells SAS to sort the *subj* values in descending order — that is, with the largest *subj* value appearing first, and the smallest *subj* value appearing last.

The match-merge again takes place in the third DATA step, in which we see a MERGE statement and a BY statement. The DATA step tells SAS that we want to match-merge the data sets appearing in the MERGE statement (*descdemog* and *descstatus*), by the variable appearing in the BY statement (*subj*), and to store the result in a new data set called *descpatients*. Note that, to accomplish the match-merge, we have to place the DESCENDING keyword in the BY statement before the *subj* variable, so that SAS knows that the input data sets (*descdemog* and *descstatus*) are sorted in descending order.

Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the data set *descpatients* contains five observations created from the observations in *descdemog* and *descstatus*, and arranged in descending order of the *subj* variable.

**Example 16.3.** The following program illustrates a match-merge in which the data sets to be merged contain a different number of observations, and each observation in the first data set matches with no more than one observation in the second data set. Specifically, the *newdemog* data set contains six observations — one for each *subj* 1000, 1001, 1002, 1003, 1004, and 1005 — while the *status* data set contains just five observations — one for each *subj* 1000, 1001, 1002, 1003, and 1004:

[5] | [6] |

Again, the first two DATA steps just tell SAS to read in the *newdemog* and *status* data sets. The third DATA step tells SAS to match-merge the *newdemog* and *status* data sets by the *subj* variable, and to store the result in a new data set called *newpatients*. Although the *newdemog* and *status* data sets contain different numbers of observations, SAS continues to merge the two data sets until it runs out of observations. Where necessary, SAS fills variables with missing values.

Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the merged data set *newpatients* contains six subjects. SAS fills the variables in the *newpatients* data set with values based on the variables' values in the data set from which they came. Because subject 1005 does not appear in the* status *data set, SAS assigns a missing value to the variables *disease* and *test* for subject 1005.

**Example 16.4. **In the examples we've looked at so far, the variables in the data sets to be merged were unique. That is, the data sets did not share any common variable names other than the variable that linked the data sets together. The following program illustrates how SAS merges two data sets when there are common variables across the data sets besides the linking variable(s). Specifically, in addition to the linking variable *subj*, the *moredemog* and *morestatus* data sets share a variable called *v_date* (for visit date):

[7] | [8] |

Again, the first two DATA steps just tell SAS to read in the *moredemog* and *morestatus* data sets. As you can see, the *moredemog *and *morestatus* data sets have in common not only the linking variable, *subj*, but also the *v_date* variable. So, when the two data sets are merged into a data set called *morepatients*, as happens in the third DATA step, the variable *v_date* gets its values from the *morestatus* data set. That's because *morestatus* appears last in the MERGE statement. So, the values of *v_date* from* moredemog *get over-written in the program data vector by the values of *v_date* in *morestatus*. Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the values for *v_date* in the merged data set *morepatients* are the same as the values for *v_date* in the *morestatus* data set.

So, in general, if data sets share common variable names, the variable in the merged data set takes its value from the data set appearing last in the MERGE statement. To reinforce the point, the following data step merges data sets *morestatus* and *moredemog* again, but this time with *moredemog *appearing after *morestatus* in the MERGE statement:

[9] | [10] |

Here, the variable *v_date* in the *morepatients2* data set gets its value from the *moredemog* data set. That's because *moredemog* appears last in the MERGE statement. Therefore, the values of *v_date* from* morestatus *get over-written in the program data vector by the values of *v_date* in *moredemog*. Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that this time the values for *v_date* in the merged data set *morepatients2 * are the same as the values for *v_date* in the *moredemog* data set.

**Example 16.5. **The following program illustrates a match-merge in which the observations in the first data set to be merged matches one or more observations in the second data set to be merged. Case in point, the *salesone* data set contains two observations for the 2004 year, while the *salestwo* data set contains three observations for the 2004 year:

[11] | [12] |

Again, the first two DATA steps just tell SAS to read in the *salesone* and *salestwo* data sets. Note that the *salesone *data set* *contains two 2004 years, while the *salestwo* data set contains three 2004 years. And, although the *salesone* data set contains a 2005 year, the *salestwo *data set contains no 2005 year. The third DATA step tells SAS to match-merge the *salesone* and *salestwo* data sets by the *year* variable, and to store the result in a new data set called *allsales*. When you launch and run this program, you should get the following output:

Read what follows slowly and carefully, as it gets messy! As you can see, the last three observations for the 2006, 2007, and 2008 years are obtained simply:

- by merging the 2006 observation in
*salesone*with the 2006 observation in*salestwo*, - by merging the 2007 observation in
*salesone*with the 2007 observation in*salestwo*, - and by merging the 2008 observation in
*salesone*with the 2008 observation in*salestwo*.

Note that because the *sales* variable is common to both the *salesone* and* salestwo *data sets, its values in the merged *allsales* data set comes from *salestwo*,* *the data set appearing last in the MERGE statement.

Let's now focus on the more difficult years of 2004 and 2005. Recall that the *salesone* data set contains two 2004 years and one 2005 year, while the* salestwo* data set contains three 2004 years and no 2005 years. The first two observations are again obtained simply enough:

- by merging the first 2004 observation in the
*salesone*data set with the first 2004 observation in the*salestwo*data set, - and by merging the second 2004 observation in the
*salesone*data set with the second 2004 observation in the*salestwo*data set*.*

Again, because the *sales* variable is common to both the *salesone* and* salestwo *data sets, its values in the merged *allsales* data set comes from *salestwo*,* *the data set appearing last in the MERGE statement.

Now, for the messiest observation of all! SAS creates the third observation for 2004 by first getting what it can from the *salesone* data set — it gets *year* 2004, *prd* 2 (the last known value of it), and *sales* 200 (the last known value of it). Then, it gets what it can from the *salestwo* data set — it gets *year* 2004, *loc *9, and *sales* 900 (and thereby writing over the value of 200 it obtained from the *salesone* data set).

Similarly, SAS creates the fourth observation for 2005 by first getting what it can from the *salesone* data set* * — it gets *year* 2005, *prd* 3, and *sales* 300. Then, SAS gets what it can from the *salestwo * data set —which is nothing. Hence, *loc* is assigned a missing value for 2005.

So, do you get it? What do you think the data set would look like if you merged in the reverse order? The following DATA step tells SAS to merge *salestwo* and *salesone* by *year* and to store the result in a new data set called *allsales2*:

[13] | [14] |

Launch and run this program. Then, see if you can justify each of the values obtained in the resulting output:

**One closing comment**. Note that although each of our match-merging examples involved combining just two data sets, you can specify any number of data sets in a MERGE statement.

The match-merge examples we looked at in the previous section started first with the simplest case and then increased in complexity. The truth is that match-merging can be even more complex than we've seen. For example, we haven't yet investigated the situation in which we merge by more than one variable at a time. It is because of this potential complexity that you will be best served by understanding how the DATA step performs match-merges. Only then will you be able to predict the results of, and therefore perform, match-merges correctly.

In this section, we'll revisit two examples from the previous section first Example 16.4 and then Example 16.5. Our focus this time won't be on how to perform the match-merge, but rather to see how SAS performs the match-merge.

**Example 16.6**. The following program is identical to the first program that appears in Example 16.4:

[15] | [16] |

You might want to launch and re-run the program, so that you can review the output again to reacquaint yourself with the issues of this particular match-merge. Specifically, recall that, in addition to the linking variable *subj*, the *moredemog* and *morestatus* data sets share a variable called *v_date*. In this case, the variable *v_date* gets its values from the *morestatus* data set, since it is the last data set to appear in the MERGE statement.

Our focus this time is going to be to work our way through the program to see how SAS match-merges the *moredemog* and *morestatus* data sets to get the *morepatients* data set. As always, at the end of the compile phase, SAS will have created a program data vector containing the variables from the *moredemog* and *morestatus* data sets in the order in which they appear in the DATA step:

As always, SAS starts the execution phase having initialized the values of all of the data set variables in the program data vector to missing. Then, SAS looks at the first "**BY group**" in each data set to determine which BY group should appear first. In this case, the first BY group, that is, observations for which *subj* = 1000, is the same in both data sets. So, SAS reads the first observation from the *moredemog* data set:

and then SAS reads the first observation from the *morestatus* data set:

As you can see, the value of *v_date* from the *moredemog* data set (03/10/96) gets written over by the value of *v_date* from the *morestatus* data set (03/17/96). Having no more data sets from which to read, SAS writes the first observation to the *morepatients* data set, and **retains the values in the program data vector**. (If the program data vector contained variables created in the DATA step, SAS would set them to missing after writing their values to the new data set.)

**SAS then looks in each data set for a second observation** in the *subj* = 1000 BY group. **Neither data set has one**. Therefore,** SAS sets all of the values in the program data vector to missing and begins processing the next BY group**:

SAS reads the second observation from the *moredemog* data set:

and then SAS reads the second observation from the *morestatus* data set:

Again, the value of *v_date* from the *moredemog* data set (02/19/96) gets written over by the value of *v_date* from the *morestatus* data set (03/01/96). Having no more data sets from which to read, SAS writes the second observation to the *morepatients* data set, and retains the values in the program data vector.

SAS then looks in each data set for another observation in the *subj* = 1001 BY group. Because neither data set has one, SAS sets all of the values in the program data vector to missing and begins processing the next BY group. Shall we stop this seemingly endless cycle? You should have the basic idea now... SAS continues processing observations as just described until it exhausts all of the observations in both data sets.

**Example 16.7**. The following program is identical to the first program that appears in Example 16.5:

[17] | [18] |

Again, you might want to launch and re-run the program, so that you can review the output again to reacquaint yourself with the issues of this particular match-merge. Specifically, recall that the observations in the first data set to be merged matches one or more observations in the second data set to be merged. For example, the *salesone* data set contains two observations for the 2004 year, while the *salestwo* data set contains three observations for the 2004 year.

Let's work our way through this match-merge. At the end of the compile phase, SAS will have created a program data vector containing the variables from the *salesone* and *salestwo *data sets in the order in which they appear in the DATA step:

Again, SAS starts the execution phase having initialized the values of all of the data set variables in the program data vector to missing. Then, SAS looks at the first "**BY group**" in each data set to determine which BY group should appear first. In this case, the first BY group, that is, observations for which *year* = 2004, is the same in both data sets. So, SAS reads the first observation from the *salesone* data set:

and then SAS reads the first observation from the *salestwo* data set:

You now shouldn't be surprised to see that the value of *sales* from the *salesone* data set (100) gets written over by the value of *sales* from the *salestwo* data set (700). Having no more data sets from which to read, SAS writes the first observation to the *allsales* data set, and retains the values in the program data vector. (Again, if the program data vector contained variables created in the DATA step, SAS would set them to missing after writing their values to the new data set.)

SAS then looks in each data set for a second observation in the *year* = 2004 BY group. Both data sets have one. Therefore, SAS reads the second observation from the *salesone* data set:

The value 7 stored in the *loc* variable is a remnant from the previous iteration of the DATA step. It doesn't get written over until SAS reads the second observation from the *salestwo* data set:

Again, the value of *sales* from the *salesone* data set (200) gets written over by the value of *sales* from the *salestwo* data set (800). Having no more data sets from which to read, SAS writes the second observation to the *allsales* data set, and retains the values in the program data vector.

SAS then looks in each data set for another observation in the *subj* = 2004 BY group. The *salesone* data set has no more, but the *salestwo* data set has one more. So, SAS retains the value of the *prd* variable from the previous iteration and reads the next observation from the *salestwo* data set:

Again, the value of *sales* that was retained from the last iteration (800) gets written over by the value of *sales* from the *salestwo* data set (900). Having no more data sets from which to read, SAS writes the third observation to the *allsales* data set, and retains the values in the program data vector.

SAS then looks in each data set for another observation in the *year* = 2004 BY group. Neither data set has one. Therefore,** **SAS sets all of the values in the program data vector to missing and begins processing the next BY group:

When is this circus going to end?! Soon... bear with us! Okay... so SAS determines that the *year* = 2005 is the next BY group and that *salesone* has an observation for that BY group, while *salestwo* does not. SAS reads the available observation from the *salesone* data set:

and having no more observations to read, SAS writes the contents of the program data vector to the fourth observation in the *allsales* data set.

SAS then looks in each data set for another observation in the *year* = 2005 BY group. Neither data set has one. Therefore,** **SAS sets all of the values in the program data vector to missing and begins processing the next BY group:

Okay, let's cry uncle! At this point, it really is a straightforward trek to the finish line provided you follow the general guidelines, which are probably worth summarizing once and for all:

- At the beginning of the execution phase, set the values in the program data vector to missing.
- Determine the first BY group, and read all of the observations from all of the data sets having the BY group.
- Write the contents of the program data vector to the output data set.
- If there are more observations in the BY group, retain the values in the program data vector.
- If there are no more observations in the BY group, set all of the values in the program data vector to missing and begin processing the next BY group.
- Continue processing observations as described until all of the observations in all of the input data sets are exhausted.

Using these guidelines, you should be able to predict the contents of just about any match-merge you encounter.

As is the case when combining data sets by other methods, data set options, such as RENAME=, IN=, DROP=, and KEEP=, can be used when match-merging data sets. In this section, we'll look at an example that uses the RENAME= option to rename variable names that are shared by the data steps to be merged.

**Example 16.8**. The following program uses the RENAME= option to rename the *v_date* variables in the *demogtwo* and *statustwo* data sets, so that when they are merged into a new data set called *patientstwo*, both visit dates are preserved:

[19] | [20] |

When reviewing the first two DATA steps, in which we tell SAS to read in the *demogtwo *and *statustwo* data sets, note that both of the data sets contain a date variable called *v_date*. The third DATA step tells SAS to merge the *demogtwo *and *statustwo* data sets by the *subj* variable, and when doing so change the name of the *v_date* variable in the *demogtwo* data set to *demogdate* and the name of the *v_date* variable in the *statustwo* data set to *statusdate*. Because of this renaming that takes place, rather than the program data vector looking like this:

it looks like this:

Therefore, the merge reduces to a simple match-merge in which all of the values in the input data sets have a rightful position in the program data vector and are therefore preserved.

Launch and run the program, and review the output to convince yourself that the *demogtwo* and *statustwo* data sets are merged by *subj* successfully, and that the values in each input data set are preserved in the output data set *patientstwo*.

By default, when match-merging, the DATA step combines all of the observations in all of the input data sets. There may be situations, however, in which we want SAS to select only those observations for which a match exists in all of the input data sets. This is a rather common thing to do when, for instance, one of the data sets contains a master list of patients, say, and the other data sets contain information on actions taken by the patients. When we merge the data sets, we'd only want to include those patients who appear both in the master list data set *and* in the action data set. For example, suppose we have a data set containing a master list of *patients*:

and a data set called *allvoids* containing information about some patients' urinary voiding habits (*void_no* and *volume*) for a given visit date (*v_date*):

Note that the *patients* data set contains three *id* numbers (110011, 110012, and 110013), whereas the *allvoids* data set contains information on four *id* numbers (110011, 110012, 110013, and 110014). Because the *patients* data set is viewed as the master list of patients to include in any statistical analysis, we would want to merge the *patients* and *allvoids* data set in such a way to ensure that only those patients whose *id* appears in both input data sets are included in the output *analysis* data set:

Fortunately, telling SAS to exclude unmatched observations is as simple as using the IN= data set option along with a subsetting IF statement. Specifically:

- We use the IN= DATA step option to create a variable that indicates whether the data set contributed to the current observation.
- Then, we use a subsetting IF statement to check the IN= values and to write to the output (merged) data set only those observations that appear in the data sets for which IN= specified.

Let's try this approach out on our *patients* and *allvoids* data sets!

**Example 16.9.** The following program simply reads in the *patients* and *allvoids* data sets:

Launch and run the program, so that we can work with the data sets. Note that the two data sets are linked by the variables *id* and *v_date*. The *patients* data set contains one observation for each of three patients. And, the *allvoids* data set contains multiple observations for each of four patients. Incidentally, in case you are curious, each observation in the *allvoids* data set corresponds to a patient's reported void of urine. For example, subject 110011 reported four voids on 01/01/06, subject 110012 reported three voids on 01/02/06, and so on.

Now, if we wanted to analyze the voiding data, we'd have to make sure that we didn't include any data from patients not included in the *patients* data set. That is, we'd want to to exclude the voiding data corresponding to subject 110014. If we merge the *patients* and *allvoids* data sets by the *id *and *v_date* variables, we get an *analysis* data set that has one observation for each patient's reported void:

[22] | [23] |

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the data sets were merged as described. You might also want to take particular note that this is the first example we've seen in which we have specified two variables in the BY statement. There is nothing that much more difficult about it... other than making sure that both of the input data sets are sorted by both the *id* and *v_date* variables... they are.

In reviewing the output, you should also note that we have not yet achieved what we set out to do, namely to create an *analysis *data set that contains only the voiding data for the patients appearing in the *patients* data set. That is, the *analysis* data set still includes the voiding data on subject 110014. The following code uses the IN= option and a subsetting IF statement to help us accomplish our task:

[24] | [25] |

The IN = *inpatients* option tells SAS to assign a value of 1 to the *inpatients* variable when an observation from the *patients* data set contributes to the current observation. Likewise, the IN = *inallvoids* option tells SAS to assign a value of 1 to the *inallvoids* variable when an observation from the *allvoids* data set contributes to the current observation. The subsetting IF statement tells SAS to write only those observations to the *analysis* data set whose value for both *inpatients* and *inallvoids* is 1, that is, only those observations that were created from observations in both the *patients* and *allvoids* data sets. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the *analysis* data set now contains only the voiding data for the patients appearing in the *patients* data set.

It is often the case that input data sets contain many more variables than you want in your final merged data set. Just as you can use the DROP= and KEEP= DATA step options to drop or keep a subset of variables when reading SAS data sets, you can use the DROP= and KEEP= options to drop or keep a subset of variables when merging SAS data sets. Let's take a look at an example!

**Example 16.10**. The following program prints a subset of the observations in the ICDB Study's background (*back* [26]), pain and urgency (*purg* [27]), and family history (*fhx* [28]) data sets:

[29] | [30] |

In each case, the PRINT procedure's OBS=5 option tells SAS to print just the first five observations of each data set. The second PRINT procedure contains an extra direction to SAS to print only those observations for which the *v_type* variable equals 0. Before you can run the program, you'll need to right-click the data set links... *back [26]*, * purg [27]*, and *fhx [28]*... so that you can save the data sets to a convenient location on your computer. After doing so, launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the data sets. Then, run the program, and review the output so that you can familiarize yourself with each data set.

As you can see, the data sets share three variables — *subj*, *v_type*, and *v_date*, and then each data set has its own set of unique variables. Our interest in this example is to merge the three data sets into a new data set called *icdbdata* that contains the common *subj* variable, the *b_date* and *sex* variables from the *back* data set, the *purg_1* and *purg_2* variables from the *purg* data set, and the *fhx_1* and *fhx_2* variables from the *fhx* data set. That is, when all is said and done, this is what the first ten observations of the *icdbdata* set should look like:

The following code does the trick for us:

[31] | [32] |

At a global level, the DATA step tells SAS to merge the (permanent) *back*, *purg*, and *fhx *data sets by the *subj* variable. The data set options give SAS more specific directions about what to do while match-merging the data sets. Specifically:

- The KEEP= option attached to the
*icdb.back*data set tells SAS to keep just three variables —*subj*,*b_date*, and*sex*— from the*icdb.back*data set. - The WHERE= option attached to the
*icdb.purg*data set tells SAS to select the observations for which the*v_type*variable equals 0. - The KEEP= option attached to the
*icdb.purg*data set tells SAS to keep four variables —*subj*,*v_type*,*purg_1*, and*purg_2*— from the*icdb.purg*data set. Although we don't want the*v_type*variable in the final*icdbdata*data set, we still need to tell SAS to keep the variable just so we can use it to select the observations for which*v_type*equals 0. - The KEEP= option attached to the
*icdb.fhx*data set tells SAS to keep just three variables —*subj*,*fhx**_1*, and*fhx_2*— from the*icdb.purg*data set. - Finally, the DROP= option attached to the
*icdbdata*output data set tells SAS to drop the*v_type*variable from the output data set.

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the *icdbdata* data set contains only the variables and observations that we desired.

In this lesson, we learned how to match-merge two or more SAS data sets.

The homework for this lesson will give you practice with this technique.

**Links:**

[1] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0401.sas

[2] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0401_output.gif

[3] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0402.sas

[4] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0402_output.gif

[5] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0403.sas

[6] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0403_output.gif

[7] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0404A.sas

[8] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0404A_output.gif

[9] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0404B.sas

[10] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0404B_output.gif

[11] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0405A.sas

[12] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0405A_output.gif

[13] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0405B.sas

[14] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0405B_output.gif

[15] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0406.sas

[16] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0406_output.gif

[17] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0407.sas

[18] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0407_output.gif

[19] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0408.sas

[20] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0408_output.gif

[21] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0409A.sas

[22] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0409B.sas

[23] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0409B_output.gif

[24] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0409C.sas

[25] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0409C_output.gif

[26] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/back.sas7bdat

[27] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/purg.sas7bdat

[28] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/fhx.sas7bdat

[29] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0410A.sas

[30] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0410A_output.gif

[31] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sasndata/sas_L0410B.sas

[32] https://onlinecourses.science.psu.edu/stat481/sites/onlinecourses.science.psu.edu.stat481/files/lesson04/sas_L0410B_output.gif