15.3 - Concatenating Two or More Data Sets

Printer-friendly versionPrinter-friendly version

To concatenate two or more SAS data sets means to stack one "on top" of the other into a single SAS data set. For example, suppose the data set store1 contains three variables, store (number), day (of the week), and sales (in dollars):

Image of store1 data set

and the data set store2 contains the same three variables:

Image of store2 data set

Then, when we concatenate the two data sets, we get what I like to call a "tall" data set:

Image of concatenated data sets

in which the data sets are stacked on top of each other. Note that the number of observations in the new data set is the sum of the numbers of observations in the original data sets. To concatenate SAS data sets, you simplify specify a list of data set names in one SET statement.

Example 15.6. The following program concatenates the store1 and store2 data sets to create a new "tall" data set called bothstores:

Note that the input data sets — store1 and store2 — contain the same variables — Store, Day, and Sales — with identical attributes. In the third DATA step, the DATA statement tells SAS to create a new data set called bothstores, and the SET statement tells SAS that the data set should contain first the observations from store1 and then the observations from store2. Note that although we have specified only two input data sets here, the SET statement can contain any number of input data sets.

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that SAS did indeed concatenate the store1 and store2 data sets to make one "tall" data set called bothstores. You might then want to edit the SET statement so that store1 follows store2, and re-run the SAS program to see that then the contents of store1 follow the contents of store2 in the bothstores data set.

In general, a data set that is created by concatenating data sets contains all of the variables and all of the observations from all of the input data sets. Therefore, the number of variables the new data set contains always equals the total number of unique variables among all of the input data sets. And, the number of observations in the new data set is the sum of the numbers of observations in the input data sets. Let's return to the contrived example we've used throughout this lesson.

Example 15.7. The following program concatenates the one and two data sets to create a new "tall" data set called onetopstwo:

As you review the first two DATA steps, in which SAS reads in the respective one and two data sets, note that the total number of unique variables is four — ID, VarA, VarB, and VarC. The total number of observations among the two input data sets is 3 + 2 = 5. Therefore, we can expect the concatenated data set onetopstwo to contain four variables and five observations. Launch and run the SAS program, and review the output to convince yourself that SAS did grab first all of the variables and all of the observations from the one data set and then all of the variables and all of the observations from the two data set. As you can see, to make it all work out okay, observations arising from the one data set have missing values for VarC, and observations from the two data set have missing values for VarA.

When Variable Attributes Differ

As you know, variable attributes include the type of variable (character vs. numeric), the informat (how the variable is read in) and format (how its values are printed) of a variable, the length of the variable, and the label (how its variable name is printed) of a variable. Concatenating data sets when variable attributes differ across the input data sets may pose problems for SAS (and therefore you):

  • If the data sets you name in the SET statement contain variables with the same names and types, you can concatenate the data sets without modification.
  • If the variable types differ, you must modify one or more of the data sets before concatenating them. SAS will not concatenate the data sets until you do.
  • If the lengths, formats, informats or labels differ, you may want to modify one or more of the data sets before concatenating them. SAS will concatenate the data sets; you may just not like the results.

Example 15.8. If a variable is defined as numeric in one data set named in the SET statement and as character in another data set, SAS issues an error message and will not concatenate the data sets. The following program attempts to concatenate the store3 and store4 data sets, when the Store variable is defined as character in the store3 data set, but as numeric in the store4 data set:

Launch and run the SAS program, and review the log window. You can see that SAS balks at the idea of concatenating the two data sets and gives us a "dead in the water" error message:

In order to concatenate the store3 and store4 data sets successfully, we need to change either the Store character variable in the store3 data set to numeric or the Store numeric variable in the store4 data set to character. As you know from our work in Stat 480, to perform an explicit character-to-numeric conversion of the Store variable in the store3 data set, we'd have to use the INPUT function. Alternatively, we could use the PUT function to perform an explicit numeric-to-character conversion of the Store variable in the store4 data set. That's what the following DATA step does:

If you launch and run the SAS program, and review the output from the CONTENTS procedures, you can see that now the Store variable in the store3 data set is defined as character, as is the Store variable in the store4 data set.

How does the conversion work? Well, of course, it's all because of the PUT function. The general form of the PUT function is:

        PUT(source,format)

where source is a numeric variable, constant or expression to be converted to a character value, and format is a valid format that matches the data type of the source variable. In our example, Store is the numeric variable that we want to convert to a character variable, and since the values for Store are one-digit numbers, the numeric format 1. is an appropriate format to specify.

The assignment statement in our program:

        Store2 = put(Store, 1.);

thus tells SAS to convert the numeric values in Store to character values and to store the results in a character variable called Store2. A little detail issue here ... if we do not use a different variable name than Store on the left side of the equation, SAS will be forced to do an automatic numeric-to-character conversion rather than the explicit numeric-to-character conversion that we desire. The one other statement in the DATA step, as well as the RENAME= data set option, is necessary just because of this naming issue. The DROP statement tells SAS to drop the Store variable, since we now have the desired character variable Store2. And, the RENAME= option tells SAS to rename the Store2 variable back to Store, just so it matches the original name of the variable.

Now that we've seen both the PUT function and INPUT function in action, it pays to summarize:

  • To perform an explicit numeric-to-character conversion, use the PUT(source, format) with a numeric source and a numeric format.
  • To perform an explicit character-to-numeric conversion, use the INPUT(source, informat) function with a character source and a numeric informat.

To remember which function requires a format versus an informat, note that the INPUT function requires and informat.

Ahhh... but back to the task on hand. Finally, the following program allows us to accomplish our original goal of concatenating the store3 and store4 data sets:

Simple enough! Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that SAS successfully concatenated store3 and store4 into a data set called bothstores2.

Now that we've taken a look at what we need to do when we want to concatenate data sets containing the same variables but with different variable types, let's address a few more attributes. When data sets are concatenated with the SET statement, the formats, informats, and labels associated with the variables in the new data set are determined by the following rule:

If two or more data sets define different formats, informats, or labels for the same variable, the variable in the new data set takes the attribute from the first data set in the SET statement that contains the variable with the attribute.

Example 15.9. The following program creates two data sets — store5 and store6 — that intentionally contain different labels and different formats for the Sales variable:

As you can see, the first DATA step tells SAS to set the format and label for the Sales variable as comma5. and Total Sales, respectively. And, the second DATA step tells SAS to set the format and label for the Sales variable as dollar6. and Sales for Day, respectively. Launch and run the SAS program, and review the output from the CONTENTS procedures to convince yourself that the formats and labels for the two versions of the Sales variable have been assigned as described.

Now, the following program tells SAS to store first the observations from the store5 data set, followed by the observations from the store6 data set, into a new data set called bothstores3:

Recall the rule that if two data sets explicitly define different formats and labels for the same variable, then the variable in the new data set takes the attribute from the first data set in the SET statement that explicitly defines the attribute. That said, we should expect the Sales variable in the bothstores3 data set to take on the attributes of Sales in the store5 data set. Specifically, Sales should be formatted as comma5. and labeled as Total Sales. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the concatenation behaved as expected.

The following program reverses the order of the concatenation of the store5 and store6 data sets. That is, the program tells SAS to store first the observations from the store6 data set, followed by the observations from the store5 data set, into a new data set called bothstores4:

According to the rule, we should expect the Sales variable in the bothstores4 data set to take on the attributes of Sales in the store6 data set. Specifically, Sales should be formatted as dollar6. and labeled as Sales for Day. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the concatenation behaved as expected.

And, now onto one last attribute, namely that of variable lengths. When the SET statement is used to concatenate data sets in which the same variable is assigned different lengths, the lengths of the variables in the new data set are determined by the following rule:

If two or more data sets have different lengths for the same variable, the variable in the new data set takes the length of the variable from the data set that appears first in the SET statement.

Example 15.10. The following program creates two data sets — store7 and store8 — that intentionally contain different lengths for the numeric variable Store and the character variable Day:

As you can see, the LENGTH statement in the first DATA step tells SAS to set the length for the numeric variable Store to 4 in the store7 data set. Since the character variable Day is not explicitly mentioned in a LENGTH statement, SAS sets the length of Day in the store7 data set to 3, the number of columns specified in the INPUT statement. And, since neither the numeric variable Store nor the character variable Day are explicitly mentioned in a LENGTH statement in the second DATA step, SAS sets the length of Store in store8 to 8 by default, and the length of Day to 1, the number of columns specified in the INPUT statement. Launch and run the SAS program, and review the output from the CONTENTS procedures to convince yourself that the lengths for the two versions of the Store and Day variables have been assigned as described.

Now, the following program tells SAS to store first the observations from the store7 data set, followed by the observations from the store8 data set, into a new data set called bothstores5:

Recall the rule that if two or more data sets have different lengths for the same variable, the variable in the new data set takes the length of the variable from the data set that appears first in the SET statement. That said, we should expect the Store variable in the bothstores5 data set to take on the attributes of Store in the store7 data set. Specifically, Store should be assigned a length of 4. Likewise, we should expect the Day variable in the bothstores5 data set to take on the attributes of Day in the store7 data set. Specifically, Day should be assigned a length of 3. Launch and run the SAS program, and review the output from the PRINT and CONTENTS procedures to convince yourself that the concatenation behaved as expected.

The following program reverses the order of the concatenation of the store7 and store8 data sets. That is, the program tells SAS to store first the observations from the store8 data set, followed by the observations from the store7 data set, into a new data set called bothstores6:

According to the rule, we should expect the Store variable in the bothstores6 data set to take on the attributes of Store in the store8 data set. Specifically, Store should be assigned a length of 8. Likewise, we should expect the Day variable in the bothstores6 data set to take on the attributes of Day in the store8 data set. Specifically, Day should be assigned a length of 1. Launch and run the SAS program, and review the output from the PRINT and CONTENTS procedures to convince yourself that the concatenation behaved as expected. You should note in particular how SAS is forced to truncate the values of the Day variable in order to get them to fit within their assigned length of 1.