18.5 - Creating Samples

Because a DO loop executes statements iteratively, it provides an easy way to select a sample of observations from a large data set. Let's take a look at an example!

Example 18.12. The following program uses an iterative DO loop and the SET statement's POINT= option to select every 100th observation from the permanent data set called stat481.log11 that contains 8,624 observations:
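A sketch of such a program, pieced together from the walkthrough that follows, might look like this. The library path is a placeholder, and the output data set name sample and the bare PROC PRINT step are assumptions based on how the results are described later in this section:

```sas
/* Placeholder path (an assumption): point it at the folder that holds log11 */
LIBNAME stat481 'C:\path\to\your\data';

DATA sample;
   DO i = 100 to 8600 by 100;        /* i = 100, 200, ..., 8600 */
      set stat481.log11 point = i;   /* read observation number i directly */
      output;                        /* write that observation to sample */
   END;
   stop;                             /* end the DATA step; POINT= never reaches end-of-file */
RUN;

PROC PRINT data = sample;
RUN;
```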

Let's work our way through the code. The DO statement tells SAS to start the index variable i at 100, to increase it by 100 each time, and to end at 8600. That is, SAS executes the DO loop when i equals 100, 200, 300, ..., 8600.

Now the SET statement contains an option that we've not seen before, namely the POINT= option. The POINT= option tells SAS not to read the stat481.log11 data set sequentially as is done by default, but rather to read the observation number specified by the POINT= option directly from the data set. For example, when i = 100, and therefore POINT = 100, SAS reads the 100th observation in the stat481.log11 data set. And when i = 3200, and therefore POINT = 3200, SAS reads the 3200th observation in the stat481.log11 data set.
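To see direct access in isolation, here is a minimal sketch that reads just one observation. It assumes the stat481 libref has already been assigned, and the names one_obs and obsnum are made up for illustration. Note that POINT= must name a variable rather than a literal number, and that this variable is not written to the output data set:

```sas
/* Assumes the stat481 libref is already assigned by a LIBNAME statement */
DATA one_obs;                          /* hypothetical output data set name */
   obsnum = 3200;                      /* POINT= needs a variable, not a literal */
   set stat481.log11 point = obsnum;   /* read observation 3200 directly */
   output;                             /* write it to one_obs */
   stop;                               /* needed whenever POINT= is used */
RUN;
```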

The OUTPUT statement, of course, tells SAS to write the selected observation to the output data set. If we placed the OUTPUT statement after the END statement rather than within the DO loop, the resulting data set would contain only one observation, namely the last observation read into the program data vector (here, observation 8600).
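As a quick way to convince yourself of this, the following sketch moves the OUTPUT statement so that it comes after the END statement. The data set name last_only is hypothetical, and the stat481 libref is assumed to be assigned already:

```sas
DATA last_only;                      /* hypothetical output data set name */
   DO i = 100 to 8600 by 100;
      set stat481.log11 point = i;   /* each read overwrites the previous one */
   END;
   output;                           /* runs once, after the loop has finished */
   stop;
RUN;
```

Each pass through the loop replaces the contents of the program data vector, so by the time OUTPUT executes only observation 8600 remains to be written.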

The STOP statement, which is new to us, is necessary because we are using the POINT= option. As you know, the DATA step by default continues to read observations until it reaches the end-of-file marker in the input data. Because the POINT= option reads only specified observations, SAS cannot read an end-of-file marker as it would if the file were being read sequentially. The STOP statement tells SAS to stop processing the current DATA step immediately and to resume processing statements after the end of the current DATA step. It is the use of the STOP statement, therefore, that keeps us from sending SAS into the no-man's land of continuous looping.

Now, download the stat481.log11 data set and save it in a convenient location on your computer. Open the program in SAS, and edit the LIBNAME statement so that it reflects the location in which you saved the data set. Then, run the program and review the output from the PRINT procedure to see the selected observations. You shouldn't be surprised to see that the sample data set contains 86 observations, as the iterative DO loop executes 8600 divided by 100, or 86 times.
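If you would rather not hard-code the upper bound of 8600, one possible variation uses the SET statement's NOBS= option, which stores the number of observations in the input data set in a variable when the DATA step is compiled. The names sample2 and total below are made up for illustration:

```sas
DATA sample2;                                    /* hypothetical output data set name */
   DO i = 100 to total by 100;                   /* total is set at compile time by NOBS= */
      set stat481.log11 point = i nobs = total;
      output;
   END;
   stop;
RUN;
```

Assuming stat481.log11 does hold 8,624 observations, the last value of i that does not exceed total is 8600, so this version should also select 86 observations. Like the POINT= variable, the NOBS= variable is not written to the output data set.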

Note! It is important to emphasize that the method we illustrated here for selecting a sample from a large data set has nothing random about it. That is, we selected a patterned sample, not a random sample, from a large data set. That's why this section is called Creating Samples, not Creating Random Samples. We'll learn how to select a random sample from a large data set in Stat 482.