# 6.1 How to Use Stratified Sampling
Unit Summary: stratified sampling; why use stratified sampling; estimating τ; estimating μ; confidence intervals

In stratified sampling, the population is partitioned into non-overlapping groups, called strata, and a sample is selected by some design within each stratum.

For example, geographical regions can be stratified into similar regions by means of some known variable such as habitat type, elevation, or soil type. Another example might be to determine the proportion of defective products assembled in a factory. In this case, sampling may be stratified by production line, factory, etc.

Can you think of a couple of additional examples where stratified sampling would make sense? Look for situations in which the measurements within each stratum are relatively homogeneous.

The principal reasons for using stratified random sampling rather than simple random sampling include:

1. Stratification may produce a smaller error of estimation than would be produced by a simple random sample of the same size. This result is particularly true if measurements within strata are very homogeneous.
2. The cost per observation in the survey may be reduced by stratification of the population elements into convenient groupings.
3. Estimates of population parameters may be desired for subgroups of the population. These subgroups should then be defined as the strata.

#### Example: Average Hours Watching TV Per Week

(See p.121 of Scheaffer, Mendenhall and Ott)

An advertising firm, interested in determining how much to emphasize television advertising in a certain county, decides to conduct a sample survey to estimate the average number of hours each week that households within that county watch television. The county has two towns, A and B, and a rural area, C. Town A is built around a factory, and most households contain factory workers with school-aged children. Town B contains mainly retirees, and the rural area C contains mainly farmers.

There are 155 households in town A, 62 in town B and 93 in the rural area, C. The firm decides to select 20 households from Town A, 8 households from Town B and 12 households from the rural area. The results are given in the following table:

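| Stratum | Hours of TV per week (sampled households) | Stratum size |
|---|---|---|
| Town A | 35, 43, 36, 39, 28, 28, 29, 25, 38, 27, 26, 32, 29, 40, 35, 41, 37, 31, 45, 34 | $N_1 = 155$ |
| Town B | 27, 15, 4, 41, 49, 25, 10, 30 | $N_2 = 62$ |
| Rural Area C | 8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24 | $N_3 = 93$ |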

Minitab was used to summarize the data from each stratum (N in the Minitab output denotes the number of observations in the stratum sample).

Usually, a sample is selected by some probability design from each of the L strata in the population, with selections in different strata independent of each other. The special case in which a simple random sample is drawn from each stratum is called a stratified random sample.
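The per-stratum summaries used later in this section (sample size, sample mean, and sample standard deviation) can also be computed directly from the table of observations. The following is a minimal sketch in Python, which is an assumption of this illustration (the course materials use Minitab and R); the names `samples` and `summary` are hypothetical labels.

```python
from statistics import mean, stdev

# Weekly TV hours per sampled household, copied from the table of observations.
samples = {
    "Town A": [35, 43, 36, 39, 28, 28, 29, 25, 38, 27,
               26, 32, 29, 40, 35, 41, 37, 31, 45, 34],
    "Town B": [27, 15, 4, 41, 49, 25, 10, 30],
    "Rural Area C": [8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24],
}

# For each stratum: n_h, sample mean, and sample standard deviation
# (stdev uses the n_h - 1 divisor, matching the definition of s_h).
summary = {
    name: {"n": len(y), "mean": mean(y), "sd": stdev(y)}
    for name, y in samples.items()
}

for name, s in summary.items():
    print(f"{name}: n = {s['n']}, mean = {s['mean']:.2f}, sd = {s['sd']:.2f}")
```

The rounded standard deviations should agree with the values 5.95, 15.25, and 9.36 used in the degrees-of-freedom calculation later in this section.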

#### Think About It!

Does it make sense to use a stratified random sample for this problem? Why or why not?


#### Notation

• $L$ = the number of strata
• $N_h$ = the number of units in stratum $h$
• $n_h$ = the number of units sampled from stratum $h$
• $N$ = the total number of units in the population, i.e., $N = N_1 + N_2 + \cdots + N_L$

For our "Watching TV" example, the values are:

$L = 3$, $N_1 = 155$, $N_2 = 62$, $N_3 = 93$, $N = 155 + 62 + 93 = 310$, with sample sizes $n_1 = 20$, $n_2 = 8$, $n_3 = 12$.

#### Estimating the Population Total

$\hat{\tau}_{st}=\sum\limits_{h=1}^L \hat{\tau}_h$

That is, the estimated stratum totals are added up, where $\hat{\tau}_h$ is an unbiased estimator of the stratum total $\tau_h$.

Since selections in different strata are independent, the variance is:

$Var(\hat{\tau}_{st})=\sum\limits_{h=1}^L Var(\hat{\tau}_h)$, and

$\hat{V}ar(\hat{\tau}_{st})=\sum\limits_{h=1}^L \hat{V}ar(\hat{\tau}_h)$

These formulas are evaluated differently depending on the sampling scheme within each stratum. For stratified random sampling, i.e., when a simple random sample is taken within each stratum:

$\hat{\tau}_h=N_h \bar{y}_h$

$\hat{V}ar(\hat{\tau}_{st})=\sum\limits_{h=1}^L N_h \cdot (N_h-n_h)\cdot \dfrac{s^2_h}{n_h}$

$s^2_h=\dfrac{1}{n_h-1}\sum\limits_{i=1}^{n_h}(y_{hi}-\bar{y}_h)^2$
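As an illustration, the three formulas above can be applied to the TV example. This is a minimal Python sketch under the assumption of stratified random sampling; the function name `stratified_total` and the `strata` list are hypothetical, introduced only for this illustration.

```python
from statistics import mean, variance

def stratified_total(strata):
    """Estimate the population total under stratified random sampling.

    strata: list of (N_h, sample) pairs, one per stratum.
    Returns (tau_hat_st, estimated variance of tau_hat_st).
    """
    tau_hat = 0.0
    var_hat = 0.0
    for N_h, y in strata:
        n_h = len(y)
        s2_h = variance(y)                       # sample variance, divisor n_h - 1
        tau_hat += N_h * mean(y)                 # tau_hat_h = N_h * ybar_h
        var_hat += N_h * (N_h - n_h) * s2_h / n_h
    return tau_hat, var_hat

# TV example: (stratum size, sampled weekly hours)
strata = [
    (155, [35, 43, 36, 39, 28, 28, 29, 25, 38, 27,
           26, 32, 29, 40, 35, 41, 37, 31, 45, 34]),
    (62, [27, 15, 4, 41, 49, 25, 10, 30]),
    (93, [8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24]),
]

tau_hat, var_tau = stratified_total(strata)
print(f"estimated total = {tau_hat:.2f}, estimated variance = {var_tau:.1f}")
```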

These formulas are easy to remember, and the estimates for the population mean follow directly from them:

$\hat{\mu}_{st}=\dfrac{\hat{\tau}_{st}}{N}$

$\hat{V}ar(\hat{\mu}_{st})=\dfrac{1}{N^2}\hat{V}ar(\hat{\tau}_{st})$

For stratified random sampling:

$\bar{y}_{st}=\dfrac{1}{N} \sum\limits_{h=1}^L N_h \bar{y}_h$

$\hat{V}ar(\bar{y}_{st})=\sum\limits_{h=1}^L \left(\dfrac{N_h}{N}\right)^2 \left(\dfrac{N_h-n_h}{N_h}\right) \dfrac{s^2_h}{n_h}$

Here $s_h$ is the sample standard deviation of stratum $h$, as given in the Minitab output.
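The weighted form of the estimator can be sketched in the same way. The Python snippet below is an illustration only, assuming stratified random sampling; `stratified_mean` and `strata` are hypothetical names, and the stratum weight $N_h/N$ is written as `W_h`.

```python
from statistics import mean, variance

def stratified_mean(strata):
    """Estimate the population mean under stratified random sampling.

    strata: list of (N_h, sample) pairs, one per stratum.
    Returns (ybar_st, estimated variance of ybar_st).
    """
    N = sum(N_h for N_h, _ in strata)
    ybar_st = 0.0
    var_hat = 0.0
    for N_h, y in strata:
        n_h = len(y)
        W_h = N_h / N                            # stratum weight N_h / N
        fpc = (N_h - n_h) / N_h                  # finite population correction
        ybar_st += W_h * mean(y)
        var_hat += W_h**2 * fpc * variance(y) / n_h
    return ybar_st, var_hat

# TV example: (stratum size, sampled weekly hours)
strata = [
    (155, [35, 43, 36, 39, 28, 28, 29, 25, 38, 27,
           26, 32, 29, 40, 35, 41, 37, 31, 45, 34]),
    (62, [27, 15, 4, 41, 49, 25, 10, 30]),
    (93, [8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24]),
]

ybar_st, var_mean = stratified_mean(strata)
print(f"estimated mean = {ybar_st:.3f}, estimated variance = {var_mean:.4f}")
```

Multiplying the estimated mean by $N$ and its variance by $N^2$ recovers the corresponding estimates for the total.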

#### Application Exercise

Consider the TV Watching example. Estimate the overall mean and the variance of the estimator of the mean. Also estimate the total and the variance of the estimator of the total.


#### Confidence Intervals

When the stratum sample sizes are small, an approximate 100(1-α)% CI for τ is:

$\hat{\tau}_{st} \pm t\sqrt{\hat{V}ar(\hat{\tau}_{st})}$

However, when all of the stratum sample sizes are at least 30, z can be used in place of t.

What are the degrees of freedom for the t used in this confidence interval? Intuitively, we would want this to be $(n_1-1) + (n_2-1) + \cdots + (n_L-1)$, and this is correct when the variances of all strata are the same. When this is not the case and the degrees of freedom cannot be pooled, we need to use the Satterthwaite approximation for the degrees of freedom:

$d=\left(\sum\limits_{h=1}^L a_h s^2_h\right)^2/\sum\limits_{h=1}^L \dfrac{(a_h s^2_h)^2}{(n_h-1)}$

where, $a_h=\dfrac{N_h(N_h-n_h)}{n_h}$

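In particular, when the $N_h$ are all equal, the $n_h$ are all equal, and the $s^2_h$ are all equal, the degrees of freedom reduce to $d = n - L$, where $n = n_1 + n_2 + \cdots + n_L$ is the total sample size.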

For the TV example:

$a_1=\dfrac{N_1(N_1-n_1)}{n_1}=\dfrac{155(155-20)}{20}=1046.25$

$a_2=\dfrac{N_2(N_2-n_2)}{n_2}=\dfrac{62(62-8)}{8}=418.5$

$a_3=\dfrac{N_3(N_3-n_3)}{n_3}=\dfrac{93(93-12)}{12}=627.75$

\begin{align}
d&= \dfrac{(a_1s^2_1+a_2s^2_2+a_3s^2_3)^2}{\dfrac{(a_1s^2_1)^2}{n_1-1}+\dfrac{(a_2s^2_2)^2}{n_2-1}+\dfrac{(a_3s^2_3)^2}{n_3-1}}\\
&= \dfrac{(1046.25\cdot(5.95)^2+418.5\cdot(15.25)^2+627.75\cdot(9.36)^2)^2}{\dfrac{(1046.25\cdot(5.95)^2)^2}{20-1}+\dfrac{(418.5\cdot(15.25)^2)^2}{8-1}+\dfrac{(627.75\cdot(9.36)^2)^2}{12-1}}\\
&=21.09\\
\end{align}
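The arithmetic above can be checked with a short script. This is a minimal Python sketch; the function name `satterthwaite_df` is hypothetical, and the standard deviations are the rounded values used in the calculation above, so the result is only accurate to about two decimal places.

```python
def satterthwaite_df(N, n, s):
    """Approximate degrees of freedom for a stratified random sample.

    N, n, s: per-stratum population sizes, sample sizes, and sample
    standard deviations (lists of equal length).
    """
    # a_h = N_h (N_h - n_h) / n_h
    a = [N_h * (N_h - n_h) / n_h for N_h, n_h in zip(N, n)]
    terms = [a_h * s_h**2 for a_h, s_h in zip(a, s)]
    numerator = sum(terms) ** 2
    denominator = sum(t**2 / (n_h - 1) for t, n_h in zip(terms, n))
    return numerator / denominator

# TV example, with the rounded s_h values from the worked calculation.
d = satterthwaite_df(N=[155, 62, 93], n=[20, 8, 12], s=[5.95, 15.25, 9.36])
print(f"d = {d:.2f}")
```

In practice, d is usually rounded down to a whole number before the t critical value is looked up.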

#### Application Exercise

Provide a 95% CI for μ and also a 95% CI for τ.

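One way to check an answer to this exercise is sketched below in Python. The critical value `T_CRIT = 2.080` is an assumption: it is the upper 2.5% point of the t distribution with 21 degrees of freedom (d = 21.09 rounded down), taken from a standard t table. The helper name `interval` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# TV example: (stratum size, sampled weekly hours)
strata = [
    (155, [35, 43, 36, 39, 28, 28, 29, 25, 38, 27,
           26, 32, 29, 40, 35, 41, 37, 31, 45, 34]),
    (62, [27, 15, 4, 41, 49, 25, 10, 30]),
    (93, [8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24]),
]

N = sum(N_h for N_h, _ in strata)

# Point estimate and estimated variance of the total (stratified random sampling).
tau_hat = sum(N_h * mean(y) for N_h, y in strata)
var_tau = sum(N_h * (N_h - len(y)) * variance(y) / len(y) for N_h, y in strata)

# Assumption: t critical value for 95% confidence with 21 degrees of freedom,
# read from a t table rather than computed here.
T_CRIT = 2.080

def interval(estimate, var_hat, crit):
    """Return (lower, upper) = estimate -/+ crit * sqrt(var_hat)."""
    half_width = crit * sqrt(var_hat)
    return estimate - half_width, estimate + half_width

ci_tau = interval(tau_hat, var_tau, T_CRIT)
ci_mu = interval(tau_hat / N, var_tau / N**2, T_CRIT)

print(f"95% CI for tau: ({ci_tau[0]:.1f}, {ci_tau[1]:.1f})")
print(f"95% CI for mu:  ({ci_mu[0]:.2f}, {ci_mu[1]:.2f})")
```

The interval for μ is the interval for τ divided by N, since both the estimate and its standard error scale by 1/N.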

### Using R

Here are the data file and the R code for this example:

• Data file: TVhour.txt
• R code: Chapter6_TVhour.R.txt