Published on *STAT 414 / 415* (https://onlinecourses.science.psu.edu/stat414)

The methods of the last page, in which we derived a formula for the sample size necessary for estimating a population proportion *p*, work just fine when the population in question is very large. When we have smaller, finite populations, however, such as the students in a high school or the residents of a small town, the formula we derived previously requires a slight modification. Let's start, as usual, by taking a look at an example.

A researcher is studying the population of a small town in India of *N* = 2000 people. She's interested in estimating *p* for several yes/no questions on a survey.

How many people *n* does she have to randomly sample (without replacement) to ensure that her estimates \(\hat{p}\) are within *ε* = 0.04 of the true proportions *p*?

**Solution.** We can't even begin to answer this question until we derive a confidence interval for a proportion of a small, finite population!

**Theorem.** An approximate (1−*α*)100% confidence interval for a proportion *p* of a small population is:

\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} \cdot \dfrac{N-n}{N-1}}\)

**Proof.** We'll use the example above, where possible, to make the proof more concrete. Suppose we take a random sample, *X*_{1}, *X*_{2}, ..., *X*_{n}, without replacement, of size *n* from a population of size *N*. In the case of the example, *N* = 2000. Suppose also, unknown to us, that for a particular survey question there are *N*_{1} respondents who would respond "yes" to the question, and therefore *N*−*N*_{1} respondents who would respond "no."

If that's the case, the true proportion (but unknown to us) of yes respondents is:

\(p=P(Yes)=\dfrac{N_1}{N}\)

while the true proportion (but unknown to us) of no respondents is:

\(1-p=P(No)=1-\dfrac{N_1}{N}=\dfrac{N-N_1}{N}\)

Now, let *X* denote the number of respondents in the sample who say yes, so that:

\(X=\sum\limits_{i=1}^n X_i\)

where *X*_{i} = 1 if respondent *i* answers yes, and *X*_{i} = 0 if respondent *i* answers no. Then, the proportion in the sample who say yes is:

\(\hat{p}=\dfrac{\sum\limits_{i=1}^n X_i}{n}\)

Then, \(X=\sum\limits_{i=1}^n X_i\) is a hypergeometric random variable with mean:

\(E(X)=n\dfrac{N_1}{N}=np\)

and variance:

\(Var(X)=n\dfrac{N_1}{N}\left(1-\dfrac{N_1}{N}\right)\left(\dfrac{N-n}{N-1}\right)=np(1-p)\left(\dfrac{N-n}{N-1}\right)\)

It follows that \(\hat{p}=X/n\) has mean \(E(\hat{p})=p\) and variance:

\(Var(\hat{p})=\dfrac{p(1-p)}{n}\left(\dfrac{N-n}{N-1}\right)\)
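The mean and variance formulas above can be checked numerically. The following is a minimal sketch, assuming a hypothetical population of *N* = 50 with *N*_{1} = 20 yes respondents and a sample of *n* = 10 (these numbers are not from the example and are chosen only to keep the arithmetic small). It computes the moments directly from the hypergeometric probabilities, using exact fractions, and compares them with the closed-form expressions.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k, N, N1, n):
    """P(X = k) when n items are drawn without replacement from N items,
    N1 of which are 'yes'."""
    return Fraction(comb(N1, k) * comb(N - N1, n - k), comb(N, n))


# Hypothetical small population, chosen only for illustration.
N, N1, n = 50, 20, 10
p = Fraction(N1, N)

# Possible values of X: the number of yes respondents in the sample.
support = range(max(0, n - (N - N1)), min(n, N1) + 1)

# Moments computed directly from the probability mass function.
mean_X = sum(k * hypergeom_pmf(k, N, N1, n) for k in support)
var_X = sum((k - mean_X) ** 2 * hypergeom_pmf(k, N, N1, n) for k in support)

# Closed-form expressions from the text.
mean_formula = n * p
var_formula = n * p * (1 - p) * Fraction(N - n, N - 1)

# Variance of p_hat = X / n, directly and by formula.
var_p_hat = var_X / n**2
var_p_hat_formula = p * (1 - p) / n * Fraction(N - n, N - 1)

print(mean_X == mean_formula)
print(var_X == var_formula)
print(var_p_hat == var_p_hat_formula)
```

Because the arithmetic is done with `Fraction`, the comparisons are exact rather than subject to floating-point rounding.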

Then, the Central Limit Theorem tells us that:

\(\dfrac{\hat{p}-p}{\sqrt{\dfrac{p(1-p)}{n} \left(\dfrac{N-n}{N-1}\right) }}\)

follows an approximate standard normal distribution. Now, it's just a matter of doing the typical confidence interval derivation, in which we start with a probability statement, manipulate the quantity inside the parentheses, and substitute sample estimates where necessary. We've done that a number of times now, so skipping all of the details here, we get that an approximate (1−*α*)100% confidence interval for *p* of a small population is:

\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} \cdot \dfrac{N-n}{N-1}}\)

By the way, it is worthwhile noting that if the sample size *n* is much smaller than the population size *N*, that is, if *n* << *N*, then:

\(\dfrac{N-n}{N-1}\approx 1\)

and the confidence interval for *p* of a small population becomes quite similar to the confidence interval for *p* of a large population:

\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\)
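To see how much the correction factor matters, here is a small sketch that computes both intervals. The function name `proportion_ci` and the sample values (463 respondents out of 2000, half of them answering yes) are assumptions made for illustration, and *z* = 1.96 corresponds to 95% confidence.

```python
from math import sqrt


def proportion_ci(p_hat, n, N=None, z=1.96):
    """Approximate confidence interval for a proportion.

    If the population size N is given, the finite population
    correction (N - n) / (N - 1) is applied to the variance.
    """
    variance = p_hat * (1 - p_hat) / n
    if N is not None:
        variance *= (N - n) / (N - 1)
    half_width = z * sqrt(variance)
    return p_hat - half_width, p_hat + half_width


# Hypothetical sample: 463 of 2000 residents, half of whom answer yes.
small = proportion_ci(0.5, n=463, N=2000)  # with the correction
large = proportion_ci(0.5, n=463)          # without the correction

print("small population:", small)
print("large population:", large)
```

The corrected interval is narrower, because sampling a sizable fraction of a finite population without replacement leaves less uncertainty about the remainder.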

A researcher is studying the population of a small town in India of *N* = 2000 people. She's interested in estimating *p* for several yes/no questions on a survey.

How many people *n* does she have to randomly sample (without replacement) to ensure that her estimates \(\hat{p}\) are within *ε* = 0.04 of the true proportion *p*?

**Solution.** Now that we know the correct formula for the confidence interval for *p* of a small population, we can follow the same procedure we used for determining the sample size for estimating a proportion *p* of a large population. The researcher's goal is to estimate *p* so that the error is no larger than 0.04. That is, the goal is to calculate a 95% confidence interval of the form:

\(\hat{p}\pm \epsilon=\hat{p}\pm 0.04\)

Now, we know the formula for an approximate (1−*α*)100% confidence interval for a proportion *p* of a small population is:

\(\hat{p}\pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} \cdot \dfrac{N-n}{N-1}}\)

So, again, we should proceed by equating the terms appearing after each of the above ± signs, and solving for *n*. That is, equate:

\(\epsilon=z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}\cdot \dfrac{N-n}{N-1}}\)

and solve for *n*. Doing the algebra yields:

\(n=\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})/\epsilon^2}{\dfrac{N-1}{N}+\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})}{N\epsilon^2}}\)

That looks simply dreadful! Let's make it look a little more friendly to the eyes:

\(n=\dfrac{m}{1+\dfrac{m-1}{N}}\)

where *m* is defined as the sample size necessary for estimating the proportion *p* for a large population, that is, when a correction for the population being small and finite is not made. That is:

\(m=\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})}{\epsilon^2}\)
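For readers who want to see the algebra, here is one way the steps can go, using the definition of *m* just given. Squaring both sides of the equation for *ε* gives:

\(\epsilon^2=z^2_{\alpha/2}\dfrac{\hat{p}(1-\hat{p})}{n}\cdot \dfrac{N-n}{N-1}\)

Dividing both sides by \(\epsilon^2\), substituting *m*, and multiplying through by \(n(N-1)\) gives:

\(n(N-1)=m(N-n)\)

Collecting the terms involving *n* on the left side:

\(n(N-1+m)=mN\)

Finally, solving for *n* and dividing the numerator and denominator by *N*:

\(n=\dfrac{mN}{N+m-1}=\dfrac{m}{1+\dfrac{m-1}{N}}\)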

Now, before we make the calculation for our particular example, let's take a step back and summarize what we have just learned.

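The sample size necessary for estimating a population proportion *p* of a small, finite population with (1−*α*)100% confidence and error no larger than *ε* is:

\(n=\dfrac{m}{1+\dfrac{m-1}{N}}\)

where:

\(m=\dfrac{z^2_{\alpha/2}\hat{p}(1-\hat{p})}{\epsilon^2}\)

is the sample size necessary for estimating the proportion *p* for a large population.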

A researcher is studying the population of a small town in India of *N* = 2000 people. She's interested in estimating *p* for several yes/no questions on a survey.

How many people *n* does she have to randomly sample (without replacement) to ensure that her estimates \(\hat{p}\) are within *ε* = 0.04 of the true proportion *p*?

**Solution.** Okay, once and for all, let's calculate this very patient researcher's sample size! Because the researcher has many different questions on the survey, it would behoove her to use a sample proportion of 0.50 in her calculations, since \(\hat{p}(1-\hat{p})\) is largest at \(\hat{p}=0.5\) and therefore yields a sample size that is sufficient for every question. If the maximum error *ε* is 0.04, the sample proportion is 0.5, and the researcher doesn't make the finite population correction, then she needs:

\(m=\dfrac{(1.96^2)(\frac{1}{4})}{0.04^2}=600.25\)

or 601 people to estimate *p* with 95% confidence. But, upon making the correction for the small, finite population, we see that the researcher really only needs:

\(n=\dfrac{m}{1+\dfrac{m-1}{N}}=\dfrac{601}{1+\dfrac{601-1}{2000}}=462.3\)

or 463 people to estimate *p* with 95% confidence.
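The calculation above can be reproduced with a short sketch. The function name `sample_size_finite` is an assumption made for illustration. It follows the text in rounding *m* up to a whole number of people before applying the finite population correction.

```python
from math import ceil


def sample_size_finite(N, eps, p_hat=0.5, z=1.96):
    """Return (m, m rounded up, n rounded up) for estimating a proportion.

    m is the large-population sample size; n applies the finite
    population correction for a population of size N.
    """
    m = z**2 * p_hat * (1 - p_hat) / eps**2
    m_rounded = ceil(m)  # round up to whole people, as in the text
    n = m_rounded / (1 + (m_rounded - 1) / N)
    return m, m_rounded, ceil(n)


# The researcher's town: N = 2000, maximum error 0.04, 95% confidence.
m, m_rounded, n = sample_size_finite(N=2000, eps=0.04)
print(m_rounded, n)  # 601 and 463, matching the calculation above
```

Sample sizes are rounded up rather than to the nearest integer, since rounding down would leave the error slightly larger than *ε*.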

The following table illustrates how the sample size *n* that is necessary for estimating a population proportion *p* (with 95% confidence) is affected by the size of the population *N*. If \(\hat{p}=0.5\), then the sample size *n* is:

This table suggests, perhaps not surprisingly, that as the size of the population *N* decreases, so does the necessary size *n* of the sample.
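The relationship between *N* and *n* can be explored with a short sketch. The population sizes below are hypothetical, and the maximum error *ε* = 0.04 is carried over from the example as an assumption, so the output is not a reproduction of the course's own table.

```python
from math import ceil


def sample_size_finite(N, eps=0.04, p_hat=0.5, z=1.96):
    """Sample size for a proportion with the finite population correction."""
    # Large-population sample size, rounded up to whole people.
    m = ceil(z**2 * p_hat * (1 - p_hat) / eps**2)
    return ceil(m / (1 + (m - 1) / N))


# Hypothetical population sizes, chosen only for illustration.
population_sizes = [500, 1000, 2000, 10_000, 100_000, 1_000_000]
sample_sizes = [sample_size_finite(N) for N in population_sizes]

print(f"{'N':>10} {'n':>6}")
for N, n in zip(population_sizes, sample_sizes):
    print(f"{N:>10} {n:>6}")
```

As *N* grows, the correction term (*m*−1)/*N* shrinks toward zero and *n* approaches the large-population value *m*.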