
When data are collected on a pre-determined number of units and are then classified according to two levels of a categorical variable, binomial sampling emerges. Consider the High-risk Drinking Example where we have high-risk drinkers versus non-high-risk drinkers. In this study there was a fixed number of trials (e.g., fixed number of students surveyed, n=1315) where the researcher counted the number of "successes" and the number of "failures" that occurred. We can let X be the number of "successes", that is, the number of students who are high-risk drinkers. We can use the binomial probability distribution (i.e., binomial model) to describe this particular variable.

Binomial distributions are characterized by two parameters: n, which is fixed (this could be the number of trials or the total sample size if we think in terms of sampling), and π, which usually denotes a probability of "success". In our example this would be the probability that someone is a high-risk drinker in the population of Penn State students. Please note that some textbooks will use π to denote the population parameter and p to denote the sample estimate, whereas some may use p for the population parameter as well. We may do both; don't be confused by this, just make sure to read the specification carefully. Once you know n and π, the probability of success, you know the mean and variance of the binomial distribution, and you know everything about that binomial distribution.

Below are the probability mass function, mean, and variance of the binomial variable.

\(f(x)=\dfrac{n!}{x!(n-x)!}\pi^x(1-\pi)^{n-x}\qquad \text{for }x=0,1,2,\ldots,n\)

Mean E (X) = nπ
Variance Var (X) = nπ (1 - π)
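
As a quick numerical check of the formulas above, here is a minimal Python sketch using only the standard library. The function name binom_pmf and the small values n = 10, π = 0.3 are illustrative assumptions, not part of the course materials.

```python
from math import comb

def binom_pmf(x, n, pi):
    """P(X = x) for X ~ Binomial(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

n, pi = 10, 0.3  # illustrative values only
probs = [binom_pmf(x, n, pi) for x in range(n + 1)]

total = sum(probs)                                           # should be 1
mean = sum(x * p for x, p in enumerate(probs))               # should equal n*pi
var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))  # should equal n*pi*(1-pi)

print(total, mean, var)
```

Summing over all x = 0, ..., n recovers a total probability of 1, and the weighted sums reproduce nπ and nπ(1 - π) up to floating-point error.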

Binomial Model (distribution) Assumptions

  • Fixed n: the total number of trials/events, (or total sample size) is fixed.
  • Each event has two possible outcomes, referred to as "successes" or "failures", (e.g., each student can be either a heavy drinker or a non-heavy drinker; heavy drinker being a success here).
  • Independent and Identical Events/Trials:
    • Identical trials means that probability of success is the same for each trial.
    • Independent means that the outcome of one trial does not affect the outcome of the other, (e.g. one student being a heavy drinker or not does not affect the status of the next student, and each student has the same probability, π, of being a heavy drinker.)

Example - Heavy Drinking Students

QUESTION: What is the probability that no students are heavy drinkers, i.e., P(X = 0)?

Let's assume that π = 0.5.

\begin{align}
P(X=0|n=1315,\pi=0.5) &= \binom{n}{x} \pi^x(1-\pi)^{n-x}\\
&= \dfrac{n!}{x!(n-x)!}\pi^x(1-\pi)^{n-x}\\
&= \dfrac{1315!}{0!(1315-0)!}(0.5)^0(0.5)^{1315}\\
&= 1\cdot1\cdot(0.5)^{1315}\\
&\approx 0\\
\end{align}
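
A direct floating-point evaluation of (0.5)^1315 underflows to zero, so the sketch below works on the log scale to show how small this probability really is. The helper name log10_binom_pmf is an assumption made for illustration; it uses math.lgamma for the logarithm of the binomial coefficient.

```python
from math import lgamma, log

def log10_binom_pmf(x, n, pi):
    """log10 of P(X = x) for X ~ Binomial(n, pi), computed on the log scale."""
    log_coef = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    log_p = log_coef + x * log(pi) + (n - x) * log(1 - pi)
    return log_p / log(10)

log10_p0 = log10_binom_pmf(0, 1315, 0.5)
print(log10_p0)  # roughly -395.9, i.e. P(X = 0) is on the order of 10^(-396)
```

The same function can be evaluated at any other x between 0 and n.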

 

Discuss: What's the probability that there are X = 1000 heavy drinkers in this example?


QUESTION: What is the true population proportion of students who are high-risk drinkers at Penn State?

This is a statistical inference question that can be answered with a point estimate, confidence intervals and hypothesis tests about proportions. The binomial likelihood function L(π; x) measures how well a candidate value of the population proportion π agrees with the observed data x. The Maximum Likelihood Estimate (MLE) is the most likely value for π given the observed data, and for the binomial distribution this is the sample proportion (the sample mean of the 0/1 outcomes),

\(\hat{\pi}=\dfrac{\sum x_i}{n}=\dfrac{x}{n}\)

and for the expected counts,

\(\hat{\mu}=\hat{\pi}\cdot n\)

Thus for our example, assuming the Binomial distribution, our "best" guess estimate of the true proportion of students who are high-risk drinkers is

\(p=\hat{\pi}=\dfrac{630}{1315}=0.48\)
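
The claim that x/n maximizes the likelihood can be checked numerically. The sketch below is only an illustration under assumed choices (a grid of step 0.001 and the helper name loglik); it evaluates the binomial log-likelihood on the grid and compares the maximizer with x/n.

```python
from math import log

x, n = 630, 1315  # high-risk drinkers and sample size from the example

def loglik(pi):
    """Binomial log-likelihood, dropping the constant binomial coefficient."""
    return x * log(pi) + (n - x) * log(1 - pi)

grid = [i / 1000 for i in range(1, 1000)]  # 0.001, 0.002, ..., 0.999
pi_grid_max = max(grid, key=loglik)        # grid point with the highest log-likelihood
pi_hat = x / n                             # closed-form MLE

print(pi_grid_max, round(pi_hat, 3))
```

The grid maximizer agrees with x/n to within the grid spacing.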

Here are also the likelihood and loglikelihood graphs for our example. We can see that the peak of the likelihood is at the proportion value equal to 0.48. The loglikelihood looks quadratic which means that the large-sample normal theory should work fine, and we can use the approximate 95% confidence intervals.

[Figures: likelihood and loglikelihood plots for the example]

The MLE is used for statistical inference, such as testing for a specific value of π, or giving a range of likely values of π, via the 95% confidence interval. A key result here comes from understanding the properties of the sampling distribution of the sample proportion p.  

The Rule for Sample Proportion: If numerous samples of size n are taken with n large enough, the frequency curve of the sample proportions from the various samples will be approximately normal with mean E(p) = π and variance Var(p) = π(1 - π)/n. "Large enough" usually means that the expected numbers of successes and failures are not small, i.e., nπ ≥ 5 and n(1 - π) ≥ 5. The larger the sample size n, the better the sampling distribution of p is approximated by a normal distribution. Note that the sampling distribution of p is really a discrete rather than a continuous distribution, but we rely on the above described approximation for statistical inference unless we deal with small samples; for the latter case see Agresti (2007), Sec 1.4.3-1.4.5.
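
The rule can be illustrated with a small simulation. This is a sketch under assumed settings (n = 200, π = 0.5, 2000 repeated samples, fixed seed), chosen only to keep the run short; none of these values come from the drinking example.

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)        # fixed seed so the run is repeatable
n, pi, reps = 200, 0.5, 2000  # illustrative values only

def sample_proportion():
    """Proportion of successes in n independent Bernoulli(pi) trials."""
    return sum(rng.random() < pi for _ in range(n)) / n

props = [sample_proportion() for _ in range(reps)]

sim_mean = mean(props)           # compare with pi
sim_var = pvariance(props)       # compare with pi * (1 - pi) / n
theory_var = pi * (1 - pi) / n

print(sim_mean, sim_var, theory_var)
```

The simulated mean and variance of the sample proportions should land close to π and π(1 - π)/n, and a histogram of props would look roughly bell-shaped.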

QUESTION: Is the population proportion of heavy-drinkers significantly different from 50%?

Large-sample hypothesis test about π

H0: π = π0 vs. HA: π ≠ π0

The z test statistic: \(z=\dfrac{p-\pi_0}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}\) where p is the sample proportion, in our case 0.48, and π0 is the value we are testing for, 0.5.

H0: π = 0.5 vs. HA: π ≠ 0.5

The statistic z = -1.45 with two-sided p-value of 0.136. Thus we do not have very strong evidence that the proportion of high-risk drinkers is different from 50%; i.e., we do not reject the null hypothesis.
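
A minimal sketch of this test in Python follows, using statistics.NormalDist from the standard library for the normal tail probability. It uses the unrounded p = 630/1315 and no continuity correction, which are assumptions of this sketch, so its output can differ in the second or third decimal from the figures quoted above.

```python
from math import sqrt
from statistics import NormalDist

x, n, pi0 = 630, 1315, 0.5
p = x / n  # unrounded sample proportion

# z statistic with the standard error evaluated under H0
z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

# two-sided p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 2), round(p_value, 3))
```

Either way the p-value is well above 0.05, which matches the conclusion not to reject H0.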

Confidence interval for π

The usual (1 - α) × 100% CI is:

\(p\pm z_{\alpha/2}\sqrt{\dfrac{p(1-p)}{n}}\)

This interval is known as the Wald confidence interval.

For our example, the 95% CI is 0.48 ± 1.96 × 0.014 = (0.453, 0.507). We can be 95% confident that the true population proportion of students who are high-risk drinkers is between 0.453 and 0.507.
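
The Wald interval is easy to compute directly. In the sketch below the function name wald_ci is a hypothetical helper, and the unrounded p = 630/1315 is used, so the endpoints may differ in the third decimal from the hand calculation, which rounds p to 0.48 and the standard error to 0.014.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(x, n, conf=0.95):
    """Wald confidence interval for a binomial proportion."""
    p = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # e.g. about 1.96 for 95%
    half_width = z * sqrt(p * (1 - p) / n)        # standard error uses p itself
    return p - half_width, p + half_width

lo, hi = wald_ci(630, 1315)
print(round(lo, 3), round(hi, 3))
```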

However, when the proportions are extremely small or large, π < 0.20 or π > 0.80, this CI does not work very well. It is better to consider the likelihood-ratio-based CI, as discussed in Lesson 1. This interval is more complex computationally but simple in essence: we evaluate the likelihood function plotted below and look for all possible values of π0 for which the null hypothesis would not be rejected. In our case we get the 95% CI to be (0.453, 0.506); here is a plot of that confidence interval.

[Figure: likelihood-ratio-based 95% confidence interval for π]
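
One way to obtain a likelihood-ratio-based interval numerically is sketched below. It is not the drinking.R or drinking.sas code; it simply collects the π0 values whose likelihood-ratio statistic stays below the 95% chi-square cutoff with 1 degree of freedom, locating the two boundaries by bisection. The helper names loglik, lr_stat and boundary are assumptions of this sketch, and it uses the unrounded p = 630/1315, so the endpoints may differ slightly in the third decimal from the interval reported above.

```python
from math import log
from statistics import NormalDist

x, n = 630, 1315
p = x / n
crit = NormalDist().inv_cdf(0.975) ** 2  # 95% chi-square(1) cutoff, about 3.84

def loglik(pi):
    """Binomial log-likelihood, dropping the constant binomial coefficient."""
    return x * log(pi) + (n - x) * log(1 - pi)

def lr_stat(pi0):
    """Likelihood-ratio statistic for H0: pi = pi0."""
    return 2 * (loglik(p) - loglik(pi0))

def boundary(inside, outside, tol=1e-8):
    """Bisect between a point inside the interval and one outside it."""
    while abs(outside - inside) > tol:
        mid = (inside + outside) / 2
        if lr_stat(mid) <= crit:
            inside = mid   # mid is not rejected, so it belongs to the interval
        else:
            outside = mid  # mid is rejected, so the boundary is closer to p
    return inside

lr_lo = boundary(p, 1e-6)
lr_hi = boundary(p, 1 - 1e-6)
print(round(lr_lo, 3), round(lr_hi, 3))
```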

To do the above calculations in R and SAS, see the drinking.R and drinking.sas files below. Also, watch the viewlets that will walk you through how these programs work.

Here is the SAS program drinking.sas.

And click on the 'Inspect' icon to see a walk through of the last part of this program.

 

Here is the R program drinking.R.

And, here is a walk-through of this program.

If you encounter an error with the sink() function, please see the following page with  support materials for R.

 

The third alternative, also a likelihood-based confidence interval, is known as the Score confidence interval. In essence it looks for ALL π0 values that yield the desired test statistic, e.g., for a 95% CI, zα/2 = ±1.96. To simplify, we need to solve the following equation:

\(\dfrac{|p-\pi_0|}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}=z_{\alpha/2}\)

which is the same as:

\(\dfrac{p-\pi_0}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}=z_{\alpha/2} \text{ and } \dfrac{p-\pi_0}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}=-z_{\alpha/2}\)

To do this, we can solve the following quadratic equation:

\(\left(1+\dfrac{z^2_{\alpha/2}}{n}\right)\pi^2_0+\left(-2p-\dfrac{z^2_{\alpha/2}}{n}\right)\pi_0+p^2=0\)

\(a\pi^2_0+b\pi_0+c=0\)

\(\pi_0=\dfrac{-b\pm \sqrt{b^2-4ac}}{2a}\)

In this example, the 95% Score CI is the same as the Wald CI, (0.453, 0.507). Notice that the only difference between the Wald CI and the Score CI is the standard error: in the former it is calculated using the sample proportion p, and in the latter using the null value π0.
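
The quadratic above can be solved directly. The sketch below uses a hypothetical helper named score_ci with the coefficients a, b and c exactly as defined in the quadratic equation; it uses the unrounded p = 630/1315, so the third decimal may differ from results based on p = 0.48.

```python
from math import sqrt
from statistics import NormalDist

def score_ci(x, n, conf=0.95):
    """Score confidence interval from the roots of a*pi0^2 + b*pi0 + c = 0."""
    p = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    a = 1 + z**2 / n
    b = -2 * p - z**2 / n
    c = p**2
    root = sqrt(b**2 - 4 * a * c)
    return (-b - root) / (2 * a), (-b + root) / (2 * a)

s_lo, s_hi = score_ci(630, 1315)
print(round(s_lo, 3), round(s_hi, 3))
```

Each endpoint is a value of π0 at which the z statistic equals ±zα/2 in absolute value.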

Traditionally, most software packages will give you the Wald CI, but nowadays we are starting to see the score and the likelihood-ratio ones too.

Margin of error (MOE)

The margin of error is the standard error times the appropriate multiplier for the desired confidence level. So the 95% CI is p ± MOE, and the width of the confidence interval is twice the MOE; see the CNN example for a review. For a given n, the confidence interval is widest when the proportion is 0.5. This knowledge is useful in determining sample size for given conditions.
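
To see why 0.5 is the worst case, the following sketch evaluates the margin of error over a grid of proportions for the example's n = 1315. The function name margin_of_error and the grid step of 0.01 are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(pi, n, conf=0.95):
    """Multiplier times the standard error of a sample proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return z * sqrt(pi * (1 - pi) / n)

n = 1315
grid = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
widest_at = max(grid, key=lambda pi: margin_of_error(pi, n))

print(widest_at, round(margin_of_error(0.5, n), 3))
```

Because π(1 - π) peaks at π = 0.5, the margin of error does too, which is why 0.5 is the conservative choice when planning a sample size.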

Sample Size

Often we are interested in knowing what sample size is needed for a specific margin of error for a population proportion. For example, how large a sample would we need so that the 99% confidence interval has MOE e? We solve the following for n:

\(e=2.575\sqrt{\pi(1-\pi)/n}\)

\(n=(2.575)^2 \pi(1-\pi)/e^2\)

Since π is unknown, take π = 0.5 to get the largest possible sample size. This will guarantee our MOE is not exceeded regardless of what sample proportion we end up with. For example, if the required MOE is 3%:

\(n=(2.575)^2(0.5)(0.5)/(0.03)^2=1841.84\)

which is rounded up to 1842.
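
The calculation can be wrapped in a small function. In this sketch the name sample_size is hypothetical, and the default multiplier 2.575 is the rounded value used in the formula above; passing the unrounded normal quantile from statistics.NormalDist gives a slightly larger n.

```python
from math import ceil
from statistics import NormalDist

def sample_size(moe, pi=0.5, z=2.575):
    """Smallest n with z * sqrt(pi * (1 - pi) / n) <= moe."""
    return ceil(z**2 * pi * (1 - pi) / moe**2)

n_text = sample_size(0.03)              # uses the rounded multiplier 2.575
z_exact = NormalDist().inv_cdf(0.995)   # unrounded 99% multiplier
n_exact = sample_size(0.03, z=z_exact)

print(n_text, n_exact)
```

Rounding up rather than to the nearest integer is what guarantees that the margin of error does not exceed the target.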

For additional simple Binomial distribution calculations see Lesson 0 and SimpleBinomialCal.R code with its viewlet that walks you through the code.