2.7 - Goodness-of-Fit Tests: Cell Probabilities Functions of Unknown Parameters

For many statistical models, we do not know the vector of probabilities π a priori, but can only specify it up to some unknown parameters. More specifically, the cell proportions may be known functions of one or more other unknown parameters.

Hardy-Weinberg problem. Suppose that a gene is either dominant (A) or recessive (a), and the overall proportion of dominant genes in the population is p. If we assume mating is random (i.e. members of the population choose their mates in a manner that is completely unrelated to this gene), then the three possible genotypes—AA, Aa, and aa—should occur in the so-called Hardy-Weinberg proportions:

| Genotype | Proportion | No. of dominant genes |
|---|---|---|
| AA | \(\pi_1 = p^2\) | 2 |
| Aa | \(\pi_2 = 2p(1-p)\) | 1 |
| aa | \(\pi_3 = (1-p)^2\) | 0 |

Note that this is equivalent to saying that the number of dominant genes that an individual has (0, 1, or 2) is distributed as Bin(2, p), where the parameter p is not specified. We must first estimate p before we can estimate (i.e., say something about) the unknown cell proportions in the vector π.
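As a minimal sketch of this idea, the Python code below computes the Hardy-Weinberg proportions for a given p, checks that they agree with the Bin(2, p) probabilities, and estimates p from genotype counts. The function names and the genotype counts are hypothetical and chosen only for illustration; the estimator used is the usual ML estimate, the observed number of dominant genes divided by the total number of genes (two per individual).

```python
from math import comb

def hardy_weinberg(p):
    """Return (pi_AA, pi_Aa, pi_aa) for a dominant-gene proportion p."""
    return (p ** 2, 2 * p * (1 - p), (1 - p) ** 2)

def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def estimate_p(n_AA, n_Aa, n_aa):
    """ML estimate of p: dominant genes observed / total genes (2 per person)."""
    return (2 * n_AA + n_Aa) / (2 * (n_AA + n_Aa + n_aa))

# Hypothetical genotype counts (AA, Aa, aa); not data from the text.
counts = (30, 50, 20)
p_hat = estimate_p(*counts)
pi_hat = hardy_weinberg(p_hat)

# The three proportions match Bin(2, p) for 2, 1, and 0 dominant genes.
binom = tuple(binomial_pmf(k, 2, p_hat) for k in (2, 1, 0))
print(p_hat, pi_hat, binom)
```

Once p is estimated, every cell proportion follows from it, which is why this model has only one free parameter.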

Number of Children (The Poisson Model). Suppose that we observe the following numbers of children in n = 100 families:

| No. of children | 0 | 1 | 2 | 3 | 4+ |
|---|---|---|---|---|---|
| Count | 19 | 26 | 29 | 13 | 13 |

Are these data consistent with a Poisson distribution? Recall that if a random variable Y has a Poisson distribution with mean λ, then

\(P(Y=y)=\dfrac{\lambda^y e^{-\lambda}}{y!}\)

for y = 0, 1, 2, . . . . Therefore, under the Poisson model, the cell proportions, for some unknown λ, are those given in the table below. For example, $\pi_1=P(Y=0)=\dfrac{\lambda^0 e^{-\lambda}}{0!}=e^{-\lambda}$.

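| No. of children | Proportion |
|---|---|
| 0 | \(\pi_1 = e^{-\lambda}\) |
| 1 | \(\pi_2 = \lambda e^{-\lambda}\) |
| 2 | \(\pi_3 = \lambda^2 e^{-\lambda}/2\) |
| 3 | \(\pi_4 = \lambda^3 e^{-\lambda}/6\) |
| 4+ | \(\pi_5 = 1 - \sum_{j=1}^{4} \pi_j\) |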
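The table can be read as a function that maps λ to a vector of five cell proportions. A minimal Python sketch follows; the function name `poisson_cell_probs` and the value λ = 2 are illustrative assumptions, not part of the original example.

```python
from math import exp, factorial

def poisson_cell_probs(lam, top=4):
    """Cell probabilities for counts 0, 1, ..., top-1 and an open 'top+' cell."""
    probs = [lam ** y * exp(-lam) / factorial(y) for y in range(top)]
    probs.append(1.0 - sum(probs))  # the open-ended cell takes the remainder
    return probs

# lam = 2.0 is an arbitrary illustrative value, not an estimate from the data.
probs = poisson_cell_probs(2.0)
print(probs)
```

Defining the last cell as the remainder guarantees that the proportions sum to one, exactly as in the last row of the table.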

In both of these examples, the null hypothesis is that the multinomial probabilities \(\pi_j\) depend on one or more unknown parameters in a known way. In the children's example, we might want to know the proportion of families in the sampled population that have 2 children. In more general notation, the model specifies that:

\(\pi_1 = g_1(\theta)\),
\(\pi_2 = g_2(\theta)\),
...
\(\pi_k = g_k(\theta)\),

where \(g_1, g_2, \ldots, g_k\) are known functions but the parameter θ is unknown (e.g., $\lambda$ in the children's example or $p$ in the genetics example). Let \(S_0\) denote the set of all π that satisfy these constraints for some parameter θ. We want to test

\(H_0 : \pi \in S_0\)  versus  \(H_1 : \pi \in S\),

where S denotes the probability simplex, i.e., the space of all possible values of π. (Notice that S is a (k − 1)-dimensional space, but the dimension of \(S_0\) is the number of free parameters in θ.)

The method for conducting this test is as follows.

  1. Estimate θ by an efficient method (e.g., maximum likelihood). Call the estimate \(\hat{\theta}\).
  2. Calculate the estimated cell probabilities \(\hat{\pi}=(\hat{\pi}_1,\hat{\pi}_2,\ldots,\hat{\pi}_k)\), where

    \(\hat{\pi}_1=g_1(\hat{\theta})\)
    \(\hat{\pi}_2=g_2(\hat{\theta})\)
    ...
    \(\hat{\pi}_k=g_k(\hat{\theta})\)

  3. Calculate the goodness-of-fit statistics \(X^2(x,\hat{\pi})\) and \(G^2(x,\hat{\pi})\). That is, calculate the expected cell counts \(E_1=n\hat{\pi}_1\), \(E_2=n\hat{\pi}_2\), . . ., \(E_k=n\hat{\pi}_k\), and find

    \(X^2=\sum\limits_j \dfrac{(O_j-E_j)^2}{E_j}\) and \(G^2=2\sum\limits_j O_j \log\dfrac{O_j}{E_j}\)

    as usual.

If \(X^2(x,\hat{\pi})\) and \(G^2(x,\hat{\pi})\) are calculated as described above, then under the null hypothesis the distribution of both \(X^2\) and \(G^2\) approaches \(\chi^2_\nu\) as n → ∞, where ν equals the number of unknown parameters under the alternative hypothesis minus the number of unknown parameters under the null hypothesis: ν = (k − 1) − d, where d = dim(θ), i.e., the number of parameters in θ.

The difference between this result and the previous one is that the expected cell counts \(E_1, E_2, \ldots, E_k\) used to calculate \(X^2\) and \(G^2\) now depend on unknown parameters that must be estimated. Because we need to estimate d parameters to find \(E_1, E_2, \ldots, E_k\), the large-sample distribution of \(X^2\) and \(G^2\) has changed; it's still a chi-squared distribution, but the degrees of freedom have dropped by d, the number of unknown parameters we first need to estimate.
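The procedure can be sketched in a few lines of Python. The helper name `gof_statistics` is hypothetical, the estimated cell probabilities are assumed to have been computed already from \(\hat{\theta}\), and the genotype counts in the usage example are made up for illustration.

```python
from math import log

def gof_statistics(observed, pi_hat, d):
    """Return (X2, G2, df) given observed counts, fitted cell probabilities,
    and d, the number of parameters estimated to obtain pi_hat."""
    n = sum(observed)
    expected = [n * p for p in pi_hat]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Cells with O_j = 0 contribute 0 to G2 (the limit of o * log(o / e)).
    g2 = 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
    df = (len(observed) - 1) - d
    return x2, g2, df

# Hypothetical Hardy-Weinberg example: counts for AA, Aa, aa (not from the text).
observed = [30, 50, 20]
p_hat = (2 * 30 + 50) / (2 * 100)   # ML estimate of p from these counts
pi_hat = [p_hat ** 2, 2 * p_hat * (1 - p_hat), (1 - p_hat) ** 2]
x2, g2, df = gof_statistics(observed, pi_hat, d=1)
print(x2, g2, df)
```

With k = 3 cells and d = 1 estimated parameter, the statistics would be referred to a chi-squared distribution with (3 − 1) − 1 = 1 degree of freedom.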

Example - Number of Children, continued

Are the data below consistent with a Poisson model?

| No. of children | 0 | 1 | 2 | 3 | 4+ |
|---|---|---|---|---|---|
| Count | 19 | 26 | 29 | 13 | 13 |

Let's test the null hypothesis that these data are Poisson. First, we need to estimate λ, the mean of the Poisson distribution; since this is a single parameter, $d=1$ here. Recall that if we have an iid sample \(y_1, y_2, \ldots, y_n\) from a Poisson distribution, then the ML estimate of λ is just the sample mean, \(\hat{\lambda}=n^{-1}\sum_{i=1}^n y_i\). Based on the table above, we know that the original data \(y_1, y_2, \ldots, y_n\) contained 19 values of 0, 26 values of 1, and so on; however, we don't know the exact values of the original data that fell into the category 4+. Without those values, finding the MLE of λ requires fairly involved numerical computation. To keep matters simple, suppose for now that of the 13 values that were classified as 4+, ten were equal to 4 and three were equal to 5. Then the ML estimate of λ is

\(\hat{\lambda}=\dfrac{19(0)+26(1)+29(2)+13(3)+10(4)+3(5)}{100}=1.78\)
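The sketch below reproduces this estimate in Python and, for comparison, also maximizes the grouped-data likelihood that uses only the five observed cell counts, with no assumption about how the 4+ cell splits. The ternary search, its bracket [0.5, 5], and the function names are illustrative assumptions; the search relies on the log-likelihood having a single peak inside that bracket.

```python
from math import exp, factorial, log

counts = [19, 26, 29, 13, 13]          # cells 0, 1, 2, 3, 4+

# (a) Sample mean under the working assumption about the 4+ cell.
values = [0, 1, 2, 3, 4, 5]
freqs = [19, 26, 29, 13, 10, 3]        # 4+ split as ten 4s and three 5s
lam_mean = sum(v * f for v, f in zip(values, freqs)) / sum(freqs)

# (b) Grouped-data log-likelihood (up to a constant): sum_j O_j log pi_j(lam).
def cell_probs(lam):
    probs = [lam ** y * exp(-lam) / factorial(y) for y in range(4)]
    return probs + [1.0 - sum(probs)]

def loglik(lam):
    return sum(o * log(p) for o, p in zip(counts, cell_probs(lam)))

def maximize(f, lo, hi, tol=1e-8):
    """Ternary search for the maximizer of a single-peaked function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

lam_grouped = maximize(loglik, 0.5, 5.0)   # bracket is an assumption
print(lam_mean, lam_grouped)
```

The two estimates need not coincide, since the first depends on the assumed split of the 4+ cell; the rest of this example continues with the sample-mean value.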

Under this estimate of λ, the expected counts for the first four cells (0, 1, 2, and 3 children, respectively) are

\(E_1 = 100e^{-1.78} = 16.86\),
\(E_2 = 100(1.78)e^{-1.78} = 30.02\),
\(E_3 = 100(1.78)^2e^{-1.78}/2 = 26.72\),
\(E_4 = 100(1.78)^3e^{-1.78}/6 = 15.85\).

The expected count for the 4+ cell is most easily found by noting that \(\sum_j E_j = n\), and thus

\(E_5 = 100 - (16.86 + 30.02 + 26.72 + 15.85) = 10.55\).
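These expected counts can be checked with a short Python sketch; the variable names are illustrative.

```python
from math import exp, factorial

n, lam_hat = 100, 1.78

# Expected counts for the cells 0, 1, 2, 3 under Poisson(lam_hat).
expected = [n * lam_hat ** y * exp(-lam_hat) / factorial(y) for y in range(4)]
# The 4+ cell takes whatever is left so that the counts sum to n.
expected.append(n - sum(expected))

print([round(e, 2) for e in expected])   # [16.86, 30.02, 26.72, 15.85, 10.55]
```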

This leads to:

\(X^2 = 2.08\) and \(G^2 = 2.09\).
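As a check, the following Python sketch recomputes both statistics from the observed counts and the fitted Poisson expected counts (unrounded); the variable names are illustrative.

```python
from math import exp, factorial, log

observed = [19, 26, 29, 13, 13]
n, lam_hat = sum(observed), 1.78

# Fitted expected counts: cells 0-3 from the Poisson pmf, 4+ as the remainder.
expected = [n * lam_hat ** y * exp(-lam_hat) / factorial(y) for y in range(4)]
expected.append(n - sum(expected))

# Pearson and deviance (likelihood-ratio) statistics.
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
g2 = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))

print(round(x2, 2), round(g2, 2))   # close to the 2.08 and 2.09 reported above
```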

Since the general multinomial model here has k − 1 = 4 parameters, where k is the number of cells (that is, the number of $\pi_j$'s), and the Poisson model has just one parameter $\lambda$, the degrees of freedom for this test are ν = 4 − 1 = 3, and the p-values are

\(P(\chi^2_3\geq2.08)=0.56\)

\(P(\chi^2_3\geq2.09)=0.55\)
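These tail probabilities can be reproduced with only the Python standard library. The helper `chi2_sf` below is a hypothetical name; it is a sketch for integer degrees of freedom that builds the upper-tail probability from \(e^{-x/2}\) (even df) or the complementary error function (odd df) using the standard recurrence for the regularized upper incomplete gamma function.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """P(chi-squared with integer df >= x), for x >= 0."""
    h = x / 2
    if df % 2 == 0:
        a, q = 1.0, exp(-h)          # df = 2: Q(1, h) = exp(-h)
    else:
        a, q = 0.5, erfc(sqrt(h))    # df = 1: Q(1/2, h) = erfc(sqrt(h))
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while a < df / 2:
        q += h ** a * exp(-h) / gamma(a + 1)
        a += 1
    return q

p_x2 = chi2_sf(2.08, 3)
p_g2 = chi2_sf(2.09, 3)
print(round(p_x2, 2), round(p_g2, 2))   # 0.56 0.55
```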

The Poisson model seems to fit well; there is no evidence that these data are not Poisson. Below is an example of how to do these computations using R and SAS.

Here is how to fit the Poisson model in R using the following code:

The function dpois() calculates Poisson probabilities. You can also get the \(X^2\) statistic in R by using the function chisq.test(ob, p=pihat) in the above code, but notice in the output below that the degrees of freedom, and thus the p-value, are not correct (the function uses k − 1 = 4 because it does not know that λ was estimated):

Chi-squared test for given probabilities

data: ob
X-squared = 2.0846, df = 4, p-value = 0.7202

You can use this \(X^2\) statistic, but you need to recalculate the p-value based on the correct degrees of freedom (here ν = 3) in order to obtain correct inference.

Here is how we can do this goodness-of-fit test in SAS by using pre-calculated proportions (pihats). The TESTP option specifies expected proportions for a one-way table chi-square test. But notice in the output below that the degrees of freedom, and thus the p-value, are not correct:

     Chi-Square Test
for Specified Proportions
-------------------------
Chi-Square         2.0892
DF                      4
Pr > ChiSq         0.7194

You can use this \(X^2\) statistic, but you need to recalculate the p-value based on the correct degrees of freedom (here ν = 3) in order to obtain correct inference.
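The correction amounts to referring the reported statistic to \(\chi^2_3\) instead of \(\chi^2_4\). The Python sketch below uses the closed-form upper-tail probabilities for 3 and 4 degrees of freedom; the function names are illustrative, and the two statistics are the values printed in the R and SAS output above.

```python
from math import erfc, exp, pi, sqrt

def sf_df4(x):
    """P(chi-squared_4 >= x): the df that the software assumed."""
    return exp(-x / 2) * (1 + x / 2)

def sf_df3(x):
    """P(chi-squared_3 >= x): the df appropriate after estimating lambda."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

for label, stat in [("R", 2.0846), ("SAS", 2.0892)]:
    # First value reproduces the software's p-value; second is the corrected one.
    print(label, round(sf_df4(stat), 4), round(sf_df3(stat), 4))
```

With df = 4 the sketch reproduces the p-values printed by the software (about 0.72); with df = 3 the p-values drop to roughly 0.55 to 0.56, in line with the hand calculation earlier in the example.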

children.sas (program - text file)
children.lst (output - text file)