7.2.6 - Model Assumptions and Diagnostics Assumptions

In carrying out any statistical analysis it is always important to consider the assumptions for the analysis and confirm that all assumptions are satisfied.

Let's recall the four assumptions underlying the two-sample Hotelling's T-square test.

1. The data from population i is sampled from a population with mean vector $\boldsymbol{\mu}_{i}$.
2. The data from both populations have a common variance-covariance matrix $\Sigma$.
3. Independence. The subjects from both populations are independently sampled.
Note! This does not mean that the variables are independent of one another.
4. Normality. Both populations are multivariate normally distributed.

We now consider each of these assumptions separately, along with methods for diagnosing their validity.

1. Assumption 1: The data from population i is sampled from a population with mean vector $\boldsymbol{\mu}_{i}$.

• This assumption essentially means that there are no subpopulations with different population mean vectors.
• In our current example, this might be violated if the counterfeit notes were produced by more than one counterfeiter.
• Generally, if you have randomized experiments, this assumption is not of any concern. However, in the current application we would have to ask the police investigators whether more than one counterfeiter might be present.
2. Assumption 2: For now we will skip Assumption 2 and return to it at a later time.

3. Assumption 3: Independence

• The subjects from each population are independently sampled. This does not mean that the variables are independent of one another.
• This assumption may be violated for three different reasons:
    • Clustered data: If bank notes are produced in batches, then the data may be clustered. In this case, the notes sampled within a batch may be correlated with one another.
    • Time-series data: If the notes are produced in some order over time, notes produced at times close to one another may be more similar. The resulting temporal correlation would violate the independence assumption.
    • Spatial data: If the data were collected over space, we may encounter some spatial correlation.
Note! The results of Hotelling's T-square are not generally robust to violations of independence.
4. Assumption 4: Multivariate Normality

To assess this assumption we can use the following diagnostic procedures:

• Produce histograms for each variable. We should look for a symmetric distribution.
• Produce scatter plots for each pair of variables. Under multivariate normality, we should see an elliptical cloud of points.
• Produce a three-dimensional rotating scatter plot. Again, we should see an elliptical cloud of points.

Note! The Central Limit Theorem implies that, for reasonably large samples, the sample mean vectors are going to be approximately multivariate normally distributed regardless of the distribution of the original variables.

So, in general, Hotelling's T-square is not going to be sensitive to violations of this assumption.

Now let us return to assumption 2.

Assumption 2. The data from both populations have a common variance-covariance matrix $\Sigma$.

This assumption may be assessed using Bartlett's Test.

Bartlett's Test

Suppose that the data from population i have variance-covariance matrix $\Sigma_i$, for i = 1, 2. We need to test the null hypothesis that $\Sigma_1$ is equal to $\Sigma_2$ against the general alternative that they are not equal, as shown below:

$H_0\colon \Sigma_1 = \Sigma_2$ against $H_a\colon \Sigma_1 \ne \Sigma_2$

Here, the alternative is that the variance-covariance matrices differ in at least one of their elements.

The test statistic for Bartlett's Test is given by L-prime as shown below:

$L' = c\{(n_1+n_2-2)\log{|\mathbf{S}_p|}- (n_1-1)\log{|\mathbf{S}_1|} - (n_2-1)\log{|\mathbf{S}_2|}\}$

Note! In this formula, all logs are natural logs.

The statistic involves a small-sample correction factor, c, which is given below:

$c = 1-\dfrac{2p^2+3p-1}{6(p+1)}\left\{\dfrac{1}{n_1-1}+\dfrac{1}{n_2-1} - \dfrac{1}{n_1+n_2-2}\right\}$

It is a function of the number of variables p, and the sample sizes $n_{1}$ and $n_{2}$.
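
As a worked illustration of the arithmetic, suppose there are p = 6 variables and samples of size $n_1 = n_2 = 100$. These sample sizes are an assumption made for this example; they are not stated in this section.

$\dfrac{2p^2+3p-1}{6(p+1)} = \dfrac{2(36)+3(6)-1}{6(7)} = \dfrac{89}{42}$

$\dfrac{1}{n_1-1}+\dfrac{1}{n_2-1} - \dfrac{1}{n_1+n_2-2} = \dfrac{1}{99}+\dfrac{1}{99}-\dfrac{1}{198} = \dfrac{1}{66}$

$c = 1 - \dfrac{89}{42}\cdot\dfrac{1}{66} = 1 - \dfrac{89}{2772} \approx 0.9679$

With samples of this size the correction is small: c shrinks the uncorrected statistic by only about 3%.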

Under the null hypothesis, $H_{0}\colon \Sigma_{1} = \Sigma_{2}$, Bartlett's test statistic is approximately chi-square distributed with p(p + 1)/2 degrees of freedom. That is,

$L' \overset{\cdot}{\sim} \chi^2_{p(p+1)/2}$

The degrees of freedom are equal to the number of unique elements in the variance-covariance matrix (taking into account that this matrix is symmetric). We reject $H_0$ at level $\alpha$ if the test statistic exceeds the upper-$\alpha$ critical value of the chi-square distribution with p(p + 1)/2 degrees of freedom:

$L' > \chi^2_{p(p+1)/2,\, \alpha}$
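
The computation of $L'$ can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's SAS program: the function names `log_det` and `bartlett_two_sample` are hypothetical, and $\mathbf{S}_p$ is assumed to be the usual pooled estimate $\{(n_1-1)\mathbf{S}_1 + (n_2-1)\mathbf{S}_2\}/(n_1+n_2-2)$. The inputs are the two sample variance-covariance matrices, given as lists of rows, and the two sample sizes.

```python
import math


def log_det(matrix):
    """Natural log of |det| of a square matrix, computed by Gaussian
    elimination with partial pivoting. For a positive-definite
    covariance matrix this equals log(det)."""
    a = [[float(v) for v in row] for row in matrix]
    n = len(a)
    total = 0.0
    for k in range(n):
        # Use the largest remaining entry in column k as the pivot.
        pivot_row = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[pivot_row][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[pivot_row] = a[pivot_row], a[k]
        total += math.log(abs(a[k][k]))
        for r in range(k + 1, n):
            factor = a[r][k] / a[k][k]
            for col in range(k, n):
                a[r][col] -= factor * a[k][col]
    return total


def bartlett_two_sample(s1, s2, n1, n2):
    """Return (L', degrees of freedom) for H0: Sigma_1 = Sigma_2."""
    p = len(s1)
    w1, w2 = n1 - 1, n2 - 1
    # Pooled variance-covariance matrix (assumed definition, see lead-in).
    sp = [[(w1 * s1[i][j] + w2 * s2[i][j]) / (w1 + w2) for j in range(p)]
          for i in range(p)]
    # Correction factor c from the formula above.
    c = 1 - (2 * p * p + 3 * p - 1) / (6 * (p + 1)) * (
        1 / w1 + 1 / w2 - 1 / (w1 + w2))
    stat = c * ((w1 + w2) * log_det(sp)
                - w1 * log_det(s1) - w2 * log_det(s2))
    df = p * (p + 1) // 2
    return stat, df


# Made-up 2 x 2 example: the second population is four times as variable.
stat, df = bartlett_two_sample([[1, 0], [0, 1]], [[4, 0], [0, 4]], 11, 11)
print(stat, df)
```

The statistic returned here is then compared with the chi-square critical value for the returned degrees of freedom. When the two sample matrices are identical, the pooled matrix equals both of them and the statistic is zero.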

Using SAS

Bartlett's Test may be carried out using a SAS program.

The output can be downloaded here: swiss15.lst

Using Minitab

At this time Minitab does not support this procedure.

Analysis

Under the null hypothesis that the variance-covariance matrices for the two populations are equal, the natural logs of the determinants of the variance-covariance matrices should be approximately the same for the fake and the real notes.

The results of Bartlett's Test are on the bottom of page two of the output. The test statistic is 121.90 with 21 degrees of freedom; recall that p = 6, so p(p + 1)/2 = 21. The p-value for the test is less than 0.0001, indicating that we reject the null hypothesis.

The conclusion here is that the two populations of bank notes have different variance-covariance matrices in at least one of their elements. This is backed up by the evidence given by the test statistic $\left( L ^ { \prime } = 121.899 ; \mathrm { d.f. } = 21 ; p < 0.0001 \right)$. Therefore, the assumption of homogeneous variance-covariance matrices is violated.
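
The reported degrees of freedom and p-value can be checked with a short sketch in plain Python. The helper `chi2_sf` is a hypothetical name; it evaluates the chi-square upper-tail probability for integer degrees of freedom using the standard finite-sum closed forms. The level $\alpha = 0.05$ is an assumption made for illustration, since the output only reports p < 0.0001.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with
    a positive integer number of degrees of freedom."""
    if df < 1 or df != int(df):
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{j=0}^{df/2 - 1} y^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= y / j
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_j y^(j + 1/2) / Gamma(j + 3/2)
    total = 0.0
    term = math.sqrt(y) / math.gamma(1.5)
    for j in range((df - 1) // 2):
        total += term
        term *= y / (j + 1.5)
    return math.erfc(math.sqrt(y)) + math.exp(-y) * total


p = 6                      # number of variables, from the text
df = p * (p + 1) // 2      # unique elements of a symmetric p x p matrix
l_prime = 121.899          # test statistic reported in the output
alpha = 0.05               # assumed significance level (not in the source)

p_value = chi2_sf(l_prime, df)
reject = p_value < alpha
print(df, p_value, reject)
```

The p-value computed this way is far below 0.0001, in agreement with the output, so the null hypothesis is rejected at any conventional level.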

Notes

One should be aware that, even though Hotelling's T-square test is robust to violations of the assumption of multivariate normality, the results of Bartlett's test are not. Bartlett's test should not be used if there is any indication that the data are not multivariate normally distributed.

In general, the two-sample Hotelling's T-square test is sensitive to violations of the assumption of homogeneity of variance-covariance matrices. This is especially the case when the sample sizes are unequal, i.e., $n_{1} \ne n_{2}$. If the sample sizes are equal, the test is much less sensitive to this violation, and the ordinary two-sample Hotelling's T-square test can be used as usual.
