This section is dedicated to studying the appropriateness of the model: do the model assumptions hold? We check them via various diagnostics, such as assessing the distribution of the residuals.
Let us begin with a brief review of linear regression diagnostics. The standard linear regression model is given by:
\(y_i \sim N(\mu_i,\sigma^2)\)
\(\mu_i =x_i^T \beta\)
The two crucial features of this model are the mean model, which says that \(\mu_i\) is a linear function of the covariates, and the constant-variance assumption, which says that \(V(y_i)=\sigma^2\) for all \(i\).
The most common diagnostic tool is the residuals, the differences between the observed and fitted values of the dependent variable. There are other useful regression diagnostics, e.g. measures of leverage and influence, but for now our focus will be on the estimated residuals.
The most common way to check these assumptions is to fit the model and then plot the residuals versus the fitted values \(\hat{y}_i=x_i^T \hat{\beta}\) .
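A minimal sketch of this diagnostic in R, on simulated data (the data and seed here are illustrative assumptions, not from the text):

```r
# Simulate data that satisfy the linear model assumptions, fit by OLS,
# and draw the standard residuals-versus-fitted diagnostic plot.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)      # true mean linear in x, constant variance
fit <- lm(y ~ x)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)            # residuals should scatter evenly about 0
```

When the assumptions hold, the points form a horizontal band around zero with no curvature or fanning.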
If the plot of residuals versus fitted values shows curvature, it suggests a failure of the mean model; the true relationship between \(\mu_i\) and the covariates might not be linear. If instead the spread of the residuals changes as we move from left to right, then the variance \(V(y_i)\) is not constant but changes as the mean \(\mu_i\) changes.
The logistic regression model says that the mean of \(y_i\) is
\(\mu_i=n_i \pi_i\)
where
\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \beta\)
and that the variance of \(y_i\) is
\(V(y_i)=n_i \pi_i(1-\pi_i)\).
After fitting the model, we can calculate the Pearson residuals [2]
\(r_i=\dfrac{y_i-\hat{\mu}_i}{\sqrt{\hat{V}(y_i)}}=\dfrac{y_i-n_i\hat{\pi}_i}{\sqrt{n_i\hat{\pi}_i(1-\hat{\pi}_i)}}\)
or the deviance residuals [2]. If the \(n_i\)'s are "large enough", these act something like standardized residuals in linear regression. To see what's happening, we can plot them against the linear predictors,
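The Pearson formula above can be computed directly; the counts and fitted probabilities below are made-up illustrative values, not from the text:

```r
# Pearson residuals for grouped binomial data, from the formula
# r_i = (y_i - n_i * pihat_i) / sqrt(n_i * pihat_i * (1 - pihat_i)).
y     <- c(4, 7, 12)           # successes (toy values)
n     <- c(10, 10, 15)         # trials
pihat <- c(0.35, 0.60, 0.85)   # fitted probabilities (assumed)
r <- (y - n * pihat) / sqrt(n * pihat * (1 - pihat))
round(r, 3)
```

For a fitted `glm` object in R, `residuals(fit, type = "pearson")` and `residuals(fit, type = "deviance")` return the Pearson and deviance residuals directly.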
\(\hat{\eta}_i=\text{log}\left(\dfrac{\hat{\pi}_i}{1-\hat{\pi}_i}\right)=x_i^T \hat{\beta}\)
which are the estimated log-odds of success, for cases \(i = 1, \ldots, N\).
If this plot shows a curved trend, it suggests that the mean model
\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \beta\) (2)
has been misspecified in some fashion. That is, it could be that one or more important covariates do not influence the log-odds of success in a linear fashion. For example, it could be that a covariate \(X\) ought to be replaced by a transformation such as \(\sqrt{X}\) or \(\log X\), or by a pair of variables \(X\) and \(X^2\), etc.
To see whether individual covariates ought to enter in the logit model in a nonlinear fashion, we could plot the empirical logits
\(\text{log}\left(\dfrac{y_i+1/2}{n_i-y_i+1/2}\right)\) (3)
versus each covariate in the model.
Changing the link function will change the interpretation of the coefficients entirely; the \(\beta_j\)'s will no longer be log-odds ratios. But, depending on what the link function is, they might still have a nice interpretation. For example, in a model with a log link, \(\text{log}\,\pi_i=x_i^T \beta\), an exponentiated coefficient \(\exp(\beta_j)\) becomes a relative risk.
Hinkley (1985) suggests a nice, easy test of whether the link function is plausible: fit the model, save the estimated linear predictors
\(\hat{\eta}_i=x_i^T \hat{\beta}\)
square them, and refit the model with \(\hat{\eta}_i^2\) included as an additional covariate.
A significant result indicates that the link function is misspecified. A nice feature of this test is that it applies even to ungrouped data (\(n_i\)'s equal to one), for which residual plots are uninformative.
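A sketch of this test in R, on simulated grouped binomial data (the data, seed, and coefficients here are assumptions for illustration):

```r
# Link test sketch: simulate data that truly follow a logit link, fit the
# logistic model, then refit with the squared linear predictor added.
set.seed(2)
x <- seq(-2, 2, length.out = 20)
n <- rep(30, 20)
y <- rbinom(20, n, plogis(0.5 + 1.2 * x))       # true model is logistic

fit  <- glm(cbind(y, n - y) ~ x, family = binomial)
eta2 <- predict(fit, type = "link")^2           # squared linear predictor
fit2 <- glm(cbind(y, n - y) ~ x + eta2, family = binomial)
summary(fit2)$coefficients["eta2", ]            # Wald test for the added term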
Suppose that the residual plot shows nonconstant variance as we move from left to right. Another way to detect nonconstancy of variance is to plot the absolute values of the residuals versus the linear predictors and look for a nonzero slope.
Nonconstant variance in the Pearson residuals means that the assumed form of the variance function,
\(V(y_i)=n_i \pi_i (1-\pi_i)\)
is wrong and cannot be corrected by simply introducing a scale factor for overdispersion. Overdispersion and changes to the variance function will be discussed later.
The SAS online help documentation provides the following quantal assay dataset. In this table, \(x_i\) refers to the log-dose, \(n_i\) is the number of subjects exposed, and \(y_i\) is the number who responded.
x_i    y_i    n_i
2.68    10     31
2.76    17     30
2.82    12     31
2.90     7     27
3.02    23     26
3.04    22     30
3.13    29     31
3.20    29     30
3.21    23     30
If we fit a simple logistic regression model, we will find that the coefficient for \(x_i\) is highly significant, but the model doesn't fit. The plot of Pearson residuals versus the fitted values resembles a horizontal band, with no obvious curvature or trends in the variance. This seems to be a classic example of overdispersion.
Since there's only a single covariate, a good place to start is to plot the empirical logits as defined in equation (3) above versus X.
This is basically the same thing as a scatterplot of \(Y\) versus \(X\) in the context of ordinary linear regression. The plot becomes more meaningful as the \(n_i\)'s grow. With ungrouped data (all \(n_i = 1\)), the empirical logits can take only two possible values, \(\log(1/3)\) and \(\log 3\), and the plot will not be very useful.
Here is the SAS program file assay.sas [3]:
The relationship between the logits and X seems linear. Let's fit the logistic regression and see what happens.
See assay1.sas [4]:
In the statement plot reschi * xb, reschi denotes the Pearson residuals and xb the linear predictor.
The output reveals that the coefficient of X is highly significant, but the model does not fit.
Here is the R code assay.R [5] that corresponds to the SAS program assay1.sas:
With plot.lm(result), R will produce four diagnostic plots: a residuals-versus-fitted plot, a Q-Q plot, a scale-location plot, and a residuals-versus-leverage plot.
As with SAS, the output reveals that the coefficient of X is highly significant, but the model does not fit.
Here is the residual plot from R output:
The residual plots from both SAS and R above do not show any obvious curvature or trends in the variance, and there are no other predictors that are good candidates for inclusion in the model. (It can be shown that a quadratic term for log-concentration will not improve the fit.) So it is quite reasonable to attribute the problem to overdispersion.
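The parenthetical claim about the quadratic term can be checked directly; a sketch in R, reusing the tabulated assay data:

```r
# Likelihood-ratio test for a quadratic log-dose term, using the assay data.
x <- c(2.68, 2.76, 2.82, 2.90, 3.02, 3.04, 3.13, 3.20, 3.21)
y <- c(10, 17, 12, 7, 23, 22, 29, 29, 23)
n <- c(31, 30, 31, 27, 26, 30, 31, 30, 30)

fit1 <- glm(cbind(y, n - y) ~ x,            family = binomial)
fit2 <- glm(cbind(y, n - y) ~ x + I(x^2),   family = binomial)
anova(fit1, fit2, test = "Chisq")   # LRT: does the quadratic term help?
```

A nonsignificant likelihood-ratio test here supports attributing the lack of fit to overdispersion rather than to a misspecified mean model.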
Links:
[2] https://onlinecourses.science.psu.edu/stat504/node/220
[3] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/assay.sas
[4] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/assay1.sas
[5] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/assay.R