Lesson 12: Statistical Methods (2): Logistic Regression, Poisson Regression


Objective:

By reading this material and following the embedded links, the student will increase their familiarity with statistical methods used in epidemiology, particularly logistic regression, Poisson regression, and effect modification.

Logistic Regression

Logistic regression describes the relationship between a categorical outcome (response variable) and a set of covariates (predictor variables). The categorical outcome may be binary (e.g., presence or absence of disease) or ordinal (e.g., normal, mild, and severe). The predictor variables may be continuous or categorical. For example, consider modeling the presence or absence of coronary heart disease (CHD) using age as a predictor variable.

Table 1. Age and presence of coronary heart disease (CHD)


Q:  Many students are familiar with linear regression. Would linear regression model these data well? Why or why not?

A: No. With the responses limited to 0 or 1, the error terms are not normally distributed, nor is the error variance constant. A plot of the data does not resemble a regression line; the points fall along two horizontal bands, one at 0 and another at 1.

Error terms:      If \(Y_i = 1 \Rightarrow  \epsilon_i = 1 - \beta_0 - \beta_1 x_i\)
                          If \(Y_i = 0 \Rightarrow  \epsilon_i = - \beta_0 - \beta_1 x_i\)
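The problem can be seen directly by fitting ordinary least squares to a 0/1 outcome. Below is a minimal numpy sketch using illustrative age/CHD values (hypothetical, not the course's Table 1 data): the fitted "probabilities" escape the [0, 1] interval at the extremes of age.

```python
# Sketch: why ordinary least squares fits 0/1 outcomes poorly.
# The age/CHD values below are hypothetical, for illustration only.
import numpy as np

age = np.array([22, 27, 34, 41, 48, 55, 61, 68], dtype=float)
chd = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Fit y = b0 + b1*x by least squares.
X = np.column_stack([np.ones_like(age), age])
b0, b1 = np.linalg.lstsq(X, chd, rcond=None)[0]

fitted = b0 + b1 * age
print(fitted.round(2))
# Some fitted values fall below 0 and others exceed 1,
# which is impossible for a probability.
print(fitted.min() < 0, fitted.max() > 1)
```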

Instead of using the 0/1 responses, let's consider the proportion of individuals with CHD by age group.

Table 2. Prevalence (%) of CHD by age group


Plot of the data from Table 2 (dot plot of prevalence by age group)

The plot of the proportions follows a curvilinear pattern which can be modeled using logistic regression. The logistic regression model satisfies the constraint 

\[0 \le E(Y) = \pi \le 1\]

The binomial distribution, instead of the normal distribution, is used to describe the distribution of the errors in the logistic model. 

\[\begin{align} \sigma^2_{\epsilon_i} &= \pi_i(1-\pi_i) \\
&= E(Y_i)(1-E(Y_i)) \\
&= (\beta_0 + \beta_1 x_i)(1-\beta_0 - \beta_1 x_i) \end{align}\]

Logistic function 


The logistic function models the conditional probability of the response.

Logistic transformation

\[P(y|x) =\frac{e^{\alpha+\beta x}}{1+ e^{\alpha+\beta x}}\]

\[ln\left[\frac{P(y|x)}{1-P(y|x)} \right]=\alpha+\beta x\]

where \(ln\left[\frac{P(y|x)}{1-P(y|x)} \right]\) is the logit of \(P(y|x)\).

Taking the logarithm of the odds, called the logit, yields an expression that is linear in the parameters and thus resembles a linear regression model.
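These two transformations are exact inverses of one another. A short sketch, using arbitrary illustrative values of \(\alpha\) and \(\beta\): the logistic function maps any \(\alpha + \beta x\) into (0, 1), and applying the logit recovers \(\alpha + \beta x\) exactly.

```python
# Sketch of the logistic transformation and its inverse, the logit.
# alpha and beta are arbitrary illustrative coefficients.
import math

alpha, beta = -5.0, 0.11

def logistic(x):
    """P(y|x) = e^(a+bx) / (1 + e^(a+bx)); always lies in (0, 1)."""
    return math.exp(alpha + beta * x) / (1.0 + math.exp(alpha + beta * x))

def logit(p):
    """ln(p / (1 - p)), the log odds; unbounded on (-inf, +inf)."""
    return math.log(p / (1.0 - p))

p = logistic(50)            # probability at x = 50
print(round(p, 4))
print(round(logit(p), 4))   # recovers alpha + beta*50 = 0.5
```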

Advantages of the Logit

  • Allows properties of a linear regression model to be exploited
  • The logit itself can take values between - ∞ and + ∞
  • Probability remains constrained between 0 and 1
  • The logit can be directly related to odds of disease

\[ln\left( \frac{P}{1-P} \right)=\alpha+\beta x\]

\[\frac{P}{1-P}=e^{\alpha+\beta x}\]

Interpretation of coefficient \(\beta\)

The probabilities for an individual to fall into the categories of exposure to a risk factor (x) and presence or absence of disease (y) are defined below:

                      Exposed (x = 1)           Unexposed (x = 0)
  Disease: yes        \(P( y | x = 1 )\)        \(P( y | x = 0 )\)
  Disease: no         \(1 - P( y | x = 1 )\)    \(1 - P( y | x = 0 )\)

\[\frac{P}{1-P}=e^{\alpha+\beta x}\]

\[\begin{align}Odds_{d|e} &= e^{\alpha + \beta}\\ Odds_{d|\bar{e}} &= e^{\alpha} \end{align}\]

\[\begin{align}OR &= \frac{e^{\alpha + \beta}}{e^{\alpha}} =e^{\beta} \\ ln(OR) &= \beta \end{align}\]

The odds of disease given exposure and the odds of disease among the unexposed are shown above. The odds ratio, comparing the odds of disease among the exposed to the odds of disease among the nonexposed, simplifies to \(e^{\beta}\).
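With a single binary exposure, this identity can be verified numerically: the logistic-regression slope is exactly the log of the 2×2-table odds ratio, because the maximum-likelihood fit reproduces the two observed proportions. A sketch with hypothetical counts:

```python
# Sketch: for one binary exposure, exp(beta) from logistic regression
# equals the odds ratio from the 2x2 table. Counts are hypothetical.
import math

# 2x2 table: rows = disease yes/no, columns = exposed / unexposed
a, b = 30, 10   # diseased:     exposed, unexposed
c, d = 70, 90   # disease-free: exposed, unexposed

# Odds ratio directly from the table (cross-product ratio)
OR_table = (a * d) / (b * c)

# Logistic fit: with binary x the MLE matches the observed proportions,
# so alpha = logit(P(y|x=0)) and alpha + beta = logit(P(y|x=1)).
logit = lambda p: math.log(p / (1 - p))
p1 = a / (a + c)    # P(y | exposed)
p0 = b / (b + d)    # P(y | unexposed)
beta = logit(p1) - logit(p0)

print(round(OR_table, 4), round(math.exp(beta), 4))  # the two agree
```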

  • \(\beta\) = the increase in the logarithm of the odds ratio for a one-unit increase in x
  • A Wald test can be used to test the hypothesis that \(\beta=0\)

\[\chi^2=\frac{\beta^2}{Variance(\beta)} \;\;\;\; (1df)\]

  • A confidence interval for the OR can be calculated.

\[95\% \; CI \;\; \text{for} \;\; OR = e^{\beta \pm 1.96\, SE(\beta)}\]