6.2 - Binary Logistic Regression with a Single Categorical Predictor


Key Concepts

  • Binary Logistic Regression for 2 × J tables
  • Model Fit and Parameter Estimation & Interpretation
  • Link to test of independence 


Objectives

  • Understand the basic ideas behind modeling categorical data with binary logistic regression.
  • Understand how to fit the model and interpret the parameter estimates, especially in terms of odds and odds ratios.


Binary logistic regression estimates the probability that a characteristic is present (i.e., the probability of "success") given the values of explanatory variables, in this case a single categorical variable: π = Pr(Y = 1|X = x). Suppose a physician is interested in estimating the proportion of diabetic persons in a population. Naturally, she knows that not all sections of the population have an equal probability of "success", i.e., of being diabetic. Older people, people with hypertension, and individuals with a family history of diabetes are more likely to have diabetes. Let the predictor variable X be any one of the risk factors that might contribute to the disease. The probability of success will then depend on the level of the risk factor.


Variables:

  • Let Y be a binary response variable
  • Yi = 1 if the trait is present in observation (person, unit, etc.) i
    Yi = 0 if the trait is NOT present in observation i

  • Let X = (X1, X2, ..., Xk) be a set of explanatory variables, which can be discrete, continuous, or a combination. xi is the observed value of the explanatory variables for observation i. In this section of the notes, we focus on a single variable X.


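Model:

\[\pi_i=Pr(Y_i=1|X_i=x_i)=\dfrac{\text{exp}(\beta_0+\beta_1 x_i)}{1+\text{exp}(\beta_0+\beta_1 x_i)}\]

or, equivalently,

\[\begin{aligned}
\text{logit}(\pi_i) &= \text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)\\
&= \beta_0+\beta_1 x_i \qquad \text{(single predictor)}\\
&= \beta_0+\beta_1 x_{i1} + \ldots + \beta_k x_{ik} \qquad \text{(} k \text{ predictors, in general)}
\end{aligned}\]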
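
As a quick numerical illustration of the two equivalent forms of the model, the following minimal Python sketch converts between the probability scale and the log-odds scale. The function names `logit` and `inv_logit` and the coefficient values are illustrative assumptions; they are not part of the course materials.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of the logit: maps a linear predictor eta to a probability."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -1.0, 0.5

for x in (0, 1, 2):
    eta = beta0 + beta1 * x   # linear predictor on the log-odds scale
    pi = inv_logit(eta)       # probability of "success" at this x
    # logit(pi) recovers eta, showing that the two forms describe the same model
    print(x, round(eta, 3), round(pi, 3), round(logit(pi), 3))
```

Each printed row shows the same linear predictor twice: once as computed directly and once recovered from the probability.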


Assumptions:

  • The data Y1, Y2, ..., Yn are independently distributed, i.e., cases are independent.
  • The distribution of Yi is Bin(ni, πi), i.e., the binary logistic regression model assumes a binomial distribution for the response. The dependent variable does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal, ...).
  • The model does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the logit of the response and the explanatory variables: logit(π) = β0 + β1X.
  • Independent (explanatory) variables can even be power terms or other nonlinear transformations of the original independent variables.
  • Homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure.
  • Errors need to be independent but NOT normally distributed.
  • It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.
  • Goodness-of-fit measures rely on sufficiently large samples; a heuristic rule is that no more than 20% of the expected cell counts should be less than 5.

Model Fit:

  • Overall goodness-of-fit statistics of the model; we will consider:
    1. Pearson chi-square statistic, X²
    2. Deviance, G², and likelihood ratio test and statistic, ΔG²
    3. Hosmer-Lemeshow test and statistic
  • Residual analysis: Pearson, deviance, adjusted residuals, etc...
  • Overdispersion

Parameter Estimation:

The maximum likelihood estimator (MLE) for (β0, β1) is obtained by finding \((\hat{\beta}_0,\hat{\beta}_1)\) that maximizes:

\(L(\beta_0,\beta_1)=\prod\limits_{i=1}^N \pi_i^{y_i}(1-\pi_i)^{n_i-y_i}=\prod\limits_{i=1}^N \dfrac{\text{exp}\{y_i(\beta_0+\beta_1 x_i)\}}{\{1+\text{exp}(\beta_0+\beta_1 x_i)\}^{n_i}}\)

In general, there are no closed-form solutions, so the ML estimates are obtained by using iterative algorithms such as Newton-Raphson (NR), or Iteratively re-weighted least squares (IRWLS). In Agresti (2013), see section 4.6.1 for GLMs, and for logistic regression, see sections 5.5.4-5.5.5.
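
To make the likelihood concrete, here is a minimal Python sketch that evaluates the log of the likelihood above for grouped data. The data values and the function name `loglik` are hypothetical. The example uses one binary predictor, a special case in which the maximizer can be written down directly, so the function can be checked without an iterative algorithm.

```python
import math

def loglik(beta0, beta1, data):
    """Log-likelihood kernel: sum over groups of y*eta - n*log(1 + exp(eta)).

    data is a list of (x, y, n) tuples: covariate value, number of
    successes, number of trials.
    """
    total = 0.0
    for x, y, n in data:
        eta = beta0 + beta1 * x
        total += y * eta - n * math.log(1.0 + math.exp(eta))
    return total

# Hypothetical grouped data with a binary predictor: (x, y, n).
data = [(0, 20, 100), (1, 45, 100)]

# With two groups and two parameters the model reproduces the observed
# proportions, so the MLE equals the sample log-odds in each group.
b0_hat = math.log(20 / 80)
b1_hat = math.log(45 / 55) - b0_hat

print(loglik(b0_hat, b1_hat, data))        # value at the MLE
print(loglik(b0_hat + 0.1, b1_hat, data))  # a nearby point gives a smaller value
```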

More Advanced material (not required):

Brief overview of Newton-Raphson (NR).  Suppose we want to maximize a loglikelihood $l(\theta)$ with respect to a parameter $\theta=(\theta_1,\ldots,\theta_p)^T$. At each step of NR, the current estimate $\theta^{(t)}$ is updated as

\[\theta^{(t+1)} \,=\,\theta^{(t)} \,+\,
\left[ -l^{\prime\prime}(\theta^{(t)})\right]^{-1}\,
l^\prime(\theta^{(t)}),\]
where $l^\prime(\theta)$ is the vector of first derivatives $ l^\prime(\theta)\,=\,\left(\partial l/\partial\theta_1,\ldots,\partial l/\partial\theta_p\right)^T$ (also called the score vector), and $l^{\prime\prime}(\theta)$ is the matrix of second derivatives (also called the Hessian). That is, $l^{\prime\prime}(\theta)$ is the $p\times p$ matrix with $(j,k)$th element equal to $\partial^2 l/\partial\theta_j\partial\theta_k$. We repeat this step until convergence, when $\theta^{(t+1)}\approx\theta^{(t)}$. Upon convergence, the inverse of the negative Hessian provides an estimated covariance matrix,
\[\hat{V}(\hat{\theta}) \,=\,
\left[- l^{\prime\prime}(\hat{\theta})\right]^{-1}\]
(the inverse of the "observed information"), whose diagonal elements give the estimated variances of the parameter estimates; their square roots are the standard errors.
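
The update rule can be sketched in Python for the simplest possible case, a single parameter. As an illustrative assumption, the sketch maximizes the loglikelihood of one binomial count y out of n, written in terms of the log-odds θ, that is l(θ) = yθ − n log(1 + e^θ). The function name, starting value, tolerance, and data are all hypothetical.

```python
import math

def newton_raphson_logit(y, n, theta=0.0, tol=1e-10, max_iter=50):
    """Maximize l(theta) = y*theta - n*log(1 + exp(theta)) by Newton-Raphson.

    theta is the log-odds of success for a single binomial count y out of n.
    Returns the estimate and its estimated variance.
    """
    for _ in range(max_iter):
        pi = math.exp(theta) / (1.0 + math.exp(theta))
        score = y - n * pi               # first derivative l'(theta)
        hessian = -n * pi * (1.0 - pi)   # second derivative l''(theta)
        step = score / (-hessian)        # [-l'']^{-1} l'
        theta += step
        if abs(step) < tol:              # theta^(t+1) is close to theta^(t)
            break
    pi = math.exp(theta) / (1.0 + math.exp(theta))
    variance = 1.0 / (n * pi * (1.0 - pi))  # inverse of the observed information
    return theta, variance

# Hypothetical data: 30 successes in 100 trials.
theta_hat, var_hat = newton_raphson_logit(30, 100)
print(theta_hat, math.sqrt(var_hat))  # estimate and its standard error
```

In this one-parameter case the answer is also available in closed form, log(y/(n − y)), which makes it easy to confirm that the iteration converges to the right value.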

Applying NR to logistic regression. For the logit model with $p-1$ predictors $X$, $\beta=(\beta_0, \beta_1, \ldots, \beta_{p-1})$, the likelihood is

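\begin{eqnarray}
L(\beta) & = &
\prod_{i=1}^N \frac{n_i!}{y_i!(n_i-y_i)!}\,
\pi_i^{y_i}(1-\pi_i)^{n_i-y_i}\\
& \propto & \prod_{i=1}^N
\left(\frac{\pi_i}{1-\pi_i}\right)^{y_i}(1-\pi_i)^{n_i}\\
& \propto & \prod_{i=1}^N
e^{x_i^T\!\beta y_i}\,
\left(1+e^{x_i^T\!\beta}\right)^{-n_i},
\end{eqnarray}

so the loglikelihood is, up to an additive constant, $
l(\beta) \,=\,
\sum_{i=1}^N x_i^T\!\beta y_i \, -\,
\sum_{i=1}^N n_i\,\log\left(1 + e^{x_i^T\!\beta}\right).$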

The first derivative of $x_i^T\beta$ with respect to $\beta_j$ is $x_{ij}$, thus
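\begin{eqnarray}
\frac{\partial l}{\partial\beta_j} & = &
\sum_{i=1}^N y_i x_{ij} \, - \,
\sum_{i=1}^N n_i
\left(\frac{e^{x_i^T\!\beta}}{1+e^{x_i^T\!\beta}}\right)x_{ij}\\
& = &
\sum_{i=1}^N (y_i - \mu_i)x_{ij},
\end{eqnarray}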
where $\mu_i = E(y_i) = n_i\pi_i$.

The second derivatives used in computing the standard errors of the parameter estimates, $\hat{\beta}$, are

\begin{eqnarray} \frac{\partial^2 l}{\partial\beta_j\partial\beta_k}
& = &
-\sum_{i=1}^N n_ix_{ij}\,\frac{\partial}{\partial\beta_k}
\left(\frac{e^{x_i^T\!\beta}}{1+e^{x_i^T\!\beta}}\right)\\
& = &
-\sum_{i=1}^N n_i \pi_i(1-\pi_i)x_{ij}x_{ik}.\end{eqnarray}
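
The last equality uses the derivative of the logistic function. As a short worked step, write $\eta_i = x_i^T\beta$ (the symbol $\eta_i$ is introduced here only for convenience), so that $\partial\eta_i/\partial\beta_k = x_{ik}$ and $\pi_i = e^{\eta_i}/(1+e^{\eta_i})$. By the quotient rule,

\[
\frac{\partial}{\partial\beta_k}\left(\frac{e^{\eta_i}}{1+e^{\eta_i}}\right)
= \frac{e^{\eta_i}(1+e^{\eta_i}) - e^{\eta_i}e^{\eta_i}}{(1+e^{\eta_i})^2}\,x_{ik}
= \frac{e^{\eta_i}}{1+e^{\eta_i}}\cdot\frac{1}{1+e^{\eta_i}}\,x_{ik}
= \pi_i(1-\pi_i)\,x_{ik}.
\]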

For reasons including numerical stability and speed, it is generally advisable to avoid computing matrix inverses directly.  Thus in many implementations, clever methods are used to obtain the required information without directly constructing the inverse, or even the Hessian.
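
Putting the score and the second derivatives together, a minimal Newton-Raphson fit for the model with an intercept and one predictor might look like the Python sketch below. The function names, starting values, tolerance, and data are hypothetical. Because there are only two parameters, the sketch solves the 2 × 2 system explicitly for readability; this is an illustration and not how production software is organized.

```python
import math

def score_and_info(b0, b1, data):
    """Score vector and observed information for logit(pi_i) = b0 + b1*x_i.

    data: list of (x, y, n) tuples (covariate, successes, trials).
    The information matrix [[a, b], [b, d]] is minus the Hessian.
    """
    g0 = g1 = a = b = d = 0.0
    for x, y, n in data:
        pi = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = n * pi * (1.0 - pi)
        g0 += y - n * pi          # derivative with respect to b0 (x_i0 = 1)
        g1 += (y - n * pi) * x    # derivative with respect to b1
        a += w
        b += w * x
        d += w * x * x
    return (g0, g1), (a, b, d)

def fit_logistic_nr(data, tol=1e-10, max_iter=25):
    """Newton-Raphson iterations starting from b0 = b1 = 0."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        (g0, g1), (a, b, d) = score_and_info(b0, b1, data)
        det = a * d - b * b
        # Solve information * step = score for the 2 x 2 case.
        s0 = (d * g0 - b * g1) / det
        s1 = (a * g1 - b * g0) / det
        b0, b1 = b0 + s0, b1 + s1
        if max(abs(s0), abs(s1)) < tol:
            break
    # Estimated covariance: inverse of the information at the final estimate.
    _, (a, b, d) = score_and_info(b0, b1, data)
    det = a * d - b * b
    cov = [[d / det, -b / det], [-b / det, a / det]]
    return (b0, b1), cov

# Hypothetical grouped data: (x, successes, trials).
data = [(0, 5, 50), (1, 12, 50), (2, 22, 50), (3, 35, 50)]

(b0_hat, b1_hat), cov = fit_logistic_nr(data)
print(b0_hat, b1_hat)                                # parameter estimates
print(math.sqrt(cov[0][0]), math.sqrt(cov[1][1]))    # standard errors
```

At convergence the score vector is numerically zero, which is the defining property of the maximum likelihood estimate.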


Interpretation of Parameter Estimates:

  • exp(β0) = the odds that the characteristic is present in an observation i when Xi = 0, i.e., at baseline.
  • exp(β1) = the factor by which the odds that the characteristic is present are multiplied for every unit increase in Xi1. This is similar to simple linear regression, but instead of an additive change in the mean response it is a multiplicative change in the odds. This is an estimated odds ratio.

\(\dfrac{\text{exp}(\beta_0+\beta_1(x_{i1}+1))}{\text{exp}(\beta_0+\beta_1 x_{i1})}=\text{exp}(\beta_1)\)

In general, the logistic model stipulates that the effect of a covariate on the chance of "success" is linear on the log-odds scale, or multiplicative on the odds scale.

  • If βj > 0, then exp(βj) > 1, and the odds increase.
  • If βj < 0, then exp(βj) < 1, and the odds decrease.
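
The interpretation above can be checked numerically. The following Python sketch uses hypothetical coefficient values (not estimates from any data set in these notes) to show that exp(β0) is the baseline odds and that a one-unit increase in x multiplies the odds by exp(β1), whatever the starting value of x.

```python
import math

# Hypothetical fitted coefficients, for illustration only.
beta0, beta1 = -1.5, 0.4

def odds(x):
    """Odds of "success" at covariate value x under the logit model."""
    return math.exp(beta0 + beta1 * x)

baseline_odds = odds(0)          # equals exp(beta0)
odds_ratio = odds(3) / odds(2)   # one-unit increase; equals exp(beta1)

print(baseline_odds, math.exp(beta0))
print(odds_ratio, math.exp(beta1))
```

Since β1 > 0 in this example, the odds ratio is greater than 1 and the odds increase with x.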

Inference for Logistic Regression:

  • Confidence Intervals for parameters
  • Hypothesis testing
  • Distribution of probability estimates


Example - Student Smoking

The table below classifies 5375 high school students according to the smoking behavior of the student (Z) and the smoking behavior of the student's parents (Y). We saw this example in Lesson 3 (Measuring Associations in I × J tables, smokeindep.sas and smokeindep.R).

[Table: 3 × 2 cross-classification of "How many parents smoke?" (rows: Both (Y = 1), One (Y = 2), Neither (Y = 3)) by "Student smokes?" (columns: Yes (Z = 1), No (Z = 2)).]

The test for independence yields X² = 37.6 and G² = 38.4 with 2 df (p-values are essentially zero), so Y and Z are related. It is natural to think of Z as a response and Y as a predictor in this example. We might be interested in exploring the dependency of student's smoking behavior on neither parent smoking versus at least one parent smoking. Thus we can combine or collapse the first two rows of our 3 × 2 table and look at a new 2 × 2 table:

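[Table: 2 × 2 cross-classification of parents' smoking (rows: 1–2 parents smoke, Neither parent smokes) by student's smoking (columns: Student smokes, Student does not smoke).]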
For the chi-square test of independence, this table has X² = 27.7, G² = 29.1, p-value ≈ 0, and \(\hat{\theta} = 1.58\). Therefore, we estimate that the odds of smoking are 58% higher for a student with at least one smoking parent than for a student whose parents do not smoke.
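
The statistics for the 2 × 2 table can be recomputed with a short Python sketch. The cell counts are not shown in this copy of the table, so the counts used below are an assumption: they are the values usually quoted for this data set, they sum to 5375, and they should reproduce the statistics quoted above to the stated precision.

```python
import math

# ASSUMPTION: cell counts as usually quoted for this data set; they are
# not given in the text above. Columns: [student smokes, student does not].
table = [[816, 3203],   # 1-2 parents smoke
         [188, 1168]]   # neither parent smokes

n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

x2 = 0.0   # Pearson chi-square statistic
g2 = 0.0   # deviance (likelihood-ratio) statistic
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / n   # expected count under independence
        observed = table[i][j]
        x2 += (observed - expected) ** 2 / expected
        g2 += 2.0 * observed * math.log(observed / expected)

# Sample odds ratio: odds of smoking with 1-2 smoking parents vs. none.
odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])

print(n, round(x2, 1), round(g2, 1), round(odds_ratio, 2))
```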

But what if:

  • we want to model the "risk" of student smoking as a function of parents' smoking behavior.
  • we want to describe the differences between student smokers and nonsmokers as a function of parents' smoking behavior via descriptive discriminant analysis.
  • we want to predict probabilities that individuals fall into the two categories of the binary response as a function of some explanatory variables, e.g., what is the probability that a student is a smoker given that neither of his/her parents smokes.
  • we want to classify new students into the "smoking" or "nonsmoking" group depending on their parents' smoking behavior.
  • we want to develop a social network model, adjust for "bias", analyze choice data, etc.

These are just some of the possibilities of logistic regression, which cannot be handled within a framework of goodness-of-fit only.

Consider the simplest case:

  • Yi binary response, and Xi binary explanatory variable
  • link to 2 × 2 tables and chi-square test of independence

Arrange the data in our running example like this,

[Table: one row for each group (1–2 parents smoke; Neither parent smokes), giving yi and ni.]

where yi is the number of children who smoke, ni is the number of children for a given level of parents' smoking behavior, and πi = P(Y = 1|xi) is the probability that a randomly chosen student in group i smokes, given the parents' smoking status. Here i takes values 1 and 2. Thus, we assume that all students who have at least one smoking parent have the same probability of "success", and all students who have neither parent smoking have the same probability of "success", although the success probabilities may differ between the two groups. Then the response variable Y has a binomial distribution:

\(y_i \sim Bin(n_i,\pi_i)\)

Explanatory variable X is a dummy variable such that

Xi = 0 if neither parent smokes,
Xi = 1 if at least one parent smokes.

Understanding the use of dummy variables is important in logistic regression.

Think about the following question.

Can you explain to someone what is meant by a "dummy variable"?

Then the logistic regression model can be expressed as:

\(\text{logit}(\pi_i)=\text{log}\dfrac{\pi_i}{1-\pi_i}=\beta_0+\beta_1 X_i\) (1) or 

\(\pi_i=\dfrac{\text{exp}(\beta_0+\beta_1 x_i)}{1+\text{exp}(\beta_0+\beta_1 x_i)}\)  (2)

The model says that the log-odds that a randomly chosen student smokes is β0 when neither parent smokes (Xi = 0), and β0 + β1 when at least one parent smokes (Xi = 1).
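
With a single dummy variable, the two parameters can be obtained directly from the two observed proportions, since the model has as many parameters as groups. The Python sketch below illustrates this. The grouped counts (yi, ni) are an assumption: they are the values usually quoted for this data set and are not shown in this copy of the notes.

```python
import math

# ASSUMPTION: grouped counts as usually quoted for this data set.
# Key is the dummy variable x; value is (y, n) = (students who smoke, students).
groups = {
    0: (188, 1356),   # x = 0: neither parent smokes
    1: (816, 4019),   # x = 1: at least one parent smokes
}

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# beta0 is the log-odds at x = 0; beta1 is the change in log-odds at x = 1.
beta0 = logit(groups[0][0] / groups[0][1])
beta1 = logit(groups[1][0] / groups[1][1]) - beta0

for x, (y, n) in groups.items():
    eta = beta0 + beta1 * x                    # log-odds from equation (1)
    pi = math.exp(eta) / (1.0 + math.exp(eta)) # probability from equation (2)
    print(x, round(eta, 4), round(pi, 4), round(y / n, 4))

print(round(math.exp(beta1), 2))  # estimated odds ratio
```

The fitted probabilities equal the observed proportions in each group, and exp(β1) equals the sample odds ratio from the 2 × 2 table, which is the link between this model and the test of independence.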