6.2  Binary Logistic Regression with a Single Categorical Predictor

Overview
Binary logistic regression estimates the probability that a characteristic is present (e.g., the probability of "success") given the values of explanatory variables, in this case a single categorical variable; π = Pr(Y = 1 | X = x). Suppose a physician is interested in estimating the proportion of diabetic persons in a population. Naturally she knows that not all sections of the population have an equal probability of 'success', i.e., of being diabetic: older individuals, individuals with hypertension, and individuals with a family history of diabetes are more likely to be diabetic. Consider the predictor variable X to be any of the risk factors that might contribute to the disease. The probability of success then depends on the level of the risk factor.
Variables:
 Let Y be a binary response variable
 X = (X_{1}, X_{2}, ..., X_{k}) be a set of explanatory variables which can be discrete, continuous, or a combination. x_{i} is the observed value of the explanatory variables for observation i. In this section of the notes, we focus on a single variable X.
Y_{i} = 1 if the trait is present in observation (person, unit, etc...) i
Y_{i} = 0 if the trait is NOT present in observation i
Model:
\[\pi_i=Pr(Y_i=1\mid X_i=x_i)=\dfrac{\text{exp}(\beta_0+\beta_1 x_i)}{1+\text{exp}(\beta_0+\beta_1 x_i)}\]
or,
\[\begin{align}
\text{logit}(\pi_i)&=\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)\\
&= \beta_0+\beta_1 x_i
\end{align}\]
With k explanatory variables, this generalizes to \(\text{logit}(\pi_i)=\beta_0+\beta_1 x_{i1} + \ldots + \beta_k x_{ik}\).
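The two expressions above are inverses of each other: the logistic function maps log-odds back to probabilities. A minimal sketch (the coefficient values below are made up for illustration, not estimates from any data in these notes):

```python
import math

def inv_logit(eta):
    """Map log-odds eta to a probability via the logistic function."""
    return math.exp(eta) / (1 + math.exp(eta))

def logit(p):
    """Map a probability p to the log-odds scale."""
    return math.log(p / (1 - p))

# Illustrative coefficients (hypothetical values):
beta0, beta1 = -1.5, 0.8
x = 1
pi = inv_logit(beta0 + beta1 * x)  # P(Y = 1 | X = 1) under the model
print(pi)
print(logit(pi))                   # recovers beta0 + beta1 * x = -0.7
```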
Assumptions:
 The data Y_{1}, Y_{2}, ..., Y_{n} are independently distributed, i.e., cases are independent.
 Distribution of Y_{i} is Bin(n_{i}, π_{i}), i.e., the binary logistic regression model assumes a binomial distribution of the response. The dependent variable does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g., binomial, Poisson, multinomial, normal, ...)
 Does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume linear relationship between the logit of the response and the explanatory variables; logit(π) = β_{0} + βX.
 Independent (explanatory) variables can be even the power terms or some other nonlinear transformations of the original independent variables.
 The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure.
 Errors need to be independent but NOT normally distributed.
 It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.
 Goodness-of-fit measures rely on sufficiently large samples, where a heuristic rule is that not more than 20% of the expected cell counts are less than 5.
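The 20% heuristic above can be checked directly from the expected counts under independence. A small sketch, using the 3 × 2 student-smoking table from the example later in this lesson:

```python
# Heuristic check: flag if more than 20% of expected cell counts fall below 5.
# Counts are the 3x2 student/parent smoking table from the example below.
obs = [[400, 1380], [416, 1823], [188, 1168]]

row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
total = sum(row)

# Expected counts under independence: row total * column total / grand total
expected = [[row[i] * col[j] / total for j in range(len(col))]
            for i in range(len(row))]

small = sum(e < 5 for r in expected for e in r)
ncells = len(row) * len(col)
print(small / ncells <= 0.20)  # True: the large-sample heuristic is satisfied
```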
Model Fit:
 Overall goodness-of-fit statistics of the model; we will consider:
 Pearson chi-square statistic, X^{2}
 Deviance, G^{2}, and likelihood-ratio test statistic, ΔG^{2}
 Hosmer–Lemeshow test and statistic
 Residual analysis: Pearson, deviance, adjusted residuals, etc...
 Overdispersion
Parameter Estimation:
The maximum likelihood estimator (MLE) for (β_{0}, β_{1}) is obtained by finding \((\hat{\beta}_0,\hat{\beta}_1)\) that maximizes:
\(L(\beta_0,\beta_1)=\prod\limits_{i=1}^N \pi_i^{y_i}(1-\pi_i)^{n_i-y_i}=\prod\limits_{i=1}^N \dfrac{\text{exp}\{y_i(\beta_0+\beta_1 x_i)\}}{1+\text{exp}(\beta_0+\beta_1 x_i)}\)
In general, there are no closed-form solutions, so the ML estimates are obtained by using iterative algorithms such as Newton–Raphson (NR) or iteratively reweighted least squares (IRWLS). In Agresti (2013), see Section 4.6.1 for GLMs, and for logistic regression, see Sections 5.5.4–5.5.5.
More Advanced material (not required): Brief overview of Newton–Raphson (NR). Suppose we want to maximize a log-likelihood $l(\theta)$ with respect to a parameter vector $\theta=(\theta_1,\ldots,\theta_p)^T$. At each step of NR, the current estimate $\theta^{(t)}$ is updated as
\[
\theta^{(t+1)}=\theta^{(t)}-\left[\frac{\partial^2 l(\theta^{(t)})}{\partial\theta\,\partial\theta^T}\right]^{-1}\frac{\partial l(\theta^{(t)})}{\partial\theta}.
\]
Applying NR to logistic regression: for the logit model with $p-1$ predictors and $\beta=(\beta_0,\beta_1,\ldots,\beta_{p-1})^T$, the likelihood is
\[
L(\beta)=\prod_{i=1}^N \pi_i^{y_i}(1-\pi_i)^{n_i-y_i},\qquad \pi_i=\frac{\exp(x_i^T\beta)}{1+\exp(x_i^T\beta)},
\]
so the log-likelihood is
\[
l(\beta)=\sum_{i=1}^N \left[y_i\, x_i^T\beta - n_i\log\{1+\exp(x_i^T\beta)\}\right].
\]
The first derivative of $x_i^T\beta$ with respect to $\beta_j$ is $x_{ij}$, thus
\[
\frac{\partial l}{\partial\beta_j}=\sum_{i=1}^N (y_i-n_i\pi_i)x_{ij}.
\]
The second derivatives, used in computing the standard errors of the parameter estimates $\hat{\beta}$, are
\[
\frac{\partial^2 l}{\partial\beta_j\,\partial\beta_k}=-\sum_{i=1}^N n_i\pi_i(1-\pi_i)x_{ij}x_{ik}.
\]
For reasons including numerical stability and speed, it is generally advisable to avoid computing matrix inverses directly. Thus in many implementations, clever methods are used to obtain the required information without directly constructing the inverse, or even the Hessian.
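The NR iteration above can be sketched in a few lines. The following is a minimal illustration, not the implementation any particular package uses: it fits the two-group smoking example from later in this lesson (x = 1 if at least one parent smokes), solving the linear system at each step rather than forming an explicit inverse.

```python
import numpy as np

# Grouped data from the running example: x = 1 if at least one parent smokes
X = np.array([[1.0, 1.0],
              [1.0, 0.0]])      # columns: intercept, parent-smoking dummy
y = np.array([816.0, 188.0])    # number of student smokers per group
n = np.array([4019.0, 1356.0])  # group sizes

beta = np.zeros(2)              # starting value for Newton-Raphson
for _ in range(25):
    eta = X @ beta
    pi = np.exp(eta) / (1 + np.exp(eta))
    score = X.T @ (y - n * pi)            # first derivatives of l(beta)
    W = np.diag(n * pi * (1 - pi))        # binomial variances
    hess = X.T @ W @ X                    # negative second derivatives
    step = np.linalg.solve(hess, score)   # solve, avoiding an explicit inverse
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:      # stop once the update is negligible
        break

se = np.sqrt(np.diag(np.linalg.inv(hess)))  # standard errors from inverse Hessian
print(beta)  # beta0 = log(188/1168), beta1 = the log odds ratio
```

Because this model is saturated (two parameters, two groups), the fitted log-odds reproduce the observed group log-odds exactly.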
Interpretation of Parameter Estimates:
 exp(β_{0}) = the odds that the characteristic is present in an observation when X_{i} = 0, i.e., at baseline.
 exp(β_{1}) = for every unit increase in X_{i}, the odds that the characteristic is present are multiplied by exp(β_{1}). This is similar to simple linear regression, but the change is multiplicative on the odds scale rather than additive. This is an estimated odds ratio:
\(\dfrac{\text{exp}(\beta_0+\beta_1(x_i+1))}{\text{exp}(\beta_0+\beta_1 x_i)}=\text{exp}(\beta_1)\)
In general, the logistic model stipulates that the effect of a covariate on the chance of "success" is linear on the log-odds scale, or multiplicative on the odds scale.
 If β_{j} > 0, then exp(β_{j}) > 1, and the odds increase.
 If β_{j} < 0,then exp(β_{j}) < 1, and the odds decrease.
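The multiplicative interpretation can be verified numerically. A short sketch with hypothetical coefficient values (not fitted to any data in these notes):

```python
import math

# Hypothetical coefficients for illustration only:
beta0, beta1 = -2.0, 0.5

def odds(x):
    """Odds of 'success' at covariate value x under the logit model."""
    return math.exp(beta0 + beta1 * x)

# A one-unit increase in x multiplies the odds by exp(beta1),
# regardless of where the increase occurs:
ratio = odds(3) / odds(2)
print(ratio, math.exp(beta1))  # the two values are identical
```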
Inference for Logistic Regression:
 Confidence Intervals for parameters
 Hypothesis testing
 Distribution of probability estimates
_________________________
Example: Student Smoking
The table below classifies 5375 high school students according to the smoking behavior of the student (Z) and the smoking behavior of the student's parents (Y). We saw this example in Lesson 3 (Measuring Associations in I × J tables, smokeindep.sas and smokeindep.R).

How many parents smoke?    Student smokes?
                           Yes (Z = 1)    No (Z = 2)
Both (Y = 1)                   400           1380
One (Y = 2)                    416           1823
Neither (Y = 3)                188           1168
The test for independence yields X^{2} = 37.6 and G^{2} = 38.4 with 2 df (p-values are essentially zero), so Y and Z are related. It is natural to think of Z as a response and Y as a predictor in this example. We might be interested in exploring the dependency of student's smoking behavior on neither parent smoking versus at least one parent smoking. Thus we can combine or collapse the first two rows of our 3 × 2 table and look at a new 2 × 2 table:
                          Student smokes    Student does not smoke
1–2 parents smoke               816                  3203
Neither parent smokes           188                  1168
For the chi-square test of independence, this table has X^{2} = 27.7, G^{2} = 29.1, p-value ≈ 0, and estimated odds ratio \(\hat{\theta} = 1.58\). Therefore, we estimate that a student is 58% more likely, on the odds scale, to smoke if he or she has at least one smoking parent than if neither parent smokes.
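These statistics can be reproduced from the collapsed table. A minimal sketch computing Pearson's X², the deviance G², and the sample odds ratio:

```python
import math

# Collapsed 2x2 table: rows = parents' smoking (>=1, none),
#                      columns = student (smokes, does not smoke)
obs = [[816.0, 3203.0],
       [188.0, 1168.0]]

row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
total = sum(row)

# Expected counts under independence, then Pearson X^2 and deviance G^2
X2 = G2 = 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / total
        X2 += (obs[i][j] - e) ** 2 / e
        G2 += 2 * obs[i][j] * math.log(obs[i][j] / e)

# Sample (cross-product) odds ratio
odds_ratio = (obs[0][0] * obs[1][1]) / (obs[0][1] * obs[1][0])
print(round(X2, 1), round(G2, 1), round(odds_ratio, 2))  # 27.7 29.1 1.58
```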
But what if:
 we want to model the "risk" of student smoking as a function of parents' smoking behavior.
 we want to describe the differences between student smokers and nonsmokers as a function of parents' smoking behavior via descriptive discriminant analyses.
 we want to predict probabilities that individuals fall into the two categories of the binary response as a function of explanatory variables, e.g., the probability that a student is a smoker given that neither of his/her parents smokes.
 we want to classify new students into the "smoking" or "nonsmoking" group depending on parents' smoking behavior.
 we want to develop a social network model, adjust for "bias", analyze choice data, etc...
These are just some of the possibilities of logistic regression, which cannot be handled within a framework of goodness-of-fit only.
Consider the simplest case:
 Y_{i} binary response, and X_{i} binary explanatory variable
 link to 2 × 2 tables and the chi-square test of independence
Arrange the data in our running example like this,
                          y_{i}     n_{i}
1–2 parents smoke          816      4019
Neither parent smokes      188      1356
where y_{i} is the number of children who smoke, n_{i} is the number of children for a given level of parents' smoking behavior, and π_{i} = P(Y = 1 | x_{i}) is the probability of a randomly chosen student smoking given parents' smoking status. Here i takes values 1 and 2. Thus, we claim that all students who have at least one parent smoking will have the same probability of "success", and all students who have neither parent smoking will have the same probability of "success", though the success probabilities for the two groups will differ. Then the response variable Y has a binomial distribution:
\(y_i \sim Bin(n_i,\pi_i)\)
Explanatory variable X is a dummy variable such that
X_{i} = 0 if neither parent smokes,
X_{i} = 1 if at least one parent smokes.
Understanding the use of dummy variables is important in logistic regression.
Think about the following question before reading on: can you explain to someone what is meant by a "dummy variable"?
Then the logistic regression model can be expressed as:
\(\text{logit}(\pi_i)=\text{log}\dfrac{\pi_i}{1-\pi_i}=\beta_0+\beta_1 X_i\) (1) or
\(\pi_i=\dfrac{\text{exp}(\beta_0+\beta_1 x_i)}{1+\text{exp}(\beta_0+\beta_1 x_i)}\) (2)
and it says that the log-odds of a randomly chosen student smoking is β_{0} when neither parent smokes, and β_{0} + β_{1} when at least one parent smokes.
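Because this model has two parameters and two groups, it is saturated, and the maximum likelihood estimates equal the observed group log-odds. A short sketch verifying this from the table above:

```python
import math

# Observed counts from the running example, keyed by the dummy variable:
# x = 1: at least one parent smokes; x = 0: neither parent smokes
smokers = {1: 816, 0: 188}
totals  = {1: 4019, 0: 1356}

def log_odds(x):
    """Empirical log-odds of student smoking at dummy value x."""
    p = smokers[x] / totals[x]
    return math.log(p / (1 - p))

beta0 = log_odds(0)          # log-odds when neither parent smokes
beta1 = log_odds(1) - beta0  # log odds ratio for parental smoking
print(beta0, beta1)
```

Exponentiating beta1 recovers the sample odds ratio of about 1.58 from the 2 × 2 table.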