12.3.1 - Formulas & Assumptions

Simple linear regression uses data from a sample to construct the line of best fit. But what makes a line “best fit”? The most common method of constructing a regression line, and the method that we will be using in this course, is the least squares method. The least squares method computes the values of the \(y\)-intercept and slope that make the sum of the squared residuals as small as possible. A residual is the difference between the actual value of \(y\) and the value of \(y\) predicted by \(\widehat{y}=b_0+b_1 x\). Residuals are symbolized by \(\varepsilon\) (“epsilon”) in a population and by \(e\) or \(\widehat{\varepsilon}\) in a sample.

Residuals

As with most predictions, you expect there to be some error; that is, you do not expect the prediction to be exactly correct. For example, when predicting the percent of voters who selected your candidate, you would expect the prediction to be close to, but not necessarily exactly equal to, the final voting percentage. Also, in regression, usually not every individual with the same \(x\) value has the same \(y\) value. For example, if we are using height to predict weight, not every person with the same height would have the same weight. These errors in regression predictions are called residuals or prediction errors. A residual is calculated by taking the observed \(y\) value minus its corresponding predicted \(y\) value; therefore, each individual has a residual. The goal in least squares regression is to select the line that minimizes the sum of the squared residuals. In essence, we create a best fit line that has the least amount of error.

Residual
\(e_i =y_i -\widehat{y}_i\)

\(y_i\) = actual value of \(y\) for the \(i^{th}\) observation
\(\widehat{y}_i\) = predicted value of \(y\) for the \(i^{th}\) observation

Sum of Squared Residuals

Also known as Sum of Squared Errors (SSE)
\(SSE=\sum (y-\widehat{y})^2\)
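
To make these formulas concrete, here is a minimal Python sketch (using NumPy, an assumption on our part; the course itself uses Minitab Express). The data and the coefficients \(b_0\) and \(b_1\) are hypothetical, chosen only to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical data: five (x, y) pairs, made up purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Suppose the fitted line is y-hat = b0 + b1*x; these coefficient values
# are assumed for this sketch, not computed from the data
b0, b1 = 0.1, 2.0
y_hat = b0 + b1 * x            # predicted values

residuals = y - y_hat          # e_i = y_i - y-hat_i
sse = np.sum(residuals ** 2)   # sum of squared residuals (SSE)

print(residuals)  # approximately [0.0, -0.2, 0.1, -0.3, 0.0]
print(sse)        # approximately 0.14
```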

Computing \(b_0\) and \(b_1\) by hand

Recall that the equation for a simple linear regression line is \(\widehat{y}=b_0+b_1x\), where \(b_0\) is the \(y\)-intercept and \(b_1\) is the slope.

Statistical software will compute the values of the \(y\)-intercept and slope that minimize the sum of squared residuals. The conceptual formulas below show how these statistics are related to one another and how they relate to correlation, which you learned about earlier in this lesson. In this course, we will always use Minitab Express to compute these values.

Slope
\(b_1 =r \frac{s_y}{s_x}\)

\(r\) = Pearson’s correlation coefficient between \(x\) and \(y\)
\(s_y\) = standard deviation of \(y\)
\(s_x\) = standard deviation of \(x\)

y-intercept
\(b_0=\overline{y} - b_1 \overline{x}\)

\(\overline {y}\) = mean of \(y\)
\(\overline {x}\) = mean of \(x\)
\(b_1\) = slope
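
These two formulas can be verified numerically. Below is a minimal Python sketch (again assuming NumPy, with the same hypothetical data as above) that computes \(b_1 = r \frac{s_y}{s_x}\) and \(b_0=\overline{y} - b_1 \overline{x}\), then cross-checks them against NumPy's own least squares fit; the two results should agree.

```python
import numpy as np

# Hypothetical data, reused from the sketch above
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = np.corrcoef(x, y)[0, 1]    # Pearson's correlation coefficient
s_x = np.std(x, ddof=1)        # sample standard deviation of x
s_y = np.std(y, ddof=1)        # sample standard deviation of y

b1 = r * s_y / s_x             # slope
b0 = y.mean() - b1 * x.mean()  # y-intercept

# Cross-check against NumPy's least squares fit; the values should match
b1_check, b0_check = np.polyfit(x, y, deg=1)
print(b0, b1)
print(b0_check, b1_check)
```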

Assumptions of Simple Linear Regression

In order to use the methods above, four assumptions must be met (a sketch for checking them with software follows the list):

  1. Linearity: The relationship between \(x\) and \(y\) must be linear. Check this assumption by examining a scatterplot of \(x\) and \(y\).
  2. Independence of errors: There is not a relationship between the residuals and the \(y\) variable; in other words, \(y\) is independent of the errors. Check this assumption by examining a scatterplot of “residuals versus fits”; the correlation should be approximately 0.
  3. Normality of errors: The residuals must be approximately normally distributed. Check this assumption by examining a normal probability plot; the observations should fall near the line. You can also examine a histogram of the residuals; its distribution should be approximately normal.
  4. Equal variances: The variance of the residuals is the same for all values of \(x\). Check this assumption by examining the scatterplot of “residuals versus fits”; the variance of the residuals should be the same across all values of the \(x\)-axis. If the plot shows a pattern (e.g., a bowtie or megaphone shape), then the variances are not consistent and this assumption has not been met.
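
In this course, the diagnostic plots for these checks come from Minitab Express. Purely as an illustration, the following Python sketch (assuming NumPy, SciPy, and Matplotlib are available, with the same hypothetical data as above) generates analogous plots: a fitted line plot for linearity, a residuals-versus-fits plot for independence and equal variances, and a normal probability plot and histogram for normality.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical data, reused from the earlier sketches
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1, b0 = np.polyfit(x, y, deg=1)  # least squares slope and intercept
fits = b0 + b1 * x                # fitted (predicted) values
residuals = y - fits

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# 1. Linearity: scatterplot of x and y with the fitted line
axes[0, 0].scatter(x, y)
axes[0, 0].plot(x, fits)
axes[0, 0].set(title="Fitted line plot", xlabel="x", ylabel="y")

# 2 and 4. Independence of errors and equal variances: residuals versus
# fits (look for correlation near 0 and constant spread around the 0 line)
axes[0, 1].scatter(fits, residuals)
axes[0, 1].axhline(0)
axes[0, 1].set(title="Residuals versus fits", xlabel="Fitted value", ylabel="Residual")

# 3. Normality of errors: normal probability plot and histogram of residuals
stats.probplot(residuals, plot=axes[1, 0])
axes[1, 1].hist(residuals)
axes[1, 1].set(title="Histogram of residuals", xlabel="Residual", ylabel="Frequency")

plt.tight_layout()
plt.show()
```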

Example: Checking Assumptions

The following example uses students' scores on two tests.

  1. Linearity. The scatterplot below shows that the relationship between Test 3 and Test 4 scores is linear.

    [Figure: Fitted line plot for the linear model]

  2. Independence of errors. The plot of residuals versus fits is shown below. The correlation shown in this scatterplot is approximately \(r=0\); thus, this assumption has been met.

    [Figure: Residuals versus fits plot (response is Test 4)]

  3. Normality of errors. On the normal probability plot, we look to see whether our observations follow the given line; here they do, which tells us that the distribution of residuals is approximately normal. We could also look at the second graph, a histogram of the residuals; it too shows that the distribution of residuals is approximately normal.

    [Figure: Normal probability plot (response is Test 4)]

    [Figure: Histogram of residuals, frequency vs residual (response is Test 4)]

  4. Equal variance. Again we use the plot of residuals versus fits. This time we are checking that the variance of the residuals is consistent across all fitted values.

    [Figure: Residuals versus fits plot (response is Test 4)]

The next section will show you how to construct simple linear regression equations using statistical software. The graphs shown above can be obtained when running the regression model using Minitab Express.

Review of New Terms

Before we continue, let’s review a few of the new terms:

Least squares method
Method of constructing a regression line that makes the sum of the squared residuals as small as possible for the given data.
Residual
Actual value minus predicted value (i.e., \(e=y-\widehat{y}\)); the vertical distance between the actual \(y\) value and the regression line.
Sum of squared residuals
The sum of all of the residuals squared: \(\sum (y-\widehat{y})^2\).