A Prediction Interval for a New Y


On the previous page, we focused our attention on deriving a confidence interval for the mean \(\mu_Y\) at x, a particular value of the predictor variable. Now, we'll turn our attention to deriving a prediction interval, not for a mean, but rather for predicting a (that's one!) new observation of the response, which we'll denote \(Y_{n+1}\), at x, a particular value of the predictor variable. Let's again just jump right in and state (and prove) the result.

Theorem. A (1−α)100% prediction interval for a new observation \(Y_{n+1}\) when the predictor \(x = x_{n+1}\) is:

\[\hat{y}_{n+1} \pm t_{\alpha/2,n-2}\sqrt{MSE} \sqrt{1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}}\]

 Proof. First, recall that: 

\[Y_{n+1} \sim N(\alpha+\beta(x_{n+1}-\bar{x}),\sigma^2)\] and \[\hat{\alpha} \sim N\left(\alpha,\dfrac{\sigma^2}{n}\right)\] and  \[\hat{\beta}\sim N\left(\beta,\dfrac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}\right)\] and  \[\dfrac{n\hat{\sigma}^2}{\sigma^2}=\dfrac{(n-2)MSE}{\sigma^2}\sim \chi^2_{(n-2)}\]

are independent. Therefore:

\[W=Y_{n+1}-\hat{Y}_{n+1}=Y_{n+1}-\hat{\alpha}-\hat{\beta}(x_{n+1}-\bar{x})\]

is a linear combination of independent normal random variables with mean:

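\[E(W)=E(Y_{n+1})-E(\hat{\alpha})-(x_{n+1}-\bar{x})E(\hat{\beta})=\alpha+\beta(x_{n+1}-\bar{x})-\alpha-\beta(x_{n+1}-\bar{x})=0\]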

and variance:

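\[\begin{aligned} Var(W) &= Var\left(Y_{n+1}-\hat{\alpha}-\hat{\beta}(x_{n+1}-\bar{x})\right) \\ &= Var(Y_{n+1})+Var(\hat{\alpha})+(x_{n+1}-\bar{x})^2 Var(\hat{\beta}) \\ &= \sigma^2+\dfrac{\sigma^2}{n}+\dfrac{(x_{n+1}-\bar{x})^2\sigma^2}{\sum(x_i-\bar{x})^2} \\ &= \sigma^2\left[1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right] \end{aligned}\]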

The first equality holds by the definition of W. The second equality holds because \(Y_{n+1}\), \(\hat{\alpha}\) and \(\hat{\beta}\) are independent. The third equality comes from the distributions of \(Y_{n+1}\), \(\hat{\alpha}\) and \(\hat{\beta}\) that are recalled above. And, the last equality comes from simple algebra. Putting it all together, we have:

\[W=(Y_{n+1}-\hat{Y}_{n+1})\sim N\left(0,\sigma^2\left[1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right]\right)\]

Now, the definition of a T random variable tells us that:

\[T=\dfrac{\dfrac{(Y_{n+1}-\hat{Y}_{n+1})-0}{\sqrt{\sigma^2\left(1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right)}}}{\sqrt{\dfrac{n\hat{\sigma}^2}{\sigma^2}/(n-2)}}=\dfrac{(Y_{n+1}-\hat{Y}_{n+1})}{\sqrt{MSE}\sqrt{1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}}} \sim t_{n-2}\]

So, finding the prediction interval for \(Y_{n+1}\) again reduces to manipulating the quantity inside the parentheses of a probability statement:

\[P\left(-t_{\alpha/2,n-2} \leq \dfrac{(Y_{n+1}-\hat{Y}_{n+1})}{\sqrt{MSE}\sqrt{1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}}} \leq +t_{\alpha/2,n-2}\right)=1-\alpha\]

Upon doing the manipulation, we get that a (1−α)100% prediction interval for \(Y_{n+1}\) is:

\[\hat{y}_{n+1} \pm t_{\alpha/2,n-2}\sqrt{MSE} \sqrt{1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}}\]

as was to be proved.
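To make the theorem concrete, here is a minimal Python sketch that computes the interval from raw data. The function name `prediction_interval` and its arguments are illustrative choices, not part of these notes, and the sample data at the bottom are made up. The critical value \(t_{\alpha/2,n-2}\) is assumed to be supplied by the caller (from a t-table, Minitab, or a probability calculator) rather than computed.

```python
import math

def prediction_interval(x, y, x_new, t_crit):
    """Prediction interval for a new response at x_new (simple linear regression).

    t_crit is the critical value t_{alpha/2, n-2}, supplied by the caller.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    alpha_hat = y_bar          # intercept in the centered parametrization
    beta_hat = sxy / sxx       # least squares slope

    # MSE = residual sum of squares divided by n - 2
    sse = sum((yi - alpha_hat - beta_hat * (xi - x_bar)) ** 2
              for xi, yi in zip(x, y))
    mse = sse / (n - 2)

    # Point prediction and the margin from the theorem
    y_hat = alpha_hat + beta_hat * (x_new - x_bar)
    margin = t_crit * math.sqrt(mse) * math.sqrt(
        1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    return y_hat - margin, y_hat + margin

# Made-up data for illustration; 3.182 is t_{0.025,3} for n = 5
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(prediction_interval(x, y, 3, 3.182))
```

Passing the critical value in keeps the sketch free of third-party dependencies; a statistics library could compute it instead.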

Example (continued)

The eruptions of Old Faithful Geyser in Yellowstone National Park, Wyoming are quite regular (and hence its name). Rangers post the predicted time until the next eruption (y, in minutes) based on the duration of the previous eruption (x, in minutes). Using the data collected on 107 eruptions from a park geologist, R. A. Hutchinson, what is the predicted time until the next eruption if the previous eruption lasted 4.8 minutes? lasted 3.5 minutes?

Solution. Again, the easiest (and most practical!) way of calculating the prediction interval for the new observation is to let Minitab do the work for us. Here's what the resulting analysis looks like:

[Minitab output: 95% prediction intervals for a new observation at x = 4.8 and x = 3.5]

That is, we can be 95% confident that, if the previous eruption lasted 4.8 minutes, then the time until the next eruption is between 71.969 and 98.801 minutes. And, we can be 95% confident that, if the previous eruption lasted 3.5 minutes, then the time until the next eruption is between 58.109 and 84.734 minutes.

Let's do one of the calculations by hand, though. When the previous eruption lasted x = 4.8 minutes, then the predicted time until the next eruption is:

\[\hat{y}=33.828 + 10.741(4.8)=85.385\]

Now, we can use Minitab or a probability calculator to determine that \(t_{0.025,105} = 1.9828\). We can also use Minitab to determine that MSE equals 44.66 (it is rounded to 45 in the above output), the mean duration is 3.46075 minutes, and:

\[\sum\limits_{i=1}^n (x_i-\bar{x})^2=113.835\]

Putting it all together, we get:

\[85.385 \pm 1.9828 \sqrt{44.66} \sqrt{1+\dfrac{1}{107}+\dfrac{(4.8-3.46075)^2}{113.835}}\]

which simplifies to this:

\[85.385 \pm 13.416\]

and finally this:

\[(71.969,98.801)\]
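The arithmetic above can be double-checked with a few lines of Python. This is only a sketch that plugs in the summary values quoted in the example; the variable names are arbitrary.

```python
import math

# Summary values quoted in the example above
n = 107
t_crit = 1.9828      # t_{0.025,105}
mse = 44.66
x_bar = 3.46075      # mean duration of the previous eruptions
sxx = 113.835        # sum of (x_i - x_bar)^2
x_new = 4.8

# Point prediction from the fitted line
y_hat = 33.828 + 10.741 * x_new

# Margin of the 95% prediction interval
margin = t_crit * math.sqrt(mse) * math.sqrt(
    1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
lower, upper = y_hat - margin, y_hat + margin

print(f"{y_hat:.3f} +/- {margin:.3f}")
print(f"({lower:.3f}, {upper:.3f})")
```

Rounded to three decimal places, the printed values match the hand calculation.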

as we (thankfully) obtained previously using Minitab. Incidentally, you might note that the length of the confidence interval for \(\mu_Y\) when x = 4.8 is:

\[87.484-83.286=4.198\]

and the length of the prediction interval when x = 4.8 is:

\[98.801-71.969=26.832\]

Hmmm. I wonder if that means that the confidence interval will always be narrower than the prediction interval? That is indeed the case. Let's take note of that, as well as a few other things. 
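As a quick numerical check of that claim, the following sketch compares the two interval lengths at a few predictor values, using the summary values quoted in the example. The helper name `half_widths` and the choice of predictor values are made up for illustration.

```python
import math

# Summary values quoted in the example
n, t_crit, mse = 107, 1.9828, 44.66
x_bar, sxx = 3.46075, 113.835

def half_widths(x_new):
    """Return (confidence, prediction) interval half-widths at x_new."""
    # Term shared by both standard errors
    shared = 1 / n + (x_new - x_bar) ** 2 / sxx
    ci = t_crit * math.sqrt(mse * shared)         # for the mean response
    pi = t_crit * math.sqrt(mse * (1 + shared))   # for a new observation
    return ci, pi

for x_new in (2.0, 3.5, 4.8):
    ci, pi = half_widths(x_new)
    print(f"x = {x_new}: CI length {2 * ci:.3f}, PI length {2 * pi:.3f}")
```

Because the prediction half-width adds 1 inside the square root, it exceeds the confidence half-width at every predictor value, not just the ones tried here.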

Notes

(1) For a given value x of the predictor variable, and confidence level (1−α), the prediction interval for a new observation \(Y_{n+1}\) is always longer than the corresponding confidence interval for the mean \(\mu_Y\). That's because the prediction interval has an extra term (MSE, the estimate of the population variance) in its standard error:

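\[\hat{y}_{n+1} \pm t_{\alpha/2,n-2}\sqrt{MSE+MSE\left(\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right)}\]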

(2) The prediction interval for a new observation \(Y_{n+1}\) can be made narrower in the same ways that we can make the confidence interval for the mean \(\mu_Y\) narrower. That is, we can make a prediction interval for a new observation \(Y_{n+1}\) narrower by (i) decreasing the confidence level, (ii) increasing the sample size, (iii) choosing predictor values \(x_i\) so that they are quite spread out, and (iv) predicting \(Y_{n+1}\) at the mean of the predictor values.

(3) We cannot make the standard error of the prediction for \(Y_{n+1}\) approach 0, as we can for the standard error of the estimate for \(\mu_Y\). That's again because the prediction interval has an extra term (MSE, the estimate of the population variance) in its standard error:

\[\sqrt{MSE+MSE\left(\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right)}\]
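Note (3) can be illustrated numerically. The sketch below predicts at the mean of the predictor values, so the term involving \((x_{n+1}-\bar{x})^2\) drops out, and it holds MSE fixed at the Old Faithful value of 44.66 while n grows. Holding MSE fixed is a simplifying assumption made only for illustration (in practice MSE changes from sample to sample), and the function names and sample sizes are made up.

```python
import math

mse = 44.66   # held fixed here purely for illustration

def se_mean(n):
    """Standard error for estimating the mean response at x = x_bar."""
    return math.sqrt(mse / n)

def se_pred(n):
    """Standard error for predicting a new observation at x = x_bar."""
    return math.sqrt(mse + mse / n)

for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9}: se_mean = {se_mean(n):.4f}, se_pred = {se_pred(n):.4f}")
```

As n grows, `se_mean` shrinks toward 0 while `se_pred` levels off near \(\sqrt{MSE}\), which is the extra term at work.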