12.6 - Video Example: March Madness
Have you ever filled out a March Madness bracket? For those of you who are not familiar with March Madness, it is the NCAA basketball tournament that takes place each spring in the month of March. It’s so big that the President even completes a bracket! If you have completed a bracket, how did you pick your teams? What was the reasoning that you used to fill it out? Here we will look at one way to use statistics to inform our decisions!
If you would like to work through this example on your own, the data can be found in the following file:
Let's use NCAA Tournament points per game (PPG) as the response variable and regular season PPG as the explanatory variable. Here, we are assuming that a team with a higher tournament PPG is more likely to win its tournament games. Both variables are quantitative, making simple linear regression the appropriate analysis tool.
Data were collected from the 2014-15 season. The video below walks through the analyses presented here. Below the video is a review of this process with a bit more detail.
Before using linear methods we should examine a scatterplot to ensure that the relationship is linear (as opposed to non-linear). The scatterplot below shows a linear, though weak, relationship between regular season and NCAA Tournament PPG.
We can compute the correlation coefficient to learn more about the relationship between these two variables.
|Pearson correlation of NCAA Tournament PPG and Regular Season PPG = 0.360180|
|P-Value = 0.0026|
We see that \(r=0.360180\). This is a moderately weak positive correlation. The \(p\)-value for the test of the null hypothesis that \(\rho=0\) is low (\(p=0.0026\)), so this is a statistically significant correlation. In other words, we conclude that in the population (assuming this is a representative sample) the correlation between regular season and NCAA Tournament PPG is different from 0.
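For readers who want to see how a correlation and its test statistic are computed, here is a minimal Python sketch. The function names and the six data pairs are hypothetical and are not the course data set; the statistic \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) is the standard one for testing \(\rho=0\) with \(n-2\) degrees of freedom.

```python
import math


def pearson_r(x, y):
    """Return the sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """Test statistic for H0: rho = 0, compared to a t distribution with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical values for illustration only, NOT the course data set.
regular_ppg = [68.0, 72.5, 75.1, 66.3, 79.8, 70.2]
tournament_ppg = [64.0, 70.0, 66.5, 61.0, 74.0, 69.5]

r = pearson_r(regular_ppg, tournament_ppg)
t = t_statistic(r, len(regular_ppg))
print(f"r = {r:.4f}, t = {t:.4f}")
```

The \(p\)-value would then come from comparing `t` to a \(t\) distribution, which statistical software does automatically.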
Next, we can construct a regression equation for predicting NCAA Tournament PPG using regular season PPG.
Before we can interpret regression output, we should check the assumptions of linear regression (LINE): linearity, independence of errors, normality of errors, and equal error variances. We already checked the assumption of a linear relationship by examining the scatterplot of the two variables. We can use the plot of residuals versus fits below to check the assumptions of independent errors and equal error variances. Here, we see no systematic pattern in the residuals across the fitted values, and the variance of the residuals is approximately equal across all fitted values.
The final assumption that we must check is the normality of residuals. Using the normal probability plot or a histogram of the residuals, we see that the residuals are approximately normally distributed. All assumptions have been met, and it is appropriate to use linear regression methods with these data.
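The fitted values and residuals behind these diagnostic plots can be computed directly. The sketch below is a minimal illustration in Python; the helper names and the data are hypothetical, not the course data set, and the plotting step is left to whatever software you prefer.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fits_and_residuals(x, y):
    """Return the fitted values and the residuals (observed minus fitted)."""
    intercept, slope = least_squares(x, y)
    fits = [intercept + slope * a for a in x]
    residuals = [b - f for b, f in zip(y, fits)]
    return fits, residuals


# Hypothetical values for illustration only, NOT the course data set.
regular_ppg = [68.0, 72.5, 75.1, 66.3, 79.8, 70.2]
tournament_ppg = [64.0, 70.0, 66.5, 61.0, 74.0, 69.5]

fits, residuals = fits_and_residuals(regular_ppg, tournament_ppg)
for f, e in zip(fits, residuals):
    # Each pair is one point on a residuals versus fits plot.
    print(f"fit = {f:6.2f}, residual = {e:6.2f}")
```

Least squares residuals always sum to zero and have no linear correlation with the fitted values, so the plots are inspected for other patterns, such as curvature or a spread that widens or narrows across the fitted values.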
The ANOVA source table gives us information about the entire model. The \(p\)-value for the model is 0.0026. Because this is simple linear regression, this is the same \(p\)-value that we found earlier when we examined the correlation and the same \(p\)-value that we see below in the test of statistical significance for the slope. Our \(R^2\) value is 0.1297, which tells us that 12.97% of the variance in NCAA Tournament PPG can be explained by regular season PPG. This is a relatively low \(R^2\) value.
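The reason these \(p\)-values agree is that, in simple linear regression, \(R^2\) equals \(r^2\) and the model's \(F\) statistic equals the square of the slope's \(t\) statistic. The short Python check below uses only the figures reported in the output for this example; the variable names are our own, and small discrepancies are assumed to come from rounding in the reported values.

```python
# Values copied from the output reported for this example.
r = 0.360180          # Pearson correlation
slope = 0.6197        # coefficient for Regular Season PPG
se_slope = 0.1976     # standard error of that coefficient
f_reported = 9.84     # F-Value from the ANOVA table

# R-squared is the square of the correlation in simple linear regression.
r_squared = r ** 2
print(f"r^2 = {r_squared:.4f}")   # 0.1297, matching the reported R^2

# The slope's t statistic is the coefficient divided by its standard error.
t_value = slope / se_slope
print(f"t   = {t_value:.2f}")     # 3.14, matching the reported T-Value

# Squaring t gives the model F statistic (up to rounding of the inputs).
f_from_t = t_value ** 2
print(f"t^2 = {f_from_t:.2f} vs reported F = {f_reported}")
```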
|Source||DF||Adj SS||Adj MS||F-Value||P-Value|
|Regular Season PPG||1||598.32||598.320||9.84||0.0026|
|Term||Coef||SE Coef||T-Value||P-Value||VIF|
|Regular Season PPG||0.6197||0.1976||3.14||0.0026||1.00|
|NCAA Tournament PPG = 21.54 + 0.6197 Regular Season PPG|
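To see how the regression equation is used, the sketch below plugs a regular season PPG into it. The function name and the input of 70 PPG are hypothetical choices for illustration; the intercept and slope are the values from the equation above.

```python
INTERCEPT = 21.54   # from the fitted regression equation
SLOPE = 0.6197      # predicted change in tournament PPG per 1 regular season PPG


def predict_tournament_ppg(regular_season_ppg):
    """Predicted NCAA Tournament PPG for a given regular season PPG."""
    return INTERCEPT + SLOPE * regular_season_ppg


# 70 PPG is a hypothetical input, not a team from the data set.
print(f"{predict_tournament_ppg(70):.2f}")  # 64.92
```

Each additional regular season point per game raises the predicted tournament PPG by 0.6197 points. Because \(R^2\) is low, individual predictions like this one should be expected to be imprecise.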
While there is a statistically significant relationship between regular season PPG and NCAA Tournament PPG, the \(R^2\) value is relatively low. At this point we would probably go back and revise our theory. We may want to choose a different variable to predict NCAA Tournament PPG. Or, we may want to add additional variables to our current model by using multiple linear regression methods.