15.2 - Cross-Validation
Cross-validation is a method that uses the same data to both train the model and obtain a less biased estimate of prediction error than the direct estimate. The basic idea is to split the training data into two subsets - one subset is used to train the prediction rule and then the other subset is used to assess prediction error. To use the data efficiently, this is repeated with multiple splits of the data.
The earliest and still most commonly used method is leave-one-out cross-validation. One of the n observations is set aside for validation and the prediction rule is trained on the other n-1 observations. The error in predicting the held-out observation is recorded. This is repeated n times, leaving out each observation once. For regression, the average of the squared prediction errors (or its square root) is the estimated prediction error. For classification, a weighted average of the number of misclassified observations is generally used. Although the method requires fitting the prediction rule n times, there are computationally efficient methods to do this for many commonly used predictors.
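As a minimal sketch of the procedure, the code below runs leave-one-out cross-validation for a simple least-squares line, first by refitting n times and then with a well-known shortcut for least squares, in which the leave-one-out residual equals the ordinary residual divided by one minus the leverage. The simulated data, the function names, and the choice of a straight-line predictor are assumptions made for illustration only.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def loocv_mse(xs, ys):
    """Leave-one-out estimate of mean squared prediction error (n refits)."""
    n = len(xs)
    sq_errors = []
    for i in range(n):
        # Train on every observation except observation i.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # Record the squared error in predicting the held-out observation.
        sq_errors.append((ys[i] - (a + b * xs[i])) ** 2)
    return sum(sq_errors) / n

def loocv_mse_shortcut(xs, ys):
    """Same estimate from a single fit, using the leverage of each point."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - mx) ** 2 / sxx
        total += ((y - (a + b * x)) / (1.0 - leverage)) ** 2
    return total / n

# Simulated data (an assumption for this example).
rng = random.Random(0)
xs = [float(i) for i in range(20)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

print(loocv_mse(xs, ys), loocv_mse_shortcut(xs, ys))
```

The two numbers printed should agree up to rounding error, which is why the n refits can be avoided for this particular predictor.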
Statistical theory and simulation have demonstrated that leave-one-out cross-validation is not a good estimate of prediction error for every type of predictor. In particular, it does not do well for problems such as determining the number of clusters or feature selection. In addition, in some cases there are no known computationally efficient methods. In these cases, two other cross-validation strategies may be used. Leave-out-k cross-validation divides the data into a subset of k observations that will be used as the validation set, and the other n-k observations that are used for training. It then proceeds like leave-one-out cross-validation. Since there are many more subsets of size k than of size 1, often only a random sample of the subsets is used. There is some statistical theory to guide the choice of k.
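The random-subset version of leave-out-k cross-validation can be sketched as follows. To keep the example short, the predictor is simply the mean of the training responses; that choice, the function name `leave_k_out_mse`, and the parameter `n_splits` (the number of random subsets drawn) are assumptions for illustration.

```python
import random

def leave_k_out_mse(ys, k, n_splits, seed=0):
    """Average validation MSE over n_splits random validation sets of size k."""
    rng = random.Random(seed)
    n = len(ys)
    split_errors = []
    for _ in range(n_splits):
        # Draw k observations at random to serve as the validation set.
        val_idx = set(rng.sample(range(n), k))
        # Train on the remaining n - k observations (here: take their mean).
        train = [y for i, y in enumerate(ys) if i not in val_idx]
        prediction = sum(train) / len(train)
        # Mean squared error on the k held-out observations.
        split_errors.append(sum((ys[i] - prediction) ** 2 for i in val_idx) / k)
    return sum(split_errors) / n_splits

# Simulated responses (an assumption for this example).
data_rng = random.Random(42)
ys = [data_rng.gauss(10, 2) for _ in range(50)]
print(leave_k_out_mse(ys, k=5, n_splits=200))
```

Enumerating all subsets of size 5 from 50 observations would require more than two million fits, which is why a random sample of subsets is drawn instead.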
k-fold cross-validation divides the entire dataset into k subsets. In turn, each subset is used as the validation sample, while the other k-1 subsets are combined to use as the training sample. There is statistical theory that shows that the appropriate choice of k depends on n and the type of predictor. However, in practice, the dependency is weak, and k=10 works well for a large range of sample sizes and in many problems.
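A minimal sketch of k-fold cross-validation is given below. The helpers `kfold_indices` and `kfold_mse`, the `fit`/`predict` calling convention, and the mean-only predictor are hypothetical names and choices for this example, not part of any particular library.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def kfold_mse(xs, ys, fit, predict, k=10, seed=0):
    """k-fold estimate of mean squared prediction error."""
    total_sq_error = 0.0
    for fold in kfold_indices(len(xs), k, seed):
        held_out = set(fold)
        # Combine the other k-1 folds into the training sample.
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        # Use the current fold as the validation sample.
        total_sq_error += sum((ys[i] - predict(model, xs[i])) ** 2 for i in fold)
    return total_sq_error / len(xs)

# A deliberately simple predictor: always predict the training mean.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

# Simulated data (an assumption for this example).
rng = random.Random(3)
xs = [float(i) for i in range(23)]
ys = [rng.gauss(5, 1) for _ in xs]
print(kfold_mse(xs, ys, fit_mean, predict_mean, k=10))
```

Each observation appears in exactly one validation fold, so every observation contributes one prediction error to the estimate.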
Note that cross-validation is used to estimate prediction error and sometimes aspects of the prediction equation such as the number of clusters or number of predictors. The final predictor will be trained on all the training data.
There is a small problem with this method for assessing prediction error. The final predictor is trained on all n observations (100% of the sample), but the estimated prediction error is based on predictors developed on smaller samples of size \(n - n/k < n\). So the cross-validation estimate of prediction error may be pessimistic: the final predictor might have slightly better prediction error than the estimate suggests. However, with 10-fold cross-validation the estimate cannot be too far off, because each predictor is trained on about 90% of the sample.
In some circumstances we want to pick the best of several choices of predictor. In this case, we use the same cross-validation strategy with each of the predictors and select the predictor with the smallest cross-validation estimate of prediction error. However, we have now over-used the sample in terms of estimating prediction error. It is best in this case to have yet another validation sample, called the test or hold-out sample, that is not used in the model development, but which can be used to assess the prediction error of the selected predictor. 
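The selection-then-assessment workflow can be sketched as follows. Two candidate predictors (a constant mean and a least-squares line) are compared by 10-fold cross-validation on a development sample, and the winner is then assessed on a hold-out sample that played no part in the selection. The simulated data, the 100/20 split, and all function names are assumptions made for this example.

```python
import random

def fit_mean(xs, ys):
    """Candidate 1: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Candidate 2: least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def kfold_mse(fit, xs, ys, k=10):
    """k-fold estimate of MSE; fold f holds observations f, f+k, f+2k, ..."""
    n = len(xs)
    total = 0.0
    for f in range(k):
        val = range(f, n, k)
        val_set = set(val)
        train_x = [xs[i] for i in range(n) if i not in val_set]
        train_y = [ys[i] for i in range(n) if i not in val_set]
        model = fit(train_x, train_y)
        total += sum((ys[i] - model(xs[i])) ** 2 for i in val)
    return total / n

# Simulated data, generated in random order so no extra shuffle is needed.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(120)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

# Set the hold-out (test) sample aside before any model development.
dev_x, dev_y = xs[:100], ys[:100]
test_x, test_y = xs[100:], ys[100:]

# Select the candidate with the smallest cross-validation error.
candidates = {"mean": fit_mean, "line": fit_line}
cv_scores = {name: kfold_mse(fit, dev_x, dev_y) for name, fit in candidates.items()}
best_name = min(cv_scores, key=cv_scores.get)

# Refit the selected predictor on the development sample, then assess it once.
final_model = candidates[best_name](dev_x, dev_y)
test_error = mse(final_model, test_x, test_y)
print(best_name, cv_scores, test_error)
```

Because the hold-out sample is touched only once, after selection is finished, its error estimate is not affected by the comparison among candidates.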
 Lever, J., Krzywinski, M. & Altman, N., Model Selection and Overfitting. Nature Methods, 13, 703–704 (2016)