Recently I’ve seen a lot of misunderstanding about how/why cross validation (CV) is used in model selection/fitting. I’ve seen it misused in a number of ways and in a number of settings. I thought it might be worth it to write up a quick, hopefully accessible guide to CV.

Part 1 will cover general ideas. There won’t be any practical sports examples as I am trying to be very general here. I’ll have part 2 up in December with a practical example using some NBA shot data. I’ll post the code in a colab so people can see CV in action.

(There may be some slight abuse of notation)

# Cross Validation: A Quick Primer

## Guidelines

### General Idea

Cross validation is generally about model selection. It also can be used to get an estimate of error.

Statistics is about quantifying uncertainty. Cross validation is a way to quantify uncertainty for many models and use the uncertainty to select one. Cross validation also can be used to quantify the uncertainty of a single model.

### Some Quick Definitions

I think it is important to be clear about what I mean by a model. In simple terms, a statistical model is the way we are going to fit the data. A model encodes the underlying assumptions we have about the data generating process.

Examples of models might be:

- Logistic with all available variables
- Logistic with just variables A, B, C, and D
- Random forest
- SVM
- LASSO
- Neural network with some tuning parameter lambda
- Neural network with a different tuning parameter kappa
- Neural network where the tuning parameter is selected to maximize accuracy for the data being fit
- etc.

When I refer to a **model**, I mean the general framework of the model, such as linear with variables A,B, C, and D. When I refer to a **model fit**, I mean the version of that model fit to the data, such as the coefficients on A, B, C, and D in a linear model (and intercept if needed).

### Details

Generally we use cross validation to pick a model from a number of options. It helps us avoid overfitting. CV also helps us refine our beliefs about the underlying data generating process.

In the case of outcome prediction, we often need to tune the inputs used in the model (or data needed or whatever), explore different types of models, or determine which independent variables are of interest. We use CV to get estimates of error/metrics for various models (score, correlation, MSE, etc whatever) and pick one model from there. We can have CV do feature selection as well, but then we are testing that particular variable selection method, not the variables chosen by the selection method. For example we can use CV on a LASSO model (which incorporates variable selection), in which case we are testing LASSO, not the variables it selected. We could also test some particular set of variables in a model of their own.

##### The actual method for cross validation is as follows

For each candidate model:

- Split the dataset into k folds, i.e. split the data into k equal-sized disjoint subsets, or “folds”
- For each unique fold:
- Take that fold as a validation set
- Use the remaining k-1 folds as a training set
- Fit the model onto the training set and use that fitted model on the held out kth fold to get a predicted outcome
- Compare the predicted outcome to the truth, calculate error and other metrics etc.

Fit all the models this way and, generally, take the average of whatever metrics you calculated for each fold. But we might also look at the variance of the metrics. Then look at which model performs “best.” The definition of “best” will depend on what we care about. Is could be about maximizing true positives and true negatives. Or minimizing squared error loss. Or getting within X%. Whatever.

Once we have chosen a model based on the training data, we still want to get an estimate of prediction/estimation error for any new data that would be collected independently of the training data. If we have a new set of data, say we get access to a new season for a sport, then we can get an estimate of error using that new data. So if we decided that a simple logistic model with variables A, B, C, and A^2 is the “best”, we fit that model on all the training data to get coefficient estimates. We then predict outcomes for the new data, compare to the truth, and look at the error/correlation/score/whatever etc. This gives us our estimate of the true error/MSE/accuracy/AUROC etc.

We could instead decide that a LASSO model is best. So we fit LASSO on all our training data and use that to get coefficients/variables which we then apply to the validation data.

In the absence of a new data set, we still need to estimate the error. We can use CV to get an estimate of the error. We could do the same thing if we had a priori decided to use a model with certain variables. We take the model we decided on a priori, fit on k-1 folds, apply the fitted model to the held out kth fold, compare to truth, calculate error and other metrics etc. Do this for all folds and get metrics for all k folds. Then we can average those metrics, or look at their distribution.

Remember, statistics is about quantifying uncertainty. Cross validation is a tool for doing that. We do not get the final fitted model during the cross validation step.

## Common Mistakes

Common scenarios, mistakes, and how to fix them:

- You already have an a priori idea of what model you are going to use to predict a continuous outcome – a linear regression with variables A, B, and C. You want to know the coefficients for A, B, and C.
**Mistake**: Fit that linear model on each of the k fold complement sets to get k sets of coefficients. Average those coefficients to get the final model.**Correction**: Fit the linear model on all the training data at once to estimate coefficients. Test the model fit on totally new data (if new data is available). Or use the k folds to get estimates of error. Either way, you get the coefficients from fitting the model on all the data.

- You already have an a priori idea of what model you are going to use to make binary classifications – a threshold model where if the probability of a positive outcome is above some p%, you classify it as positive. You want to know what p should be.
**Mistake**: Fit that threshold model on each of the k fold complements sets to get the optimal p_k% for each subset. Average p_k across all k folds to get the final threshold p.**Correction**: Fit the threshold model on all the training data at once to estimate p. Test the model fit on totally new data (if new data is available). Or use the k folds to get estimates of error. Either way, you get p from fitting the model on all the data.

- You have many ideas for potential models and want to know which one performs “best.” You fit each model on each of the k-1 folds and compare to each held out kth fold to estimate metrics. You decide a logistic regression model gives the best AUROC (your chosen metric of choice).
**Mistake**: You average the coefficients of all k logistic model fits in order to get the final model.**Correction**: Fit the logistic model on all the training data at once to estimate coefficients. Test the model fit on totally new data (if new data is available). Or use the k folds to get estimates of error. Either way, you get the coefficients from fitting the model on all the data

- You have many ideas for potential models and want to know which one performs “best.” You fit each model on each of the k-1 folds and compare to each held out kth fold to estimate metrics. You decide a LASSO (penalized logistic regression) model gives the best AUROC (your chosen metric of choice).
**Mistake**: You average the coefficients of all k logistic model fits in order to get the final model. The model fits don’t always select the same variables, so you just take all of them, but assign a coefficient of zero whenever a variable is not chosen.**Correction**: Fit the LASSO model on all the training data at once to choose variables and estimate coefficients. Test the model fit on totally new data (if new data is available). Or use the k folds to get estimates of error. Either way, you get the variables and coefficients from fitting the model on all the data

- You have many ideas for potential models and want to know which one performs “best.” You fit each model on each of the k-1 folds and compared to the held out kth fold to estimate metrics. You decide a random forest model gives the best MSE (your chosen metric of choice).
**Mistake**: You fit the random forest model on all the training data and use it to get predicted outcomes for all your training data samples. You then compare those predictions to the true outcomes and calculate an estimate of the error.**Correction**: Test the model fit on totally new data (if new data is available). Or use the k folds to get estimates of error. Estimating error from the data you used to select and fit the model will result in an underestimate of error.