Cross Validation – Part 1

Recently I’ve seen a lot of misunderstanding about how/why cross validation (CV) is used in model selection/fitting. I’ve seen it misused in a number of ways and in a number of settings. I thought it might be worth it to write up a quick, hopefully accessible guide to CV.

Part 1 will cover general ideas. There won’t be any practical sports examples as I am trying to be very general here. I’ll have part 2 up in December with a practical example using some NBA shot data. I’ll post the code in a colab so people can see CV in action.

(There may be some slight abuse of notation)

Cross Validation: A Quick Primer

Guidelines

General Idea

Cross validation is generally about model selection. It can also be used to get an estimate of error.

Statistics is about quantifying uncertainty. Cross validation is a way to quantify uncertainty for many models and use that uncertainty to select one. Cross validation can also be used to quantify the uncertainty of a single model.

Some Quick Definitions

I think it is important to be clear about what I mean by a model. In simple terms, a statistical model is the way we are going to fit the data. A model encodes the underlying assumptions we have about the data generating process.

Examples of models might be:

  • Logistic with all available variables
  • Logistic with just variables A, B, C, and D
  • Random forest
  • SVM
  • LASSO
  • Neural network with some tuning parameter lambda
  • Neural network with a different tuning parameter kappa
  • Neural network where the tuning parameter is selected to maximize accuracy for the data being fit
  • etc.

When I refer to a model, I mean the general framework of the model, such as linear with variables A, B, C, and D. When I refer to a model fit, I mean the version of that model fit to the data, such as the coefficients on A, B, C, and D in a linear model (and intercept if needed).
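
To make the distinction concrete, here is a minimal sketch in Python with scikit-learn (my assumption; the post doesn't commit to a toolkit) using made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The "model": an unfitted specification, here a linear regression on variables A, B, C, D.
model = LinearRegression()

# Hypothetical data, just for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # columns play the role of A, B, C, D
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(size=100)

# The "model fit": the same specification after estimating its coefficients from the data.
model_fit = model.fit(X, y)
print(model_fit.intercept_, model_fit.coef_)
```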

Details

Generally we use cross validation to pick a model from a number of options. It helps us avoid overfitting. CV also helps us refine our beliefs about the underlying data generating process.

In the case of outcome prediction, we often need to tune the inputs used in the model (or data needed or whatever), explore different types of models, or determine which independent variables are of interest. We use CV to get estimates of error/metrics for various models (score, correlation, MSE, etc.) and pick one model from there. We can have CV do feature selection as well, but then we are testing that particular variable selection method, not the variables chosen by the selection method. For example, we can use CV on a LASSO model (which incorporates variable selection), in which case we are testing LASSO, not the variables it selected. We could also test some particular set of variables in a model of their own.

The actual method for k-fold cross validation is as follows. First, split the dataset into k equal-sized, disjoint subsets, or “folds” (using the same folds for every candidate model keeps the comparison fair). Then:

For each candidate model:

  • For each fold:
    • Take that fold as a validation set
    • Use the remaining k-1 folds as a training set
    • Fit the model on the training set and use that fitted model to predict outcomes for the held out fold
    • Compare the predicted outcomes to the truth and calculate error and any other metrics

Fit all the models this way and, generally, take the average of whatever metric you calculated on each fold. We might also look at the variance of the metrics across folds. Then look at which model performs “best.” The definition of “best” will depend on what we care about. It could be about maximizing true positives and true negatives. Or minimizing squared error loss. Or getting within X%. Whatever.
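
In code, the loop above looks something like this. It's a minimal sketch, assuming Python and scikit-learn with made-up data; the two candidate models and the metric (AUROC) are just placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical data: features X and a binary outcome y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

candidate_models = {
    "logistic, all variables": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # same folds for every model

for name, model in candidate_models.items():
    fold_aucs = []
    for train_idx, val_idx in kf.split(X):
        fit = model.fit(X[train_idx], y[train_idx])         # fit on the k-1 training folds
        preds = fit.predict_proba(X[val_idx])[:, 1]         # predict the held out fold
        fold_aucs.append(roc_auc_score(y[val_idx], preds))  # metric for this fold
    print(name, np.mean(fold_aucs), np.var(fold_aucs))      # average and spread across folds
```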

Once we have chosen a model based on the training data, we still want to get an estimate of prediction/estimation error for any new data that would be collected independently of the training data. If we have a new set of data, say we get access to a new season for a sport, then we can get an estimate of error using that new data. So if we decided that a simple logistic model with variables A, B, C, and A^2 is the “best”, we fit that model on all the training data to get coefficient estimates. We then predict outcomes for the new data, compare to the truth, and look at the error/correlation/score/whatever etc. This gives us our estimate of the true error/MSE/accuracy/AUROC etc.

We could instead decide that a LASSO model is best. So we fit LASSO on all our training data and use that to get the selected variables and their coefficients, which we then apply to the new data.
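
Continuing the sketch above (same hypothetical X, y, and imports), the final fit and the new-data error estimate would look something like this:

```python
# Suppose the logistic model "won" the comparison above.
# The final coefficients come from fitting on ALL the training data at once...
final_fit = LogisticRegression(max_iter=1000).fit(X, y)

# ...and the error estimate comes from genuinely new data (say, a new season), if we have it.
X_new = rng.normal(size=(200, 6))  # stand-in for the new season's data
y_new = (X_new[:, 0] + 0.5 * X_new[:, 1] + rng.normal(size=200) > 0).astype(int)
print(roc_auc_score(y_new, final_fit.predict_proba(X_new)[:, 1]))
```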

In the absence of a new data set, we can still use CV to get an estimate of the error. The same goes for a model we had decided on a priori, with certain variables already chosen: fit that model on k-1 folds, apply the fitted model to the held out fold, compare to the truth, and calculate error and other metrics. Do this for all k folds, then average those metrics or look at their distribution.
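
For a single, pre-specified model, scikit-learn will run that fold loop for us. Again a sketch, reusing the hypothetical X and y from above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# k = 5 folds; "roc_auc" is just one possible metric.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())  # CV estimate of the metric and its spread across folds
```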

Remember, statistics is about quantifying uncertainty. Cross validation is a tool for doing that. We do not get the final fitted model during the cross validation step.

Common Mistakes

Common scenarios, mistakes, and how to fix them:

  1. You already have an a priori idea of what model you are going to use to predict a continuous outcome – a linear regression with variables A, B, and C. You want to know the coefficients for A, B, and C.
    • Mistake: Fit that linear model on each of the k training sets (the complements of each held out fold) to get k sets of coefficients. Average those coefficients to get the final model.
    • Correction: Fit the linear model on all the training data at once to estimate coefficients. Test the model fit on totally new data (if new data is available). Or use the k folds to get estimates of error. Either way, you get the coefficients from fitting the model on all the data.
  2. You already have an a priori idea of what model you are going to use to make binary classifications – a threshold model where if the probability of a positive outcome is above some p%, you classify it as positive. You want to know what p should be.
    • Mistake: Fit that threshold model on each of the k training sets (the complements of each held out fold) to get an optimal p_k% for each one. Average p_k across all k folds to get the final threshold p.
    • Correction: Fit the threshold model on all the training data at once to estimate p. Test the model fit on totally new data (if new data is available). Or use the k folds to get estimates of error. Either way, you get p from fitting the model on all the data.
  3. You have many ideas for potential models and want to know which one performs “best.” You fit each model k times, each time on k-1 folds, and compare predictions on the held out fold to the truth to estimate metrics. You decide a logistic regression model gives the best AUROC (your metric of choice).
    • Mistake: You average the coefficients of all k logistic model fits in order to get the final model.
    • Correction: Fit the logistic model on all the training data at once to estimate coefficients. Test the model fit on totally new data (if new data is available). Or use the k folds to get estimates of error. Either way, you get the coefficients from fitting the model on all the data (see the code sketch after this list).
  4. You have many ideas for potential models and want to know which one performs “best.” You fit each model k times, each time on k-1 folds, and compare predictions on the held out fold to the truth to estimate metrics. You decide a LASSO (penalized logistic regression) model gives the best AUROC (your metric of choice).
    • Mistake: You average the coefficients of all k LASSO model fits in order to get the final model. The model fits don’t always select the same variables, so you just take all of them, but assign a coefficient of zero whenever a variable is not chosen.
    • Correction: Fit the LASSO model on all the training data at once to choose variables and estimate coefficients. Test the model fit on totally new data (if new data is available). Or use the k folds to get estimates of error. Either way, you get the variables and coefficients from fitting the model on all the data.
  5. You have many ideas for potential models and want to know which one performs “best.” You fit each model k times, each time on k-1 folds, and compare predictions on the held out fold to the truth to estimate metrics. You decide a random forest model gives the best MSE (your metric of choice).
    • Mistake: You fit the random forest model on all the training data and use it to get predicted outcomes for all your training data samples. You then compare those predictions to the true outcomes and calculate an estimate of the error.
    • Correction:  Test the model fit on totally new data (if new data is available). Or use the k folds to get estimates of error. Estimating error from the data you used to select and fit the model will result in an underestimate of error.
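
To make scenario 3 concrete (the same logic covers scenarios 1, 2, and 4), here is a minimal sketch, again assuming Python/scikit-learn and made-up data:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Hypothetical data, just for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=500) > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# MISTAKE: average the per-fold coefficients to get "the" final model.
fold_coefs = []
for train_idx, _ in kf.split(X):
    fit = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_coefs.append(fit.coef_.ravel())
averaged_coefs = np.mean(fold_coefs, axis=0)  # don't do this

# CORRECTION: the per-fold fits are only for estimating metrics.
# The final coefficients come from fitting on all the training data at once.
final_fit = LogisticRegression(max_iter=1000).fit(X, y)
print(final_fit.coef_.ravel())
```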

Kathy Explains all of Statistics in 30 Seconds and “How to Succeed in Sports Analytics” in 30 Seconds

I spent the weekend of October 19-21 in Pittsburgh at the 2018 CMU Sports Analytics Conference. One of the highlights of the weekend was Sam Ventura asking me to explain causal inference in 15 seconds. I couldn’t quite do it, but it morphed into trying to explain all of statistics in 30 seconds. Which I then had to repeat a few times over the weekend. Figured I’d post it so people can stop asking. I’m expanding slightly.

Kathy Explains all of Statistics in 30 Seconds

Broadly speaking, statistics can be broken up into three categories: description, prediction, and inference.

  • Description
    • Summaries
    • Visualizations
  • Prediction
    • Mapping inputs to outputs
    • Predicting outcomes and distributions
  • Inference/Causal Inference
    • Prediction if the world had been different
    • Counterfactual/potential outcome prediction

I’ll give an example in the sports analytics world, specifically basketball (this part is what I will say if I only have 30 seconds):

  • Description
    • Slicing your data to look at the distribution of points per game (or per 100 possessions or whatever) scored by different lineups
  • Prediction
    • Predicting the number of points your team will score in a game given your planned lineups
  • Inference/Causal Inference
    • Prediction of change in points per game if you ran totally new lineups versus the normal lineups

My day job is working for a tech healthcare company, and the following are the examples I normally use in that world:

  • Description
    • Distributions of patient information for emergency department admissions stratified by length of stay
  • Prediction
    • Predicting length of stay based on patient information present on admission
  • Inference/Causal Inference
    • Prediction of change in length of stay if a chest pain patient had a stress test versus a cardiac catheterization

So, it’s not *all* of statistics. But I think it’s important to understand the different parts of statistics. They have different uses and different interpretations.

More thoughts from the conference

Any time I am at a sports conference there is always the question of “how does one succeed in/break into the field?” Many others have written about this topic, but I’ve started to see a lot of common themes. So….

How to Succeed in Sports Analytics in 30 Seconds

Success in sports analytics/statistics seems to require these 4 abilities:

  • Domain expertise
  • Communication
  • Statistics
  • Coding/programming/CS type skills

Imagine that each area has a max of 10 points. You gotta have at least 5/10 in every category and then like, at least 30 points overall. Yes I am speaking very vaguely. But the point is, you don’t have to be great at everything, but you do have to be great at something and decent at everything.

I don’t feel like I actually know that much about basketball or baseball, or any sport really. I didn’t play any sport in college, and generally when I watch games, I’m just enjoying the game. While watching the Red Sox in the playoffs I don’t really pay attention to the distribution of  David Price’s pitches, I just enjoy watching him pitch. Hell, I spend more time wondering what Fortnite skins Price has. I’ve been guessing Dark Voyager, but he also seems like the kind of guy to buy a new phone just to get the Galaxy skin. Anyway. I’m not an expert, but I do know enough to talk sensibly about sports and to help people with more expertise refine and sharpen their questions.

And I know statistics. And years of teaching during graduate school helped me get pretty damn good at explaining complicated statistical concepts in ways that most people can understand. Plus I can code (though not as well as others). Sports teams are chock full of sports experts; they need experts in other areas too.

These four skills are key to succeeding in any sort of analytical job. I’m not a medical expert, but I work with medical experts in my job and complement their skills with my own.

Concluding thoughts from the conference

Man, no matter what a talk is about, there’s always the questions/comments of “did you think about this other variable” (yes, but it wasn’t available in the data), “could you do it in this other sport…” (is there data available on that sport?), “what about this one example when the opposite happened?” (-_-), “you need to be clearer about how effects are mediated downstream, there’s no way this is a direct effect even if you’ve controlled for all the confounding” (ok that one’s usually me), etc.

Next time, we are going to make bingo cards.


Some Boston Marathon Numbers

I was enjoying the third quarter of the tight Raptors vs Wizards game on Sunday night when my coworker sent me this article on the Boston Marathon, along with the accompanying comments.

Oh my. This article makes me disappointed. So let’s skip Cavs/Pacers and Westworld and dig in.

Introduction

On the surface it feels like the article is going to have math to back up the claim that “men quit and women don’t.” It has *some*:

But finishing rates varied significantly by gender. For men, the dropout rate was up almost 80 percent from 2017; for women, it was up only about 12 percent. Overall, 5 percent of men dropped out, versus just 3.8 percent of women. The trend was true at the elite level, too.

And some attempt to examine more than just the 2018 race:

But at the same race in 2012, on an unusually hot 86-degree day, women also finished at higher rates than men, the only other occasion between 2012 and 2018 when they did. So are women somehow better able to withstand extreme conditions?

But that’s it. No more actual math or analyses. Just some anecdotes and attempts to explain biological or psychological reasons for the difference.

Let’s ignore those reasons (controversial as they may be) and just look at the numbers.

Analysis

The metrics used are ill-defined. There is mention of how the midrace dropout rate was up 50 percent overall from last year, but no split by gender. As quoted above, the finishing rates varied significantly by gender, but no numbers are given. Only the overall dropout rates are reported. What does overall dropout rate mean? I assume it is a combination of runners who dropped before the race began plus those who dropped midrace. And then the overall dropout rates are 3.8% for women and 5% for men. But the splashy number is that men dropped out 80% more than last year whereas women only dropped out 12% more. Is… is that right? I’ve already gone cross-eyed. The whole thing reeks of hacking and obscures the meaning.

There are a lot of numbers here. Some are combined across genders. Some are overall rates, some are midrace. Some are differences across years.

Frustrated with the lack of numbers in the article, I went looking for the actual numbers. I found the data on the official website. I wish it had been linked in the article itself…

2018

Runners:

Category   Entered   Started   Finished   Percent Finished
All        29,978    26,948    25,746     95.50%
Male       16,587    14,885    14,142     95.00%
Female     13,391    12,063    11,604     96.20%

Now we can do some proper statistics.

First, we can perform an actual two sample test and construct confidence intervals to see if there was a difference in finishing rates between genders.

For those who entered the race, the 95% confidence interval for the difference in percent finished between males and females was (-0.022, -0.006).

For those who started the race, the 95% confidence interval for the difference in percent finished between males and females was (-0.017, -0.007).
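
For anyone who wants to check these, here is a minimal sketch of the calculation I'm assuming: a simple two-sample normal-approximation interval for a difference in proportions, which reproduces the intervals above (and, fed dropout counts instead, the dropout intervals below).

```python
import math

def diff_in_proportions_ci(x1, n1, x2, n2, z=1.96):
    """Normal-approximation 95% CI for (male rate) - (female rate)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - z * se, (p1 - p2) + z * se

# 2018 finishers, male vs female (counts from the table above)
print(diff_in_proportions_ci(14142, 16587, 11604, 13391))  # of those who entered -> about (-0.022, -0.006)
print(diff_in_proportions_ci(14142, 14885, 11604, 12063))  # of those who started -> about (-0.017, -0.007)
```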

The difference is technically significant, but not at all interesting. And that is ignoring the fact that we shouldn’t really care about p-values to begin with.

But the article mentions dropout rate, not finishing rate, so let’s use that metric:

Of those who started the race, about 5% of males and 3.8% of females dropped out.

For those who started the race, the 95% confidence interval for the difference in percent dropout between males and females was (0.0069, 0.0168).

So yes, there is a significant difference. But with these kinds of sample sizes, it’s not surprising or interesting to see a tiny significant difference.

But what about 2017? What about the big change from 2017 to 2018? After all, the main splashy metric is the 80% increase in dropout for men.

2017 (numbers from here)

Runners:

Category   Entered   Started   Finished   Percent Finished
All        30,074    27,222    26,400     97.00%
Male       16,376    14,842    14,431     97.20%
Female     13,698    12,380    11,969     96.70%

In 2017, for those who entered the race, the 95% confidence interval for the difference in percent finished between males and females was (-0.00006, 0.01497).

And in 2017, for those who started the race, the 95% confidence interval for the difference in percent finished between males and females was (0.0013, 0.0097).

Of those who started the race in 2017, about 2.8% of males and 3.3% of females dropped out.

For those who started the race in 2017, the 95% confidence interval for the difference in percent dropout between males and females was (-0.0097, -0.0013).

So it does look like women dropped out at a higher rate than men in 2017, the reverse of 2018. But the difference is so tiny that… whatever. This isn’t interesting. But at least now there are actual statistics to back up the claim.

But really, there’s not a lot going on here.

And FINALLY, we can look at the differences from 2017 to 2018.

The dropout rate for females increased from ~3.3% to ~3.8%, which (using the exact numbers) was an increase of about 14.6% (not the 12% reported in the NYT article). The dropout rate for males increased from ~2.8% to ~5.0%, which (using the exact numbers) was an increase of about 80%, as reported.
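
To show where those relative increases come from, a quick sketch using the counts from the tables above:

```python
def dropout_rate(started, finished):
    return (started - finished) / started

female_2017, female_2018 = dropout_rate(12380, 11969), dropout_rate(12063, 11604)  # ~3.3% -> ~3.8%
male_2017, male_2018 = dropout_rate(14842, 14431), dropout_rate(14885, 14142)      # ~2.8% -> ~5.0%

print((female_2018 - female_2017) / female_2017)  # ~0.146, about a 14.6% relative increase
print((male_2018 - male_2017) / male_2017)        # ~0.80, about an 80% relative increase
```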

At least now I understand where these numbers are coming from.

I still don’t buy it. Using dropout numbers instead of finishing numbers makes ratios much larger. An 80% increase in dropout sounds a lot more impressive than a 2% drop in finishing.

And that’s all before we try to compare to other years that might have also had extreme weather. If I had more time or interest I might look at the temperature, humidity, wind speed, wind direction etc for the past 20+ marathons. And then look at differences in dropout/finishing rate for men and women while controlling for weather conditions. That sort of analysis still probably wouldn’t convince me, but it would get closer.

Conclusion

This article is really frustrating. There are just enough scraps of carefully chosen numbers to make differences seem bigger than they really are. Comparing dropout rates to finishing rates is a bit hacky, and then comparing just two years (as opposed to many) gets even hackier. There’s an interesting hypothesis buried in the article and the data. And if we were to pull data on many marathons, we might get closer to actually being able to test if dropout rates vary by gender according to conditions. But the way the data is presented in the article obscures any actual differences and invites controversy. Audiences are eager for guidance with statistics and math. Tossing around a few numbers without explaining them (or giving a link to the source…) is such poor practice.

That Other Site I Work On

This site has been sparse lately and it is because I’ve been busy with two other projects.

The first is my actual day job. I finished my PhD in May of 2017 and began working at Verily Life Sciences in August of 2017. Did I turn down some jobs with pro teams? Yes. Yes I did. Why? That’s a story for another day. I like what I do at Verily. I get to have fun, with people I like, working on cool healthcare projects. Plus we work out of the Google offices in Cambridge which are very nice and full of free food and fun toys.

The second project I’ve been working on is the visualizations section of Udam Saini’s EightThirtyFour.

http://eightthirtyfour.com/visualizations

Udam and I worked together on this site’s NBA foul project, which started as an attempt to quantify how mad DeMarcus Cousins gets in games. We built survival models and visualizations to examine how players accrue fouls. But these models can just as easily be applied to assists, blocks etc. In fact, I took the ideas and examined how Russell Westbrook accrued assists in his historic triple-double season. By using survival models, we can see how the time between assists increased significantly after he reached 10 assists in a game. This could be seen as evidence in favor of stat padding.

The tool we’ve built on the site linked above allows you to look at survival visualizations and models for pretty much any player in seasons between 2011 and 2017. The stats primer linked in the first line has more explanation and some suggestions for players and stats to look at.

Survival analysis models and visualizations are not always the easiest to explain, but I think there is value in having other ways to analyze and examine data. Survival analysis can help us better understand things like fatigue and stat padding. And can help add some math to intangible things like “tilt.”

This project was also a lesson in working on a problem with a proper software engineer. I am a statistician and I’m used to a certain amount of data wrangling and cleaning, but I largely prefer to get data in a nice data frame and go from there. And I certainly don’t have the prowess to create a cool interactive tool on a website that blends SQL and R and any number of other engineer-y things. Well. I’d like to think I could, but it would take ages and look much uglier. And be slower. Conversely, my partner in crime Udam probably can’t sort through all the statistics and R code as fast as I can. My background isn’t even in survival analysis, but I still understand it better than a SWE. So this part of his site was a chance for us to combine powers and see what we could come up with. In between our actual Alphabet jobs, of course.

I think in the world of sports analytics, it’s hard to find somebody who has it all: excellent software engineering skills, deep theoretical knowledge of statistics, and deep knowledge of the sport (be it basketball or another sport). People like that exist, to be sure, but they likely already work for teams or are in other fields. I once tried to be an expert in all three areas and it was very stressful and a lot of work. Once I realized that I couldn’t do it all by myself and started looking for collaborations, I found that I was able to really shine in my expert areas and have way more fun with the work I do.

The same is true in any field. I wasn’t hired by Verily to be a baller software engineer *and* an expert statistician *and* have a deep understanding of a specific health care area. I work with awesome healthcare experts and engineers and get to focus just on my area of expertise.

In both my job and my side sports projects my goal is always to have fun working on cool problems with people I like. It’s more fun to be part of a team.

Anyway, have fun playing with the site, and if you have any suggestions, let us know :]