As of September 1st 2019, I will be the Director of Strategic Research for the Toronto Raptors.
As such, this site will be on hiatus.
Causal inference and other statistics in the world of sports
What follows is the work Sameer Deshpande and I did for the 2019 NFL Big Data Bowl. We will be presenting this work at the Finals on February 27th.
Consider two passing plays during the game between the Los Angeles Rams and visiting Indianapolis Colts in the first week of the 2017 season.
The first passing play was a short pass in the first quarter from Colts quarterback Scott Tolzien intended for T.Y. Hilton which was intercepted by Trumaine Johnson and returned for a Rams touchdown.
The second passing play was a long pass from Rams quarterback Jared Goff to Cooper Kupp, resulting in a Rams touchdown (time stamp 3:39).
In this work, we consider the question: which play had the better route(s)?
From one perspective, we could argue that Kupp’s route was better than Hilton’s; after all, it resulted in the offense scoring while the first play resulted in a turnover and a defensive score. However, evaluating a decision based only on its outcome is not always appropriate or productive. Two recent examples of similar plays come to mind: Pete Carroll’s decision to pass the ball from the 1-yard line in Super Bowl XLIX and the “Philly Special” in Super Bowl LII. Had the results of these two plays been reversed, Pete Carroll might have been celebrated and Doug Pederson criticized.
All this is to say, we shouldn’t condition on the observed outcome alone.
If evaluating plays solely by their outcomes is inadequate, on what basis should we compare routes? Intuitively, we might tend to prefer routes which maximize the receiver’s chance of catching the pass, or completion probability.
If we let y be a binary indicator of whether a pass was caught and let x be a collection of covariates summarizing information about the pass, we can consider a logistic regression model of completion probability:

log[ P(y = 1 | x) / P(y = 0 | x) ] = f(x),

or equivalently, P(y = 1 | x) = 1 / (1 + exp(−f(x))), for some unknown function f.
If we knew the function f, a first pass at assessing a route would be to plug in the relevant covariates x and see whether the forecasted completion probability exceeded some threshold, say 50%. If so, regardless of whether the receiver actually caught the pass, we could say that the route was run and the ball was placed in such a way as to give the receiver a better-than-even chance of catching the pass.
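To make this concrete, here is a toy sketch of the idea. To be clear, this is not our actual model: the two covariates (air yards and receiver separation), their effects, and the linear form of f are all invented for illustration.

```python
# Toy sketch of a completion-probability model; the covariates and their
# effects are invented for illustration, NOT taken from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
air_yards = rng.uniform(0, 40, n)      # how far downfield the pass travels
separation = rng.exponential(3.0, n)   # receiver-defender distance at arrival

# Simulate catches from an assumed "true" f: longer passes are harder,
# more separation helps
logit = 1.0 - 0.08 * air_yards + 0.4 * separation
caught = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([air_yards, separation])
model = LogisticRegression().fit(X, caught)

# Forecast completion probability for one hypothetical pass and apply
# the 50% threshold described above
p_hat = model.predict_proba([[25.0, 1.0]])[0, 1]
print(f"forecasted completion probability: {p_hat:.2f}")
print("better than even" if p_hat > 0.5 else "worse than even")
```

In the real problem, f is probably some flexible nonlinear function of many more inputs; more on that below.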
Wait a minute, what’s f and what’re the inputs x, you might ask? We’ll go into all of the gory details later but suffice it to say: x contains what we’ll call “time of release” variables, which are recorded the moment the ball is thrown, and “time of arrival” variables, which are recorded when the receiver tries to catch the ball. Intuitively, we might expect that catch probability depends on both of these. And f, well, f is probably some crazy nonlinear function of a bunch of variables. See Post 2 for more details.
We could then directly compare the forecasted completion probabilities of the two plays mentioned above; if it turned out that the Tolzien interception had a higher completion probability than the Kupp touchdown, that play would not seem as bad, despite the much worse outcome [spoiler: it wasn’t].
But why stop there? There are usually multiple eligible receivers running routes on a given pass play. What can we say about the nontargeted receivers? In particular, if the quarterback threw to a different location along a possibly different receiver’s route, can we predict the catch probability? It turns out, this is challenging for two fundamental reasons.
First, even if we knew the true function f, we are essentially trying to deduce what might have happened in a counterfactual world where the quarterback had thrown the ball to a different player at a different time, with the defense reacting differently. On such a counterfactual pass, we do not observe any “time of arrival” variables that may be predictive of completion probability. Figure 1 illustrates this issue, showing schematics for an observed pass (left panel) and a hypothetical pass (right panel). In both passes, there are two receivers running routes; we have colored the route of the intended receiver in blue and the route of the other receiver in gray.
Before proceeding, let’s pause for a moment to distinguish between our use of the term “counterfactual” and its use in causal inference.
Sameer and I are both fairly embedded in the world of causal inference (though he doesn’t have a twitter handle, email and website that prominently displays his love of all things causal. Rejoinder from Sameer: Bayes is bae. I make no apologies.) and it feels weird to use the term “counterfactual” and not elaborate.
The general causal framework of counterfactuals supposes that we change some treatment or exposure variable and asks what happens to downstream outcomes. In contrast, in this work we consider changing a midstream variable, the location of the intended receiver when the ball arrives, and then impute both upstream and downstream variables like the time of the pass and the receiver separation at the time the ball arrives. Here we use “counterfactual” interchangeably with “hypothetical” (while an unobserved pass is hypothetical, the intended receiver of that pass is not) and hope our more liberal usage is not a source of further confusion below.
Ok, I’ve said my piece.
The second fundamental challenge: we typically do not know the function f and must therefore estimate it using the observed data. Even if we knew how to overcome the issue of unobserved “time of arrival” inputs for the hypothetical passes, we would still need to estimate f in a way that lets us quantify uncertainty about downstream functionals of it: any estimation uncertainty about f propagates to uncertainty about the hypothetical completion probabilities.
So to recap: we’re positing there’s some true function f that takes in “time of release” variables and “time of arrival” variables and outputs the log-odds of a receiver catching the pass. We don’t know this function, so we need to estimate f. We then want to take this estimate and plug in inputs about hypothetical passes to predict the completion probability for every receiver involved, at all times during a play. Unfortunately, we don’t actually know the values of the “time of arrival” variables for the hypothetical passes.
If you’re still with us, you might be thinking, “Wait a second! I can sidestep the fact that we never observe the hypothetical ‘time of arrival’ variables by letting f only depend on ‘time of release’ variables.” And you’d technically be right! But it strains credulity to believe, for instance, that how far a receiver is from his closest defender doesn’t affect his chances of catching the ball. So restricting f to exclude “time of arrival” variables seems like a decidedly arbitrary solution to our first challenge. Technically, we’d need to first establish that a model of catch probability that accounts for “time of arrival” variables predicts better than one that does not, but we’re willing to make this intuitive assumption for now.
OK, so we want to evaluate a function that we’re uncertain about at inputs about which we’re also uncertain. In this work, we overcome both challenges. Using tracking, play, and game data from the first six weeks of the 2017 NFL season, we developed Expected Hypothetical Completion Probability (EHCP).
At a high level, our framework consists of two steps:

1. Fit a catch probability model, estimating f from the observed passes in a way that quantifies our uncertainty about it.
2. For a hypothetical pass, impute the unobserved “time of arrival” variables and average the forecasted completion probabilities over those imputations.
In Part 2 of this blog post series, we will describe our Bayesian procedure for fitting a catch probability model like in the equation above and outline the EHCP framework.
In Part 3, we will discuss the results of our catch probability model and illustrate the EHCP framework on several routes.
Finally in Part 4, we will conclude with a discussion of potential methodological improvements and refinements and potential uses of our EHCP framework.
What follows is some info on the work Sameer Deshpande and I did for the 2019 NFL Big Data Bowl. We will be presenting this work at the BDB Finals at the NFL Combine in Indianapolis on February 27th.
We are in the process of putting together a series of blog posts that will explain our method in, hopefully, an easily digestible way. Until then, we wanted to share a copy of the paper as it was submitted to the contest.
Expected Hypothetical Completion Probability – link to pdf
We note that there are a few caveats:
1. This is very much proof-of-concept. EHCP is a modular framework that involves lots of pieces. We have put the pieces together, but none are optimized at the moment.
2. There are many technical and conceptual details to discuss. We’re going to dive into many of these details in the coming blog posts. Additionally, we’re happy to discuss the paper with particularly interested parties.
That being said,
3. Please be patient! We’re posting the paper we submitted to the Big Data Bowl contest. We recognize that the writeup is somewhat technical and terse when it comes to the finer details of our methodology. Over the last few weeks, we’ve received some great feedback and questions from some of our friends and colleagues in sports and academia. Our plan in the next several posts is to respond to this feedback and hopefully address a bunch of initial questions. So please be patient with us; if you send us a bunch of burning questions and we don’t respond, it’s not entirely because we’re avoiding you.
Q: Did you think about including other variables, such as QB pressure, time from snap to throw, defensive schemes, player information, etc?
A: We considered many variables, but had to limit scope due to time constraints. Incorporating additional variables is a clear opportunity for further work.
Q: Why BART?
A: Over an ever-growing range of problems, BART has demonstrated really great predictive performance with minimal hyperparameter tuning and without the need to prespecify a specific functional relationship between inputs and outputs. While we didn’t do it in our analysis, BART can also be adapted to do feature selection. At the same time, it’s totally plausible that another regression technique would be effective for the problem.
Q: Did you consider other outcomes like YAC or expected yards gained?
A: We did. Ultimately, we may want to maximize the expected value of a play, E[value | x], which we can further decompose as:

E[value | x] = E[value | catch, x] * P(catch | x) + E[value | no catch, x] * P(no catch | x)

We focused on the P(catch | x) part.
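Plugging made-up numbers into the decomposition shows how the pieces combine (none of these values come from our analysis):

```python
# Made-up numbers illustrating the decomposition
# E[value | x] = E[value | catch, x] * P(catch | x)
#             + E[value | no catch, x] * P(no catch | x)
p_catch = 0.65            # P(catch | x)
value_if_catch = 12.0     # e.g. expected yards gained given a completion
value_if_no_catch = -0.4  # e.g. expected value given an incompletion

expected_value = value_if_catch * p_catch + value_if_no_catch * (1 - p_catch)
print(round(expected_value, 2))  # 7.66
```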
Q: Wait a minute! You need to do a better job of modeling the conditional distribution of the unobserved variables on the observed ones. There is no way they are independent. Especially since they may change as the route develops.
A: That’s not really a question, but we agree. Handling the missing variables is one of the modular parts of the framework and can be optimized independently of the other parts. It is an interesting missing data question in its own right.
Q: Who is the best QB/WR?
A: We didn’t have enough data to draw any strong conclusions. Jameis Winston looked great in the data we had available to us.
Q: Does this run on the block chain?
A: 😑
Q: Did y’all try deep lear–
A: No.
Recently I’ve seen a lot of misunderstanding about how/why cross validation (CV) is used in model selection/fitting. I’ve seen it misused in a number of ways and in a number of settings. I thought it might be worth it to write up a quick, hopefully accessible guide to CV.
Part 1 will cover general ideas. There won’t be any practical sports examples as I am trying to be very general here. I’ll have part 2 up in December with a practical example using some NBA shot data. I’ll post the code in a colab so people can see CV in action.
(There may be some slight abuse of notation)
Cross validation is generally about model selection. It also can be used to get an estimate of error.
Statistics is about quantifying uncertainty. Cross validation is a way to quantify uncertainty for many models and use the uncertainty to select one. Cross validation also can be used to quantify the uncertainty of a single model.
I think it is important to be clear about what I mean by a model. In simple terms, a statistical model is the way we are going to fit the data. A model encodes the underlying assumptions we have about the data generating process.
Examples of models might be:

- A linear model with variables A, B, C, and D
- A logistic model with variables A, B, C, and A^2
- A LASSO model that performs its own variable selection
When I refer to a model, I mean the general framework of the model, such as linear with variables A,B, C, and D. When I refer to a model fit, I mean the version of that model fit to the data, such as the coefficients on A, B, C, and D in a linear model (and intercept if needed).
Generally we use cross validation to pick a model from a number of options. It helps us avoid overfitting. CV also helps us refine our beliefs about the underlying data generating process.
In the case of outcome prediction, we often need to tune the inputs used in the model (or the data needed, or whatever), explore different types of models, or determine which independent variables are of interest. We use CV to get estimates of error/metrics for various models (score, correlation, MSE, whatever) and pick one model from there. We can have CV do feature selection as well, but then we are testing that particular variable selection method, not the variables chosen by it. For example, we can use CV on a LASSO model (which incorporates variable selection), in which case we are testing LASSO, not the variables it selected. We could also test some particular set of variables in a model of their own.
For each candidate model:

1. Split the training data into k folds.
2. Fit the model on k-1 of the folds.
3. Apply the fitted model to the held-out kth fold, compare predictions to the truth, and calculate your error/metrics.
4. Repeat until every fold has been held out once.
Fit all the models this way and, generally, take the average of whatever metrics you calculated for each fold. But we might also look at the variance of the metrics. Then look at which model performs “best.” The definition of “best” will depend on what we care about. It could be about maximizing true positives and true negatives. Or minimizing squared error loss. Or getting within X%. Whatever.
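Here is a minimal sketch of that loop in Python, comparing two hypothetical candidate models on synthetic data (the data, the models, and the quadratic "true" signal are all stand-ins):

```python
# K-fold CV to choose between two candidate models on synthetic data.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (300, 1))
y = 2.0 * X[:, 0] ** 2 + rng.normal(0, 1.0, 300)  # quadratic signal + noise

candidates = {
    "linear": LinearRegression(),
    "quadratic": make_pipeline(PolynomialFeatures(2), LinearRegression()),
}

cv_mse = {}
for name, model in candidates.items():
    fold_errors = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model.fit(X[train_idx], y[train_idx])              # fit on k-1 folds
        preds = model.predict(X[test_idx])                 # predict held-out fold
        fold_errors.append(mean_squared_error(y[test_idx], preds))
    cv_mse[name] = np.mean(fold_errors)                    # average across folds

best = min(cv_mse, key=cv_mse.get)
print(cv_mse, "->", best)
```

Here the metric is MSE, but as noted above, "best" could just as well mean maximizing true positives or whatever else you care about.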
Once we have chosen a model based on the training data, we still want to get an estimate of prediction/estimation error for any new data that would be collected independently of the training data. If we have a new set of data, say we get access to a new season for a sport, then we can get an estimate of error using that new data. So if we decided that a simple logistic model with variables A, B, C, and A^2 is the “best”, we fit that model on all the training data to get coefficient estimates. We then predict outcomes for the new data, compare to the truth, and look at the error/correlation/score/whatever etc. This gives us our estimate of the true error/MSE/accuracy/AUROC etc.
We could instead decide that a LASSO model is best. So we fit LASSO on all our training data and use that to get coefficients/variables which we then apply to the validation data.
In the absence of a new data set, we still need to estimate the error. We can use CV to get an estimate of the error. We could do the same thing if we had a priori decided to use a model with certain variables. We take the model we decided on a priori, fit on k-1 folds, apply the fitted model to the held-out kth fold, compare to truth, calculate error and other metrics, etc. Do this for all folds and get metrics for all k folds. Then we can average those metrics, or look at their distribution.
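Sketching that final step with synthetic data standing in for the "new season" (the model and data-generating process here are invented for illustration):

```python
# After CV has picked a model, refit it on ALL the training data, then
# estimate error on genuinely new data (a "new season"). Synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)

def simulate_season(n):
    X = rng.uniform(-3, 3, (n, 1))
    y = 1.5 * X[:, 0] + rng.normal(0, 0.5, n)  # linear signal, noise sd 0.5
    return X, y

X_train, y_train = simulate_season(500)  # data used for CV and the final fit
X_new, y_new = simulate_season(200)      # the held-out "new season"

final_model = LinearRegression().fit(X_train, y_train)  # fit once on everything
new_data_mse = mean_squared_error(y_new, final_model.predict(X_new))
print(round(new_data_mse, 3))  # should land near the noise variance, 0.25
```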
Remember, statistics is about quantifying uncertainty. Cross validation is a tool for doing that. We do not get the final fitted model during the cross validation step.
Common scenarios, mistakes, and how to fix them:
The slides for my talk from the 2018 CMU Sports Analytics Conference can be found here.
The talk concerns NBA fouls rates and is a condensed version of the blog posts found on this very site (Part 1 can be found here). However the numbers are updated to include more recent seasons.
I spent the weekend of October 19-21 in Pittsburgh at the 2018 CMU Sports Analytics Conference. One of the highlights of the weekend was Sam Ventura asking me to explain causal inference in 15 seconds. I couldn’t quite do it, but it morphed into trying to explain all of statistics in 30 seconds. Which I then had to repeat a few times over the weekend. Figured I’d post it so people can stop asking. I’m expanding slightly.
Broadly speaking, statistics can be broken up into three categories: description, prediction, and inference.
I’ll give an example in the sports analytics world, specifically basketball (this part is what I will say if I only have 30 seconds):
My day job is working for a tech healthcare company, and the following are the examples I normally use in that world:
So, it’s not *all* of statistics. But I think it’s important to understand the different parts of statistics. They have different uses and different interpretations.
Any time I am at a sports conference there is always the question of “how does one succeed in/break into the field?” Many others have written about this topic, but I’ve started to see a lot of common themes. So….
Success in sports analytics/statistics seems to require these 4 abilities:

1. Knowledge of the sport
2. Knowledge of statistics
3. The ability to explain complicated concepts clearly
4. The ability to code
Imagine that each area has a max of 10 points. You gotta have at least 5/10 in every category and then like, at least 30 points overall. Yes I am speaking very vaguely. But the point is, you don’t have to be great at everything, but you do have to be great at something and decent at everything.
I don’t feel like I actually know that much about basketball or baseball, or any sport really. I didn’t play any sport in college, and generally when I watch games, I’m just enjoying the game. While watching the Red Sox in the playoffs I don’t really pay attention to the distribution of David Price’s pitches, I just enjoy watching him pitch. Hell, I spend more time wondering what Fortnite skins Price has. I’ve been guessing Dark Voyager, but he also seems like the kind of guy to buy a new phone just to get the Galaxy skin. Anyway. I’m not an expert, but I do know enough to talk sensibly about sports and to help people with more expertise refine and sharpen their questions.
And I know statistics. And years of teaching during graduate school helped me get pretty damn good at explaining complicated statistical concepts in ways that most people can understand. Plus I can code (though not as well as others). Sports teams are chock full of sports experts, they need experts in other areas too.
These four skills are key to succeeding in any sort of analytical job. I’m not a medical expert, but I work with medical experts in my job and complement their skills with my own.
Man, no matter what a talk is about, there’s always the questions/comments of “did you think about this other variable” (yes, but it wasn’t available in the data), “could you do it in this other sport…” (is there data available on that sport?), “what about this one example when the opposite happened?” (_), “you need to be clearer about how effects are mediated downstream, there’s no way this is a direct effect even if you’ve controlled for all the confounding” (ok that one’s usually me), etc.
Next time, we are going to make bingo cards.
I was enjoying the third quarter of the tight Raptors vs Wizards game on Sunday night when my coworker sent me this article and the accompanying comments on the Boston Marathon:
Oh my. This article makes me disappointed. So let’s skip Cavs/Pacers and Westworld and dig in.
On the surface it feels like the article is going to have math to back up the claim that “men quit and women don’t.” It has *some:*
But finishing rates varied significantly by gender. For men, the dropout rate was up almost 80 percent from 2017; for women, it was up only about 12 percent. Overall, 5 percent of men dropped out, versus just 3.8 percent of women. The trend was true at the elite level, too.
And some attempt to examine more than just the 2018 race:
But at the same race in 2012, on an unusually hot 86-degree day, women also finished at higher rates than men, the only other occasion between 2012 and 2018 when they did. So are women somehow better able to withstand extreme conditions?
But that’s it. No more actual math or analyses. Just some anecdotes and attempts to explain biological or psychological reasons for the difference.
Let’s ignore those reasons (controversial as they may be) and just look at the numbers.
The metrics used are ill-defined. There is mention of how the mid-race dropout rate was up 50 percent overall from last year, but no split by gender. As quoted above, the finishing rates varied significantly by gender, but no numbers are given. Only the overall dropout rates are reported. What does overall dropout rate mean? I assume it is a combination of runners who dropped before the race began plus those who dropped mid-race. And then the overall dropout rates are 3.8% for women and 5% for men. But the splashy number is that men dropped out 80% more than last year whereas women only dropped out 12% more. Is… is that right? I’ve already gone cross-eyed. The whole thing reeks of hacking and obscures the meaning.
There are a lot of numbers here. Some are combined across genders. Some are overall rates, some are mid-race. Some are differences across years.
Frustrated with the lack of numbers in the article, I went looking for the actual numbers. I found the data on the official website. I wish it had been linked in the article itself…
2018

| Category | Number Entered | Number Started | Number Finished | Percent Finished |
| --- | --- | --- | --- | --- |
| All runners | 29,978 | 26,948 | 25,746 | 95.50% |
| Male | 16,587 | 14,885 | 14,142 | 95.00% |
| Female | 13,391 | 12,063 | 11,604 | 96.20% |
Now we can do some proper statistics.
First, we can perform an actual two sample test and construct confidence intervals to see if there was a difference in finishing rates between genders.
For those who entered the race, the 95% confidence interval for the difference in percent finished between males and females was (-0.022, -0.006).
For those who started the race, the 95% confidence interval for the difference in percent finished between males and females was (-0.017, -0.007).
The difference is technically significant, but not at all interesting. And that is ignoring the fact that we shouldn’t really care about p-values to begin with.
But the article mentions dropout rate, not finishing rate, so let’s use that metric:
Of those who started the race, about 5% of males and 3.8% of females dropped out.
For those who started the race, the 95% confidence interval for the difference in percent dropout between males and females was (0.0069, 0.0168).
So yes, there is a significant difference. But with these kinds of sample sizes, it’s not surprising or interesting to see a tiny significant difference.
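For anyone who wants to check the arithmetic, both intervals can be reproduced from the table with a standard normal-approximation interval for a difference of proportions (male minus female):

```python
# Two-sample normal-approximation CIs for the 2018 Boston Marathon,
# using the started/finished counts from the table above.
from math import sqrt

def diff_ci(x1, n1, x2, n2, z=1.96):
    """95% normal-approximation CI for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2 - z * se, p1 - p2 + z * se

# Finishing rates among starters (male minus female)
lo, hi = diff_ci(14142, 14885, 11604, 12063)
print(round(lo, 3), round(hi, 3))  # -0.017 -0.007

# Dropout rates among starters (dropouts = started - finished)
lo_d, hi_d = diff_ci(14885 - 14142, 14885, 12063 - 11604, 12063)
print(round(lo_d, 4), round(hi_d, 4))
```

The second interval matches the (0.0069, 0.0168) quoted above up to rounding.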
But what about 2017? What about the big change from 2017 to 2018? After all the main splashy metric is the 80% increase in dropout for men.
2017 (numbers from here)

| Category | Number Entered | Number Started | Number Finished | Percent Finished |
| --- | --- | --- | --- | --- |
| All runners | 30,074 | 27,222 | 26,400 | 97.00% |
| Male | 16,376 | 14,842 | 14,431 | 97.20% |
| Female | 13,698 | 12,380 | 11,969 | 96.70% |
In 2017, for those who entered the race, the 95% confidence interval for the difference in percent finished between males and females was (-0.00006, 0.01497).
And in 2017, for those who started the race, the 95% confidence interval for the difference in percent finished between males and females was (0.0013, 0.0097).
Of those who started the race in 2017, about 2.8% of males and 3.3% of females dropped out.
For those who started the race in 2017, the 95% confidence interval for the difference in percent dropout between males and females was (-0.0097, -0.0013).
So in 2017 women actually dropped out at a higher rate than men, the reverse of 2018. But the difference is so tiny that… whatever. This isn’t interesting. But at least now there are actual statistics to back up the claim.
But really, there’s not a lot going on here.
And FINALLY, we can look at the differences from 2017 to 2018.
The dropout rate for females increased from ~3.3% to ~3.8% which (using the exact numbers) was an increase of about 14.6% (not the 12% reported in the NYT article). The dropout rate for males increased from ~2.8% to ~5.0% which (using the exact numbers) was an increase of about 80% as reported.
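Those percentages check out against the started/finished counts in the two tables (treating dropouts as started minus finished):

```python
# Year-over-year dropout increases from the 2017 and 2018 tables above.
male_2017 = (14842 - 14431) / 14842    # ~2.8% dropout
male_2018 = (14885 - 14142) / 14885    # ~5.0% dropout
female_2017 = (12380 - 11969) / 12380  # ~3.3% dropout
female_2018 = (12063 - 11604) / 12063  # ~3.8% dropout

male_increase = male_2018 / male_2017 - 1
female_increase = female_2018 / female_2017 - 1
print(f"male: {male_increase:.1%}, female: {female_increase:.1%}")  # male: 80.3%, female: 14.6%
```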
At least now I understand where these numbers are coming from.
I still don’t buy it. Using dropout numbers instead of finishing numbers makes ratios much larger. An 80% increase in dropout sounds a lot more impressive than a 2% drop in finishing.
And that’s all before we try to compare to other years that might have also had extreme weather. If I had more time or interest I might look at the temperature, humidity, wind speed, wind direction etc for the past 20+ marathons. And then look at differences in dropout/finishing rate for men and women while controlling for weather conditions. That sort of analysis still probably wouldn’t convince me, but it would get closer.
This article is really frustrating. There are just enough scraps of carefully chosen numbers to make differences seem bigger than they really are. Comparing dropout rates to finishing rates is a bit hacky, and then comparing just two years (as opposed to many) gets even hackier. There’s an interesting hypothesis buried in the article and the data. And if we were to pull data on many marathons, we might get closer to actually being able to test if dropout rates vary by gender according to conditions. But the way the data is presented in the article obscures any actual differences and invites controversy. Audiences are eager for guidance with statistics and math. Tossing around a few numbers without explaining them (or giving a link to the source…) is such poor practice.