“You know, for having the name ‘Causal Kathy’ you don’t seem to post that much causal inference” – my beloved mentor giving me some solid life advice.
So alright, let’s talk about causal inference. First of all, what is causal inference? It’s a subfield of statistics that focuses on causation instead of just correlations. Think of this classic xkcd comic:
To give a more concrete example, let’s think about a clinical trial for some new drug and introduce some terminology along the way. Say we develop a drug that is supposed to lower cholesterol levels. In a perfect world, we could observe what happens to a patient’s cholesterol levels when we give him the drug and when we don’t. But we can never observe both outcomes in a world without split timelines and/or time machines. We call the two outcomes, under treatment and no treatment, “potential outcomes” or “counterfactuals.” The idea is that both outcomes have the potential to be true, but we can only ever observe one of them, so the other would be “counter to the fact of the truth.” Therefore, true, individual causal effects can never be calculated.
However, we can get at average causal effects across larger populations. Thinking of our drug example, we could enroll 1,000 people in a study and measure their baseline cholesterol levels. Then we could randomly assign 500 of those subjects to take the drug and the remaining 500 to take a placebo. After a given period of time, we would measure everyone’s cholesterol levels again and compute each person’s change from baseline. Then we could calculate the average change in each group and compare them. And since we randomized treatment assignment, the difference between those two averages would give us a good idea of the causal effect of the drug on cholesterol.
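To make the thought experiment concrete, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: the effect size of -20, the noise level, and the variable names are assumptions, not numbers from any real trial. The point is only that the simulation knows both potential outcomes, the "study" sees one per subject, and randomization still recovers the average effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
true_effect = -20.0  # assumed average drug effect on cholesterol change (made up)

# Both potential outcomes exist inside the simulation,
# even though a real study only ever observes one per subject.
change_if_placebo = rng.normal(0.0, 10.0, size=n)
change_if_drug = change_if_placebo + true_effect

# Randomly assign exactly 500 of the 1,000 subjects to the drug.
on_drug = np.zeros(n, dtype=bool)
on_drug[rng.choice(n, size=n // 2, replace=False)] = True

# Each subject reveals only the potential outcome matching their assignment.
observed_change = np.where(on_drug, change_if_drug, change_if_placebo)

# Difference in average change between the two arms.
estimated_effect = observed_change[on_drug].mean() - observed_change[~on_drug].mean()
print(f"estimated average effect: {estimated_effect:.1f} (simulated truth: {true_effect})")
```

Individual effects stay hidden (we never compare `change_if_drug` and `change_if_placebo` for the same person using observed data), but the group comparison lands close to the simulated truth.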
However, we can’t always randomize. In fact, it’s rare that we can. Many things can stand in the way of a well-controlled randomized trial. For example, if we wanted to test whether or not smoking causes cancer, it would be unethical to randomly assign 500 subjects to smoke for the rest of their lives and 500 subjects not to smoke and see who develops cancer. Similarly, if we want to know whether or not a catch & shoot (C&S) shot in basketball has a positive effect on FG%, we can’t randomly assign shots to be C&S vs pull-up.
In general, basic statistics and parametric models only give correlations. When you fit a basic regression, the coefficients do not have causal interpretations. This is not to say that regressions and correlations aren’t interesting and useful – if you’ve been following my series on NBA fouls, you’ll note that I don’t do causal inference there. Associations are interesting in their own right. But today, let’s start looking at causal effects for C&S shots in the NBA.
Side note: If you’d like to read more about causal inference in the world of sports statistics, this recent post from Mike Lopez is great.
We’re going to be fitting models, so the first thing we’ll do is remove outlier shots. In this case I’m going to subset and only look at shots that were less than 30 feet (so between 10 and 30 feet). All the previous analysis was basic and non-parametric, so including rare shots (like buzzer beater half court heaves) wouldn’t have a large impact. However, now we are using models and making parametric assumptions, so outliers and influential points will have a bigger impact. It makes sense to remove them at this point.
Statistics side note: Part of the reason we need to remove outliers is that our analysis method will involve fitting “propensity scores” – the probability of treatment given the non-outcome variables. In this case we will model the probability a shot is a C&S given shot distance, defender distance, whether a shot was open, and the shot clock. For unusual outlier shots, the distance will often be abnormally long and it will be rare that the shot is successful. Thus if we left those shots in the data set, we would run into positivity problems (positivity requires that every kind of shot in the data has some chance of being either a C&S or a pull-up).
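As a rough illustration of what fitting propensity scores can look like, here is a sketch in Python using a plain logistic regression fit by Newton's method. This is an assumption on my part, not a description of the exact model behind the results in this post. The shot data are simulated, and the column names, the 4-foot "open" cutoff, and the 0.02/0.98 thresholds for flagging extreme scores are all hypothetical.

```python
import numpy as np

def fit_logistic(X, t, n_iter=25):
    """Logistic regression by Newton's method; returns coefficients, intercept first."""
    design = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (t - p)
        hessian = (design * (p * (1.0 - p))[:, None]).T @ design
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def predict_propensity(X, beta):
    """Estimated probability that each shot is a C&S."""
    design = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-design @ beta))

# Simulated shots between 10 and 30 feet (all values made up).
rng = np.random.default_rng(1)
n = 20_000
shot_dist = rng.uniform(10.0, 30.0, size=n)
def_dist = rng.gamma(shape=4.0, scale=1.0, size=n)
is_open = (def_dist >= 4.0).astype(float)  # hypothetical cutoff for "open"
shot_clock = rng.uniform(0.0, 24.0, size=n)

# Assumed data-generating model for whether a shot is a C&S.
true_logit = (-0.2 + 0.15 * (def_dist - 4.0)
              - 0.04 * (shot_dist - 20.0) + 0.02 * (shot_clock - 12.0))
is_cs = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

# Standardize the covariates so Newton's method is well behaved.
X = np.column_stack([shot_dist, def_dist, is_open, shot_clock])
X = (X - X.mean(axis=0)) / X.std(axis=0)

beta = fit_logistic(X, is_cs)
scores = predict_propensity(X, beta)

# Crude positivity check: what share of shots have scores near 0 or 1?
extreme_share = np.mean((scores < 0.02) | (scores > 0.98))
print(f"score range: {scores.min():.3f} to {scores.max():.3f}")
print(f"share of shots with extreme scores: {extreme_share:.4f}")
```

A large share of scores piled up near 0 or 1 would be a warning sign that some kinds of shots are essentially always (or never) C&S, which is the positivity problem described above.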
Also, we aren’t going to look at the average causal effect of a C&S. That estimate assumes that all shots could be a C&S or a pull-up, and the player chose one or the other based on some variables. Most pull-up shots couldn’t have been a C&S, even if the player wanted them to be. But most C&S shots could have been pull-up shots. The player could have elected to hold the ball for longer and dribble a few times before taking the shot. Or even driven to the basket. Of course things are more nuanced. For example, some players (like my beloved Shaun Livingston) will almost never take a 3 point shot, C&S or pull-up. While others (like his teammate Klay Thompson) are loath to pass up a chance to take a C&S. Therefore looking at an average causal effect is not a great strategy. Instead we will look at the effect of a C&S on shots that were C&S shots. In the literature this is called the effect of treatment on the treated (ETT) or sometimes the average treatment effect on the treated (ATT). I use ETT, partially because that is what I was taught and partially because my advisor’s initials are ETT and it makes me giggle.
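For readers who want to see what an ETT estimate can look like in code, here is a minimal Python sketch of one common estimator: compare the treated mean to a reweighted control mean, where each control is weighted by its propensity odds e/(1-e). I am not claiming this is the exact functional used for the results in this post. The data are simulated with a single made-up confounder, a constant effect of 0.04, and the true propensity scores plugged in (in practice those would have to be estimated).

```python
import numpy as np

def ett_by_odds_weighting(y, t, e):
    """ETT estimate: treated mean minus odds-weighted control mean.

    y: outcomes (1 = made shot), t: treatment (1 = C&S), e: propensity scores.
    """
    treated = t == 1
    odds = e[~treated] / (1.0 - e[~treated])
    # Reweight pull-ups so their covariates resemble those of the C&S shots.
    counterfactual_mean = np.average(y[~treated], weights=odds)
    return y[treated].mean() - counterfactual_mean

# Simulated shots with one made-up confounder x (think "how open the shot is").
rng = np.random.default_rng(2)
n = 200_000
x = rng.uniform(-2.0, 2.0, size=n)
e = 1.0 / (1.0 + np.exp(-x))          # true propensity: open shots are more often C&S
t = rng.binomial(1, e)

# Open shots are also easier to make, so x confounds the raw comparison.
p_make_pullup = 1.0 / (1.0 + np.exp(-(-0.5 + 0.4 * x)))
true_ett = 0.04                        # assumed constant effect of C&S
y = rng.binomial(1, p_make_pullup + true_ett * t)

naive = y[t == 1].mean() - y[t == 0].mean()
ett_hat = ett_by_odds_weighting(y, t, e)
print(f"naive difference: {naive:.3f}")
print(f"odds-weighted ETT estimate: {ett_hat:.3f} (simulated truth: {true_ett})")
```

In this simulation the naive difference overstates the effect because C&S shots tend to be the easier ones, while the weighted estimate should land near the simulated truth. That only works because the confounder is fully observed here, which is a strong assumption for real shot data.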
Below we see the effect of C&S on FG% on C&S shots. This effect is calculated for threes and twos. The estimates are computed with functionals that are functions of the observed data; they are not coefficients in a regression. As such, I calculated means, standard errors, and confidence intervals by repeatedly bootstrapping the data (250 times) and using weights randomly drawn from an exponential distribution with mean 1. This is a way to bootstrap the data without having to draw shots with replacement and uses every shot in every resample.
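The reweighting scheme described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the estimator here is a plain weighted difference in FG% standing in for the real functional, the data are simulated, and summarizing the draws with a percentile interval is my choice for illustration. The function and variable names are hypothetical.

```python
import numpy as np

def weighted_fg_difference(made, is_cs, w):
    """Weighted FG% of C&S shots minus weighted FG% of pull-ups (stand-in estimator)."""
    cs = is_cs == 1
    return np.average(made[cs], weights=w[cs]) - np.average(made[~cs], weights=w[~cs])

def exponential_weight_bootstrap(made, is_cs, estimator, n_boot=250, seed=0):
    """Re-estimate n_boot times, each time with fresh Exponential(mean 1) weights."""
    rng = np.random.default_rng(seed)
    draws = np.array([
        estimator(made, is_cs, rng.exponential(scale=1.0, size=len(made)))
        for _ in range(n_boot)
    ])
    lower, upper = np.percentile(draws, [2.5, 97.5])
    return draws.mean(), draws.std(ddof=1), (lower, upper)

# Simulated shots (made-up make probabilities of 0.36 and 0.40).
rng = np.random.default_rng(3)
n = 5000
is_cs = rng.binomial(1, 0.5, size=n)
made = rng.binomial(1, 0.36 + 0.04 * is_cs)

boot_mean, boot_se, (ci_low, ci_high) = exponential_weight_bootstrap(
    made, is_cs, weighted_fg_difference
)
print(f"mean {boot_mean:.3f}, SE {boot_se:.3f}, 95% interval ({ci_low:.3f}, {ci_high:.3f})")
```

Because every shot gets a positive weight in every replicate, no shot is ever dropped from a resample. In a full analysis, any model feeding the estimator (such as a propensity model) would presumably be refit with the same weights inside each replicate.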
All estimates control for the following potential confounders: shot distance, defender distance, an indicator for whether a shot was open, and the time left on the shot clock. I do not think this is a rich enough set of variables to fully control for confounding, but it’s a decent start.
We can see that not dribbling and possessing the ball for less than 2 seconds (the catch & shoot definition) does have a statistically significant effect on FG%. The effect size is small but positive, about 0.04 for both two and three point shots. This means that a C&S three point attempt is about 4 percentage points more likely to be successful than if that same shot were taken as a pull-up. This is a very small causal effect.
To me, this shows that the effect of a C&S isn’t as big as the raw numbers suggest. I calculated a few other measures of causal effects, both for ETT and ATE, but found nothing significant. I’m certain that the modeling assumptions required are not fully met, which may be biasing results towards the null. Were I to move forward on this project, I would dig deeper into the models I’m using and try to get a better understanding of the best way to model and predict both why a player elects to take a certain type of shot and what makes a shot more likely to be successful.
When I set out on this project, I was mostly just upset about the definition of a catch & shoot, since it didn’t take openness into account. I like to think I’ve made my case. If I had to make a change, I’d want the NBA to track open C&S shots as a separate statistic. Maybe even split it out further into twos and threes, or at least emphasize EFG% over FG%. The actual causal effect of a C&S isn’t that big – I’d rather keep track of a stat that does have a big effect.
I’ll probably let this project go for a while. A lot of other people are looking at it and doing a good job of it. I can add the causal component, but I’d rather look at under-examined areas.