Catch & Shoot – Causal Effects

This is Part 3 of my series on Catch & Shoot jumpers
Part 1 can be found here
Part 2 can be found here

“You know, for having the name ‘Causal Kathy’ you don’t seem to post that much causal inference” – my beloved mentor giving me some solid life advice.

So alright, let’s talk about causal inference. First of all, what is causal inference? It’s a subfield of statistics that focuses on causation instead of just correlations. Think of this classic xkcd comic:

[Image: xkcd comic, “Correlation”]

To give a more concrete example, let’s think about a clinical trial for some new drug and introduce some terminology along the way. Say we develop a drug that is supposed to lower cholesterol levels. In a perfect world, we could observe what happens to a patient’s cholesterol levels when we give him the drug and when we don’t. But we can never observe both outcomes in a world without split timelines and/or time machines. We call the two outcomes, under treatment and no treatment, “potential outcomes” or “counterfactuals.” The idea is that both outcomes have the potential to be true, but we can only ever observe one of them, so the other would be “counter to the fact of the truth.” Therefore, true, individual causal effects can never be calculated.

However, we can get at average causal effects across larger populations. Thinking of our drug example, we could enroll 1,000 people in a study and measure their baseline cholesterol levels. Then we could randomly assign 500 of those subjects to take the drug and assign the remaining 500 to a placebo drug. Follow up after a given period of time and measure everyone’s cholesterol levels again and find individual level changes from baseline. Then we could calculate the average change in each group and compare them. And since we randomized treatment assignment, we would have a good idea of the causal effect of the drug treatment on the outcome of cholesterol.
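To make the logic of randomization concrete, here is a small simulation sketch. Everything in it is made up for illustration: the true effect of -20, the noise level, and the function name `simulate_trial` are assumptions, not numbers or code from a real trial. It skips the baseline measurement and simulates each subject’s change directly.

```python
import random
import statistics

def simulate_trial(n=1000, true_effect=-20.0, seed=1):
    """Simulate a two-arm randomized trial; return the difference in mean change.

    All numbers are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    subjects = list(range(n))
    rng.shuffle(subjects)
    treated = set(subjects[: n // 2])  # randomly assign half to the drug

    drug_changes, placebo_changes = [], []
    for i in range(n):
        natural_change = rng.gauss(0.0, 10.0)  # change that would happen anyway
        if i in treated:
            drug_changes.append(natural_change + true_effect)
        else:
            placebo_changes.append(natural_change)

    # Randomization balances the two arms on average, so the difference in
    # mean change estimates the average causal effect.
    return statistics.fmean(drug_changes) - statistics.fmean(placebo_changes)

estimate = simulate_trial()
print(f"Estimated average effect: {estimate:.1f} (simulated truth: -20.0)")
```

No individual’s causal effect is ever observed in the simulation, yet the group comparison recovers something close to the average effect that was built in.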

However, we can’t always randomize. In fact, it’s rare that we can. Many things can prevent a well-controlled randomized trial. For example, if we wanted to test whether or not smoking causes cancer, it would be unethical to randomly assign 500 subjects to smoke for the rest of their lives and 500 subjects not to smoke and see who develops cancer. Similarly, if we want to know whether or not a catch & shoot (C&S) shot in basketball has a positive effect on FG%, we can’t randomly assign shots to be C&S vs pull-up.

In general, basic statistics and parametric models only give correlations. When you fit a basic regression, the coefficients do not have causal interpretations. This is not to say that regressions and correlations aren’t interesting and useful – if you’ve been following my series on NBA fouls, you’ll note that I don’t do causal inference there. Associations are interesting in their own right. But today, let’s start looking at causal effects for C&S shots in the NBA.

Side note: If you’d like to read more about causal inference in the world of sports statistics, this recent post from Mike Lopez is great.

Analysis

We’re going to be fitting models, so the first thing we’ll do is remove outlier shots. In this case I’m going to subset and only look at shots that were less than 30 feet (so between 10 and 30 feet). All the previous analysis was basic and non-parametric, so including rare shots (like buzzer beater half court heaves) wouldn’t have a large impact. However, now we are using models and making parametric assumptions, so outliers and influential points will have a bigger impact. It makes sense to remove them at this point.

Statistics side note: Part of the reason we need to remove outliers is that our analysis method will involve fitting “propensity scores” – the probability of treatment given the non-outcome variables. In this case we will model the probability that a shot is a C&S given shot distance, defender distance, whether a shot was open, and the shot clock. For unusual outlier shots, the distance will often be abnormally long and it will be rare that the shot is successful. Thus if we left those shots in the data set, we would run into positivity problems.
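As a rough illustration of a propensity score and a positivity check, the sketch below computes the share of shots that are C&S within the open and defended groups, using the counts reported in Part 1 of this series. This is not the model used in this post, which also includes shot distance, defender distance, and the shot clock. Conditioning on one binary variable, the helper names, and the 0.01 cutoff are all assumptions made to keep the example tiny.

```python
# Shot counts from Part 1 of this series, by shot type and openness.
counts = {
    "C&S": {"defended": 4049, "open": 17868},
    "Pull-Up": {"defended": 7468, "open": 8999},
}

def propensity_by_stratum(counts):
    """Return P(shot is C&S | stratum) for each stratum."""
    scores = {}
    for stratum in ("defended", "open"):
        n_cs = counts["C&S"][stratum]
        n_total = n_cs + counts["Pull-Up"][stratum]
        scores[stratum] = n_cs / n_total
    return scores

def violates_positivity(scores, eps=0.01):
    """List strata where nearly all shots are one type (eps is an arbitrary cutoff)."""
    return [s for s, p in scores.items() if p < eps or p > 1 - eps]

propensity = propensity_by_stratum(counts)
for stratum, p in propensity.items():
    print(f"P(C&S | {stratum}) = {p:.3f}")
print("Strata with positivity problems:", violates_positivity(propensity))
```

Both strata contain plenty of each shot type, so this coarse version raises no positivity flag. The concern in the side note is about much finer strata, such as very long shots.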

Also, we aren’t going to look at the average causal effect of a C&S. That estimate assumes that all shots could be a C&S or a pull-up, and the player chose one or the other based on some variables. Most pull-up shots couldn’t have been a C&S, even if the player wanted them to be. But most C&S shots could have been pull-up shots. The player could have elected to hold the ball for longer and dribble a few times before taking the shot. Or even driven to the basket. Of course things are more nuanced. For example, some players (like my beloved Shaun Livingston) will almost never take a 3 point shot, C&S or pull-up. While others (like his teammate Klay Thompson) are loath to pass up a chance to take a C&S. Therefore looking at an average causal effect is not a great strategy. Instead we will look at the effect of a C&S on shots that were C&S shots. In the literature this is called the effect of treatment on the treated (ETT) or sometimes the average treatment effect on the treated (ATT). I use ETT, partially because that is what I was taught and partially because my advisor’s initials are ETT and it makes me giggle.
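One way to see what the ETT targets: compare C&S and pull-up shots within each level of a confounder, then average those differences over the confounder distribution of the C&S shots only. The sketch below does this for a single confounder (open vs defended), using the C&S counts from Part 1 and the EFG% by openness table from Part 2. It is only a toy standardization, not the estimator behind the results in this post, which adjusts for more variables and looks at FG% for twos and threes separately. The function name is hypothetical.

```python
# Number of C&S shots by openness (counts table in Part 1).
cs_counts = {"open": 17868, "defended": 4049}

# EFG% by shot type and openness (table in Part 2).
efg = {
    "C&S": {"open": 0.533, "defended": 0.459},
    "Pull-Up": {"open": 0.414, "defended": 0.366},
}

def standardized_ett(treated_counts, outcome):
    """Average the within-stratum differences over the treated (C&S) shots."""
    n_treated = sum(treated_counts.values())
    effect = 0.0
    for stratum, n in treated_counts.items():
        diff = outcome["C&S"][stratum] - outcome["Pull-Up"][stratum]
        effect += (n / n_treated) * diff
    return effect

raw_gap = 0.520 - 0.392  # overall C&S EFG% minus overall pull-up EFG%
adjusted_gap = standardized_ett(cs_counts, efg)
print(f"Raw gap: {raw_gap:.3f}, openness-adjusted gap: {adjusted_gap:.3f}")
```

On these numbers the adjusted gap comes out somewhat smaller than the raw gap, because C&S shots are open more often than pull-ups are.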

For those interested, here is a quick primer on the math behind ETT.

Results

Below we see the effect of C&S on FG% on C&S shots. This effect is calculated for threes and twos. The estimates are computed with functionals that are functions of the observed data; they are not coefficients in a regression. As such, I calculated means, standard errors, and confidence intervals by repeatedly bootstrapping the data (250 times) and using weights randomly drawn from an exponential distribution with mean 1. This is a way to bootstrap the data without having to draw shots with replacement and uses every shot in every resample.
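The weighting scheme described above can be sketched as follows. This is a minimal illustration, not the code behind the results: the statistic is a plain weighted FG% standing in for the ETT functional, the data are synthetic, and the normal-approximation confidence interval is an assumption about how the interval might be formed.

```python
import random
import statistics

def weighted_mean(values, weights):
    """Weighted average; with 0/1 outcomes this is a weighted FG%."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def exponential_weight_bootstrap(values, statistic, n_reps=250, seed=0):
    """Re-estimate `statistic` n_reps times under random Exponential(mean 1) weights."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        # Every shot gets its own random weight, so no shot is left out
        # the way it can be when drawing with replacement.
        weights = [rng.expovariate(1.0) for _ in values]
        estimates.append(statistic(values, weights))
    mean = statistics.fmean(estimates)
    se = statistics.stdev(estimates)
    # Normal-approximation 95% interval (an assumption; percentiles are another option).
    return mean, se, (mean - 1.96 * se, mean + 1.96 * se)

# Synthetic data: 400 makes and 600 misses (hypothetical, FG% = 40%).
shots = [1] * 400 + [0] * 600
boot_mean, boot_se, boot_ci = exponential_weight_bootstrap(shots, weighted_mean)
print(f"mean={boot_mean:.3f}, se={boot_se:.4f}, "
      f"95% CI=({boot_ci[0]:.3f}, {boot_ci[1]:.3f})")
```

Any estimator that accepts observation weights can be passed in as `statistic`, which is what makes this approach convenient for functionals that are not regression coefficients.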

All estimates control for the following potential confounders: shot distance, defender distance, an indicator for whether a shot was open, and the time left on the shot clock. I do not think this is a rich enough set of variables to fully control for confounding, but it’s a decent start.

[Figure: ETT results]

We can see that not dribbling and possessing the ball for 2 seconds or less (the catch & shoot definition) does have a significant effect on FG%. The effect size is small but positive, about 0.04 for both two and three point shots. This means that the success probability of a C&S three point attempt is about 4 percentage points higher than if that shot were taken as a pull-up. This is a very small causal effect.

To me, this shows that the effect of a C&S isn’t as big as the raw numbers suggest. I calculated a few other measures of causal effects, both for ETT and ATE, but found nothing significant. I’m certain that the modeling assumptions required are not fully met, which may be biasing results towards the null. Were I to move forward on this project, I would dig deeper into the models I’m using and try to get a better understanding of the best way to model and predict both why a player elects to take a certain type of shot and what makes a shot more likely to be successful.

When I set out on this project, I was mostly just upset about the definition of a catch & shoot, since it didn’t take openness into account. I like to think I’ve made my case. If I had to make a change, I’d want the NBA to track open C&S shots as a separate statistic. Maybe even split it out further into twos and threes, or at least emphasize EFG% over FG%. The actual causal effect of a C&S isn’t that big – I’d rather keep track of a stat that does have a big effect.

I’ll probably let this project go for a while. A lot of other people are looking at it and doing a good job of it. I can add the causal component, but I’d rather look at under-examined areas.

 

 

Catch & Shoot – EFG%

This is Part 2 of my series on Catch & Shoot jumpers
Part 1 can be found here

Last time, we ended by looking at a basic logistic regression predicting success of a shot, conditioned on whether a shot was: a catch & shoot (C&S), a three point attempt, and open. This time we will start considering effective field goal percentage (EFG%), which gives an additional bonus to three point shots.

For anybody unaware of the difference between FG% and EFG%, here is the brief but informative definition from basketball reference:

“Effective Field Goal Percentage; the formula is (FG + 0.5 * 3P) / FGA. This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal. For example, suppose Player A goes 4 for 10 with 2 threes, while Player B goes 5 for 10 with 0 threes. Each player would have 10 points from field goals, and thus would have the same effective field goal percentage (50%).”
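For readers who prefer code to formulas, here is a short sketch of that definition, checked against the Player A and Player B example in the quote. The function and argument names are my own.

```python
def effective_fg_pct(fg_made, threes_made, fg_attempted):
    """EFG% = (FG + 0.5 * 3P) / FGA, per the Basketball Reference definition."""
    if fg_attempted <= 0:
        raise ValueError("fg_attempted must be positive")
    return (fg_made + 0.5 * threes_made) / fg_attempted

# Player A: 4 for 10 with 2 threes. Player B: 5 for 10 with 0 threes.
player_a = effective_fg_pct(fg_made=4, threes_made=2, fg_attempted=10)
player_b = effective_fg_pct(fg_made=5, threes_made=0, fg_attempted=10)
print(player_a, player_b)  # both 0.5
```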

Let’s start our investigation into EFG% by comparing EFG% to FG% for C&S vs pull-up jumpers split out between all shots and just 3 point shots.

FG%       All shots   3 point   2 point
C&S       39.3%       37.4%     43.3%
Pull-Up   34.9%       29.9%     37.0%

EFG%      All shots   3 point   2 point
C&S       52.0%       56.1%     43.3%
Pull-Up   39.2%       44.8%     37.0%

By using EFG% instead of FG% it becomes much clearer that C&S is a better shot than a pull-up jumper.

We could also split the data by whether or not these shots were open (as we first saw in part 1).

EFG%      All shots   Open    Defended
C&S       52.0%       53.3%   45.9%
Pull-Up   39.2%       41.4%   36.6%

We see that, of course, open shots are better than defended shots. However we can also see that using EFG% shows that a defended C&S is better than an open pull-up. Even without seeing the raw numbers, we suspect these results come from a large number of C&S shots being 3-point attempts.

So we could stratify further and look at C&S vs 3-point vs openness. And while it would be easy to make a number of stratified 2×2 tables, at a certain point it makes more sense to just use a model and account for as many variables as possible that could affect FG% or EFG%. Which is not to say that examining raw percentages is a bad idea. After all, tables are a simple way to compare different kinds of shots, and since we have a large number of shots, we won’t really run into any sparsity problems.

But I don’t want to spend too long just looking at basic statistics. So, let’s continue down our previous path of looking at a simple regression to predict shot success, and see how we can improve it. However, we quickly run into two potential problems.

The first problem is one we touched on previously – looking at confounders. We want to understand variables that affect whether a shot is successful and that affect a player’s decision to take a specific type of shot. Last time we looked at defender distance as a potential confounder. This time we will also consider the shot clock. If there are only a few seconds left on the clock, a player may not have time to drive to the basket, and will have to just shoot. For future analyses, I’d want to explore other variables that are potential confounders such as game time remaining, the score, and who the closest defender is. But for now, let’s keep things relatively simple.

The second problem we will face is more complicated – how do we model EFG%? Modeling FG% is easy because our outcome is binary: a shot is successful or not. Logistic regression requires a binary outcome, so we can’t just give successful 3 point shots an outcome of 1.5. Most statistics software will allow us to use weights in a quasibinomial framework, but I can’t think of a good way to use weights to get at EFG%. Weights are used to create pseudo-populations that up-weight or down-weight certain shots depending on how representative they are. The problem with giving a successful 3 point shot a weight of 1.5 is that it doesn’t make the outcome 1.5; rather, it increases the representation of the characteristics of that shot.

If anybody has a way to examine EFG% using a weighted regression, please let me know. I only spent a few days thinking about this and while I have a workaround, I would love to be able to show this analysis using just a simple regression framework. But I cannot, for the life of me, think of a way to do it. I tried for a while to reframe the problem by using functionals instead of trying to target a regression parameter, but I still don’t think it works.

So what is my workaround? Don’t look at EFG%. Instead split out 3 point shots and 2 point shots and examine them separately. 3 point shots and 2 point shots are different enough that trying to pool them into a single population will obscure the differences and lead to analytical problems. Especially since it may be naive to assume a constant treatment effect of C&S for both 2-point and 3-point shots. We could also split out the two kinds of shots and instead look at the expected number of points per shot. Stephen Shea has touched on this, which makes me think it is a good avenue for further investigation.
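To show what the points-per-shot view looks like, here is a quick sketch using the raw FG% numbers from the table earlier in this post. It is simple arithmetic on unadjusted percentages, not a model, and the names are my own.

```python
# FG% by shot type and shot value, from the FG% table earlier in this post.
fg_pct = {
    ("C&S", 3): 0.374,
    ("C&S", 2): 0.433,
    ("Pull-Up", 3): 0.299,
    ("Pull-Up", 2): 0.370,
}

def expected_points(fg, shot_value):
    """Expected points per attempt = make probability * points for a make."""
    return fg * shot_value

for (shot_type, value), fg in fg_pct.items():
    pts = expected_points(fg, value)
    # Within a single shot value, EFG% is just points per shot divided by 2.
    print(f"{shot_type} {value}pt: {pts:.3f} points per shot (EFG% = {pts / 2:.1%})")
```

Within a fixed shot value, points per shot and EFG% carry the same information, so splitting by shot value loses nothing relative to EFG%.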

On a more philosophical level, there always seems to be this strong desire to collapse everything down to a single number. We see this a lot when we try to invent statistics that fully capture how good a player is with one number. And while I agree there is value in a single statistic, I also think there is value in nuance and increased granularity. My goal is to examine C&S shots, and there is no harm in splitting that out by shot value.

But I freely admit that I may be missing something obvious and there is an easy way to use EFG%. Again, if you have any ideas, please let me know.

Next time in this series, we will dive into causal effects of catch & shoot vs pull-up jumpers.

Catch & Shoot – Basics

I’ve been interested in catch & shoot (C&S) jump shots for a while now, pretty much ever since I read this article by Stephen Shea. There is this idea that a C&S is better than a pull-up shot. And based on everything I’ve read and analyzed, this holds true. This led me to ask “what is it about a C&S that makes it such a superior shot?”

To answer this question, I started with the NBA definition of catch & shoot, namely “Any jump shot outside of 10 feet where a player possessed the ball for 2 seconds or less and took no dribbles.” Let’s break this definition down a little.
– A shot past 10 feet is further from the basket, which I would think decreases the likelihood of success. However we can assume that when comparing to pull-up jumpers, the comparison group is past 10 feet as well. When the time comes to analyze data, we will just have to restrict to shots past 10 feet. No problem.
– Taking no dribbles means a player doesn’t need to take time to collect the ball or fight the momentum of movement. It makes sense that not having to dribble would increase the chance of making a basket.
– On the surface, possessing the ball for 2 seconds or less seems like it would give a player less time to set himself and shoot. Therefore I would assume short possession time would decrease the probability of success. However, a shorter possession means that the defense has less time to get in position as well. And this is where I get irked about the definition of catch & shoot.

When I began to look at C&S vs pull-up jumpers I hypothesized that the effect of the defense was confounding the effect of a C&S. I still thought a C&S shot would be better than a pull-up, but without controlling for defense, I was skeptical of how much better.

Let’s take a step back and define some terms. If we think of a C&S as a “treatment” (my education and training are in public health so I often default to medical terms) compared to a “control” shot being a pull-up jumper, and shot success as the outcome, then our goal is to examine the treatment effect of a C&S on the outcome. We can also say that the type of shot is the independent variable, and success of the shot is the dependent variable. But shots are not randomized to be C&S or pull-up, so just looking at raw numbers won’t necessarily give us the full picture. A “confounder” is any variable that affects both the treatment and the outcome. Defense is very likely a confounder for any shot, as a player is more likely to take and more likely to make a wide open shot. Similarly, he is less likely to take and less likely to make a highly contested shot.

The NBA has a really great statistics site, http://stats.nba.com/. However, it doesn’t get to the granularity that I want. Thankfully www.nbasavant.com does. I went and pulled 50,000 shots from the 2014-2015 season. I chose that season because starting in 2016, defender data isn’t available. I also restricted to players who took at least 100 jump shots. I ended up with 50,000 shots (I assume this size is preset), which I then further restricted to 38,384 jump shots that were over 10 feet.

Of these 38,384 shots 21,917 had zero dribbles and a possession of 2 seconds or less and thus were defined as a catch & shoot. The remaining 16,467 shots were labeled as pull-up shots.

One more definition, for the purposes of this analysis (and future analyses), I consider any shot where the closest defender is more than 4 feet away to be an “open” shot. All other shots are considered “defended.”
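Put together, the labeling rules look something like the sketch below. The `Shot` fields and function names are hypothetical and are not the column names of the actual data pull. The thresholds are the ones stated above.

```python
from dataclasses import dataclass

@dataclass
class Shot:
    # Hypothetical field names, chosen for illustration only.
    distance_ft: float
    touch_time_s: float
    dribbles: int
    defender_distance_ft: float

def shot_type(shot):
    """Label a jump shot past 10 feet as 'C&S' or 'Pull-Up'; None if 10 feet or closer."""
    if shot.distance_ft <= 10:
        return None  # excluded from the analysis
    if shot.dribbles == 0 and shot.touch_time_s <= 2:
        return "C&S"
    return "Pull-Up"

def openness(shot):
    """'open' if the closest defender is more than 4 feet away, else 'defended'."""
    return "open" if shot.defender_distance_ft > 4 else "defended"

example = Shot(distance_ft=24.0, touch_time_s=1.1, dribbles=0, defender_distance_ft=6.5)
print(shot_type(example), openness(example))  # C&S open
```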

Here are some basic statistics:

Counts    Defended   Open
C&S       4049       17868
Pull-Up   7468       8999

Percent   Defended   Open
C&S       10.54%     46.55%
Pull-Up   19.46%     23.44%

The single most common kind of shot is an open C&S (46.55% of all shots). The majority of C&S shots are open (more than 80%), and the majority of open shots are C&S (about two-thirds).

I could split this out further and look at 2pt shots vs 3 pt shots, but let’s skip that for now and put it into a simple logistic regression modeling the probability a shot is successful. We can include an indicator for whether or not a shot was a 3pt attempt:

Term        Estimate   Std. Error   z value   Pr(>|z|)
Intercept   -0.5467    0.0175       -31.23    0.0000
C&S          0.2955    0.0233        12.70    0.0000
3pt         -0.2727    0.0230       -11.88    0.0000

Now let’s control for whether a shot was open or not:

Term        Estimate   Std. Error   z value   Pr(>|z|)
Intercept   -0.6305    0.0217       -29.06    0.0000
C&S          0.2597    0.0239        10.88    0.0000
open         0.1626    0.0245         6.62    0.0000
3pt         -0.2927    0.0232       -12.64    0.0000

Side note: I try not to pay too much attention to p-values, preferring to focus on effect sizes.

We see that C&S does increase the chance of success, though the effect is mitigated when we also control for whether or not a shot was open. I also fit the models with various interactions, but they had little impact (in fact the interaction between C&S and open was negative, though small).
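Since log odds are hard to read directly, here is a small sketch that converts the coefficients of the second model above into an odds ratio and into predicted probabilities. It only re-expresses the fitted numbers already shown; the function and key names are my own.

```python
import math

# Coefficients from the second model above (log odds scale).
coef = {"intercept": -0.6305, "cs": 0.2597, "open": 0.1626, "three": -0.2927}

def predicted_probability(cs, is_open, three):
    """Inverse-logit of the linear predictor; arguments are 0/1 indicators."""
    logit = (coef["intercept"] + coef["cs"] * cs
             + coef["open"] * is_open + coef["three"] * three)
    return 1.0 / (1.0 + math.exp(-logit))

odds_ratio_cs = math.exp(coef["cs"])
p_cs = predicted_probability(cs=1, is_open=1, three=1)
p_pullup = predicted_probability(cs=0, is_open=1, three=1)
print(f"Odds ratio for C&S: {odds_ratio_cs:.2f}")
print(f"Open three, C&S: {p_cs:.3f}; open three, pull-up: {p_pullup:.3f}")
```

Because the model has no interaction terms, the odds ratio for C&S is the same for every kind of shot, while the difference in predicted probabilities depends on the other indicators.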

This is why I am always a little irked by the definition of a catch & shoot – it doesn’t account for how open the player is. A player with a high C&S FG% who mostly takes open shots should be evaluated differently than a player with a similarly high C&S FG% who takes mostly defended shots. Yet raw C&S FG% or C&S EFG% obscures this difference.

We are just getting started here. After all, so far I’ve only presented simple regressions with some (confusing and hard to interpret) log odds ratio coefficients. In future posts I will get into how we can estimate the causal effect of a catch & shoot not only for raw field goal percentage, but also for effective field goal percentage, which gives a bonus to three point shots.