Russell Westbrook and Assists

Introduction

I was going to flesh this idea out and refine it for a proper paper/poster for NESSIS, but since I have to be in a wedding that weekend (sigh), here are my current raw thoughts on Russell Westbrook. I figured it was best to get these ideas out now …  before I become all consumed by The Finals.

I’ve been thinking a lot about Russell Westbrook and his historic triple-double season. Partially I’ve been thinking about how arbitrary the number 10 is, and how setting 10 to be a significant cutoff is similar to setting 0.05 as a p-value cutoff. But also I have been thinking about stat padding. It’s been pretty clear that Westbrook’s teammates would let him get rebounds, but there’s also been a bit of a debate about how he accrues assists. The idea being that once he gets to 10, he stops trying to get assists. Now this could mean that he passes less, or his teammates don’t shoot as much, or whatever. I’m not concerned with the mechanism, just the timing. For now.

I’ll be examining play-by-play data and box-score data from the NBA for the 2016-2017 season. This data is publicly available from http://www.nba.com. The play-by-play contains rich event data for each game. The box-score includes data for which players started the game, and which players were on the court at the start of a quarter. Data, in csv format, can be found here.

Let’s look at the time to assist for every assist Westbrook gets and see if it significantly changes for assists 1-10 vs 11+. I thought about looking at every assist by number and doing a survival analysis, but soon ran into problems with sparsity and granularity. Westbrook had games with up to 22 assists, so trying to look at them individually got cumbersome. Instead I decided to group assists as follows: 1-7, 8-10 and 11+. I reasoned that Westbrook’s accrual rate for the first several assists would follow one pattern, which would then increase as he approached 10, and then taper off for assists 11+.

I freely admit that may not be the best strategy and am open to suggestions.

I also split out which games I would examine into 3 groups: all games, games where he got at least 11 assists, and games where he got between 11 and 17 assists. This was to try to account for right censoring from the end of the game. In other words, when we look at all games, we include games where he only got, say, 7 assists, and therefore we cannot hope to observe the difference in time to assist 8 vs assist 12. Choosing to cut at 17 assists was arbitrary and I am open to changing it to fewer or more.

Basic Stats

Our main metric of interest is the time between assists, i.e. how many seconds of player time (so time when Westbrook is on the floor) occur between assists.
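To make the bookkeeping concrete, here is a toy sketch in Python of how the gap times and assist groups might be computed. The `assists` frame and its columns are invented for illustration; they are not the actual data pull.

```python
import pandas as pd

# Hypothetical input: one row per Westbrook assist, with the cumulative
# seconds of on-court ("player") time at which each assist occurred.
assists = pd.DataFrame({
    "game_id":        [1, 1, 1, 1, 2, 2],
    "assist_num":     [1, 2, 3, 4, 1, 2],
    "player_seconds": [240, 410, 700, 1100, 180, 500],
})

# Time between assists = gap in player time since the previous assist
# (the first assist of a game is measured from zero).
assists = assists.sort_values(["game_id", "assist_num"])
assists["gap"] = (
    assists.groupby("game_id")["player_seconds"].diff()
    .fillna(assists["player_seconds"])
)

# Group assists 1-7, 8-10, and 11+ as described above.
assists["group"] = pd.cut(
    assists["assist_num"],
    bins=[0, 7, 10, float("inf")],
    labels=["1-7", "8-10", "11+"],
)

print(assists.groupby("group", observed=True)["gap"].agg(["mean", "median", "std"]))
```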

First, let us take a look at some basic statistics, where we examine the mean, median, and standard deviation for the time to assist broken down by group and by the different sets of games. Again, this is in seconds of player time.

[Table 1: mean, median, and standard deviation of time between assists, by assist group and set of games]

We can see that if we look at all games, it appears that the time between assists goes down on average once Westbrook gets past 10 assists. However, this sample includes games where he got upwards of 22 assists, which, given the finite length of games, means assists would tend to happen more frequently. Limiting ourselves to games with at least 11 assists, or games with 11-17 assists, gives a view of a more typical game with many assists. We see in (1b) and (1c) that time to assist increases on average once Westbrook gets his 10th assist.

However, these basic statistics only account for assists that Westbrook actually achieved; they do not account for any right censoring. That is, say Westbrook gets 9 assists in the first half alone, and doesn’t record another assist all game despite playing, say, 20 minutes in the second half. If the game were to go on indefinitely, Westbrook would eventually record that 10th assist, say after 22 minutes. But since we never observe that hypothetical 10th assist, that contribution of 22 minutes isn’t included. Neither are the 20 assist-less minutes he actually played. This basic censoring problem is why we use survival models.

Visualization

Next we can plot Kaplan-Meier survival curves for Westbrook’s assists broken down by group and by the different sets of games. I used similar curves when looking at how players accrue personal fouls – and I’ll borrow my language from there:

A survival curve, in general, is used to map the length of time that elapses before an event occurs. Here, they give the probability that a player has “survived” to a certain time without recording an assist (grouped as explained above). These curves are useful for understanding how a player accrues assists while accounting for the total length of time during which a player is followed, and allows us to compare how different assists are accrued.

[Figure: Kaplan-Meier survival curves for time between assists, by assist group and set of games]

Here it is very easy to see that the time between assists increases significantly once Westbrook has 10 assists. This difference is apparent regardless of which subset of games we look at, though the increase is more pronounced when we ignore games with fewer than 11 assists. We can also see that the time between assists doesn’t differ significantly between the first 7 assists and assists 8 through 10.
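For reference, here is a minimal sketch of how curves like these can be produced with the lifelines library, continuing the toy frame from the earlier sketch and assuming an added `observed` flag for gaps censored by the end of the game:

```python
from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt

# `observed` is 1 when the next assist actually happened and 0 when the gap
# was censored by the end of the game; the toy rows above are all observed.
assists["observed"] = 1

fig, ax = plt.subplots()
kmf = KaplanMeierFitter()
for label, grp in assists.groupby("group", observed=True):
    kmf.fit(grp["gap"], event_observed=grp["observed"], label=str(label))
    kmf.plot_survival_function(ax=ax)
ax.set_xlabel("seconds of player time since previous assist")
ax.set_ylabel("probability of no assist yet")
plt.show()
```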

Survival Models

Finally we could put the data into a conditional risk set model for ordered events. I’m not sure this is the best model to use for this data structure, given that I grouped the assists, but it will do for now. I recommend not looking at the actual numbers and just noticing that yes, there is a significant difference between the baseline and the group of 11+ assists.

[Table: conditional risk set model estimates for Westbrook’s assists, by set of games]

If interested, we can find the hazard ratios associated with each assist group. To do so, we exponentiate the coefficients, since each coefficient is the log comparison with respect to the baseline of the 1st through 7th assists. For example, looking at the final column, we see that, in games where Westbrook had between 11 and 17 assists, he was 63% less likely to record an assist greater than 10 than he was to record one of his first 7 assists (the baseline group). Interpreting coefficients is very annoying at times. The takeaway here is yes, there is a statistically significant difference.
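To make the arithmetic explicit (the coefficient of roughly -1.0 below is backed out from the reported 63% figure, not read off the table):

e^{\hat{\beta}} \approx e^{-1.0} \approx 0.37 = 1 - 0.63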

Discussion

Based on some simple analysis, it appears that the time between Russell Westbrook’s assists increased once he reached 10 assists. This may contribute to the narrative that he stopped trying to get assists after he reached 10. Perhaps this is because he stopped passing, or perhaps it’s because his teammates just shot less effectively on would-be-assisted shots after 10. Additionally, there are many other factors that could contribute to the decline in assist rate. Perhaps there is general game fatigue, and assist rates drop off for all players. Maybe those games were particularly close in score and therefore Westbrook chose to take jump shots himself or drive to the basket.

What’s great is that a lot of these ideas can be explored using the data. We could look at the play-by-play data and see if Russ was passing at the same rate before and after assist number 10. We could test if assist rates decline overall in the NBA as games progress. I’m not sure which potential confounding explanations are worth running down at the moment. Please, please, please, let me know in the comments, via email, or on Twitter if you have any suggestions or ideas.

REMINDER: The above analysis is something I threw together in the days between my graduation celebrations and The Finals starting and isn’t as robust or detailed as I might like. Take with a handful of salt.

Catch & Shoot – Causal Effects

This is Part 3 of my series on Catch & Shoot jumpers
Part 1 can be found here
Part 2 can be found here

“You know, for having the name ‘Causal Kathy’ you don’t seem to post that much causal inference” – my beloved mentor giving me some solid life advice.

So alright, let’s talk about causal inference. First of all, what is causal inference? It’s a subfield of statistics that focuses on causation instead of just correlations. Think of this classic xkcd comic:

[xkcd comic: “Correlation”]

To give a more concrete example, let’s think about a clinical trial for some new drug and introduce some terminology along the way. Say we develop a drug that is supposed to lower cholesterol levels. In a perfect world, we could observe what happens to a patient’s cholesterol levels when we give him the drug and when we don’t. But we can never observe both outcomes in a world without split timelines and/or time machines. We call the two outcomes, under treatment and no treatment, “potential outcomes” or “counterfactuals.” The idea is that both outcomes have the potential to be true, but we can only ever observe one of them, so the other would be “counter to the fact of the truth.” Therefore, true, individual causal effects can never be calculated.

However, we can get at average causal effects across larger populations. Thinking of our drug example, we could enroll 1,000 people in a study and measure their baseline cholesterol levels. Then we could randomly assign 500 of those subjects to take the drug and the remaining 500 a placebo. After a given follow-up period, we would measure everyone’s cholesterol levels again and find individual-level changes from baseline. Then we could calculate the average change in each group and compare them. And since we randomized treatment assignment, we would have a good idea of the causal effect of the drug treatment on the outcome of cholesterol.
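In potential-outcome notation, with Y(1) and Y(0) the outcomes under drug and placebo and A the randomized assignment, randomization is exactly what lets a simple difference in group means identify the average causal effect:

\text{ATE} = E[Y(1)] - E[Y(0)] = E[Y \mid A=1] - E[Y \mid A=0]

The second equality holds because randomization makes A independent of the potential outcomes.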

However, we can’t always randomize. In fact, it’s rare that we can. There are many reasons a well-controlled randomized trial may be impossible. For example, if we wanted to test whether or not smoking causes cancer, it would be unethical to randomly assign 500 subjects to smoke for the rest of their lives and 500 subjects not to smoke and see who develops cancer. Similarly, if we want to know whether or not a catch & shoot (C&S) shot in basketball has a positive effect on FG%, we can’t randomly assign shots to be C&S vs pull-up.

In general, basic statistics and parametric models only give correlations. When you fit a basic regression, the coefficients do not have causal interpretations. This is not to say that regressions and correlations aren’t interesting and useful – if you’ve been following my series on NBA fouls, you’ll note that I don’t do causal inference there. Associations are interesting in their own right. But today, let’s start looking at causal effects for C&S shots in the NBA.

Side note: If you’d like to read more about causal inference in the world of sports statistics, this recent post from Mike Lopez is great.

Analysis

We’re going to be fitting models, so the first thing we’ll do is remove outlier shots. In this case I’m going to subset and only look at shots that were less than 30 feet (so between 10 and 30 feet). All the previous analysis was basic and non-parametric, so including rare shots (like buzzer beater half court heaves) wouldn’t have a large impact. Now, however, we are using models and making parametric assumptions, so outliers and influential points will have a bigger impact. It makes sense to remove them at this point.

Statistics side note: Part of the reason we need to remove outliers is because our analysis method will involve fitting “propensity scores” – the probability of treatment given the non-outcome variables. In this case we will model the probability that a shot is a C&S given shot distance, defender distance, whether a shot was open, and the shot clock. For unusual outlier shots, the distance will often be abnormally long and it will be rare that the shot is successful. Thus if we left those shots in the data set, we would run into positivity problems.
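As a rough illustration, here is a minimal propensity score sketch in Python. The data frame `shots` and its column names (`cands`, `shot_dist`, `def_dist`, `open_shot`, `shot_clock`) are hypothetical stand-ins, not the actual variable names from the data pull.

```python
import statsmodels.formula.api as smf

def fit_propensity(shots):
    """Fit P(C&S = 1 | covariates) for shots between 10 and 30 feet.

    `shots` is a hypothetical pandas DataFrame with a 0/1 column `cands`
    (catch & shoot) and covariates shot_dist, def_dist, open_shot, shot_clock.
    """
    model = smf.logit(
        "cands ~ shot_dist + def_dist + open_shot + shot_clock", data=shots
    ).fit()
    out = shots.assign(pscore=model.predict(shots))
    # Positivity check: propensities piled up near 0 or 1 flag the kind of
    # outlier shots that motivated trimming the data above.
    print(out["pscore"].describe())
    return out
```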

Also, we aren’t going to look at the average causal effect of a C&S. That estimate assumes that all shots could be a C&S or a pull-up, and the player chose one or the other based on some variables. Most pull-up shots couldn’t have been a C&S, even if the player wanted them to be. But most C&S shots could have been pull-up shots. The player could have elected to hold the ball for longer and dribble a few times before taking the shot. Or even driven to the basket. Of course things are more nuanced. For example, some players (like my beloved Shaun Livingston) will almost never take a 3 point shot, C&S or pull-up, while others (like his teammate Klay Thompson) are loath to pass up a chance to take a C&S. Therefore looking at an average causal effect is not a great strategy. Instead we will look at the effect of a C&S on shots that were C&S shots. In the literature this is called the effect of treatment on the treated (ETT) or sometimes the average treatment effect on the treated (ATT). I use ETT, partially because that is what I was taught and partially because my advisor’s initials are ETT and it makes me giggle.

For those interested, here is a quick primer on the math behind ETT.
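For those who don’t follow the link, the estimand in one line: with A the C&S indicator, Y shot success, and L the measured confounders, the ETT and its usual identifying formula (under no unmeasured confounding given L, plus positivity) are

\text{ETT} = E[Y(1) - Y(0) \mid A=1] = E[Y \mid A=1] - E\big[E[Y \mid A=0, L] \mid A=1\big]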

Results

Below we see the effect of C&S on FG% on C&S shots. This effect is calculated for threes and twos. The estimates are computed with functionals of the observed data; they are not coefficients in a regression. As such, I calculated means, standard errors, and confidence intervals by repeatedly bootstrapping the data (250 times) using weights randomly drawn from an exponential distribution with mean 1. This is a way to bootstrap the data without having to draw shots with replacement, and it uses every shot in every resample.
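A minimal sketch of that weighted bootstrap, assuming a hypothetical `estimator(shots, weights)` function that returns the weighted ETT estimate:

```python
import numpy as np

def bootstrap_ett(shots, estimator, n_boot=250, seed=0):
    """Weighted bootstrap: every shot appears in every resample, but with a
    random Exp(1) weight instead of being drawn with replacement.

    `estimator(shots, weights)` is a hypothetical function returning the
    weighted ETT estimate (e.g. from weighted outcome/propensity models).
    """
    rng = np.random.default_rng(seed)
    stats = np.array([
        estimator(shots, rng.exponential(scale=1.0, size=len(shots)))
        for _ in range(n_boot)
    ])
    # Mean, standard error, and a percentile 95% confidence interval.
    return stats.mean(), stats.std(), np.percentile(stats, [2.5, 97.5])
```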

All estimates control for the following potential confounders: shot distance, defender distance, an indicator for whether a shot was open, and the time left on the shot clock. I do not think this is a rich enough set of variables to fully control for confounding, but it’s a decent start.

[Table: estimated ETT of a C&S on FG%, for 2-point and 3-point shots]

We can see that not dribbling and possessing the ball for less than 2 seconds (the catch & shoot definition) does affect FG%, for both 2 and 3 point shots. The effect size is small but positive, about 0.04. This means that a C&S attempt is about 4 percentage points more likely to be successful than if that same shot were taken as a pull-up. This is a very small causal effect.

To me, this shows that the effect of a C&S isn’t as big as the raw numbers suggest. I calculated a few other measures of causal effects, both for ETT and ATE, but found nothing significant. I’m certain that the modeling assumptions required are not fully met, which may be biasing results towards the null. Were I to move forward on this project, I would dig deeper into the models I’m using and try to get a better understanding of the best way to model and predict both why a player elects to take a certain type of shot and what makes a shot more likely to be successful.

When I set out on this project, I was mostly just upset about the definition of a catch & shoot, since it didn’t take openness into account. I like to think I’ve made my case. If I had to make a change, I’d want the NBA to track open C&S shots as a separate statistic. Maybe even split it out further into twos and threes, or at least emphasize EFG% over FG%. The actual causal effect of a C&S isn’t that big – I’d rather keep track of a stat that does have a big effect.

I’ll probably let this project go for a while. A lot of other people are looking at it and doing a good job of it. I can add the causal component, but I’d rather look at under-examined areas.


NBA Fouls – Substitutions and Discussion

This is part 4 of my series on DeMarcus Cousins and how NBA players accrue personal fouls.
Part 3 can be found here.
Part 2 can be found here.
Part 1 can be found here.

I strongly recommend reading parts 2 and 3 before continuing as this series builds on the past.

Substitutions

A natural question that arises from our previous analysis is to question if anything can be done to prevent a player from “tilting.” We now show that making quick substitutions can change how a player accrues fouls and reduce his “tilt.” We define a quick substitution (QS) as a substitution that occurs within 30 seconds of a personal foul. While this definition may capture substitutions that are not a reaction to the player committing a foul, we believe it is adequate for the purposes of this paper. Fouls are then classified as happening before or after the QS. As a result, games without a QS will classify every foul as happening before a hypothetical QS, which may never be observed. Furthermore, for ease of analysis, we only consider the first time a player has a QS, despite the possibility of it happening more than once per game.
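Here is a sketch of how one might flag fouls as before or after the first QS. The inputs (per player-game frames `fouls_g` and `subs_g` with a `time` column in seconds) are hypothetical, not the actual pipeline.

```python
import pandas as pd

def flag_quick_sub(fouls_g: pd.DataFrame, subs_g: pd.DataFrame,
                   window: float = 30.0) -> pd.DataFrame:
    """Flag one player-game's fouls as before/after the first quick sub.

    `fouls_g` and `subs_g` hold that player's fouls and substitutions (out)
    for a single game, each with a `time` column in seconds of game time.
    """
    qs_time = None
    for t in fouls_g["time"]:
        hits = subs_g[(subs_g["time"] >= t) & (subs_g["time"] <= t + window)]
        if len(hits):
            qs_time = hits["time"].iloc[0]  # only the first QS is used
            break
    out = fouls_g.copy()
    # With no QS, every foul counts as "before" a hypothetical, never-observed QS.
    out["after_qs"] = False if qs_time is None else out["time"] > qs_time
    return out
```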

Table 4 gives the output for a survival analysis that includes an indicator for being before or after a quick substitution in the conditional risk set model for DeMarcus Cousins, Al Horford, Robin Lopez, and all centers pooled. The coefficient on QS is negative for every player examined, indicating that a quick substitution is associated with a lower chance that a player will foul at any time after the substitution, though the difference is not always significant. Quick substitutions seem to be associated with a reduction in Al Horford’s foul tendencies, though not significantly, and the effect size is smaller for Horford than for Cousins. The players still have significant positive coefficients for later fouls, indicating that while they may still “tilt,” the QS may mitigate some of it.

Table 4: Conditional risk set model for ordered events for Cousins, Horford, Lopez, and all centers pooled, including an indicator for whether a foul was after a quick substitution (QS). For Cousins, a QS multiplies the foul hazard by e^{-0.58}\approx 0.56, roughly a 44\% reduction.

Focusing on Cousins, Table 5 displays the survival model output for all fouls before a QS and after a QS side-by-side to facilitate comparison. The analysis of fouls before a quick substitution shows a significant increase in the chance that he commits a foul once he has 3 or 4 fouls. However, after the substitution, the coefficients are smaller, indicating that he is no longer as “tilted.” We visualize this change in foul behavior in Figures 3a and 3b which show the survival curves before and after a quick substitution. Cousins’s foul tendencies prior to a QS (Figure 3a) are similar to those seen across a whole game (Figure 2a). However, after a QS (Figure 3b), there is much less of a stark contrast. He does appear to commit his 4th fouls faster than his 3rd, but not as significantly as before the QS.

Table 5: Conditional risk set model for ordered events for Cousins before and after a quick substitution (QS). All coefficients are decreased, indicating that a QS could potentially mitigate his aggressive fouling tendencies.
Figure 3: Survival curves for the first 4 fouls for Cousins, before and after a quick substitution (QS), when he commits a minimum of 5 fouls. Prior to the QS, Cousins has clear ordering, but it is disrupted after.

Al Horford, by contrast, does not seem to be significantly affected by a QS, though throughout we have seen that Horford does not seem to “tilt” as much as other centers in general. Figures 4a and 4b show Horford’s survival curves before and after a quick substitution. While there may be some distinction between the fouls before a QS, it is not as extreme as seen with Cousins, and there is certainly little order after a QS. Al Horford simply does not foul, “tilt,” or get affected by quick substitutions as much as other centers.

Figure 4: Survival curves for the first 4 fouls for Horford, before and after a quick substitution (QS), when he commits a minimum of 5 fouls. Horford shows no strong trends before or after a QS.

Discussion – Further Research

While we focused only on centers for this research, the methods used here can easily be applied to all NBA players to identify those who “tilt.” Expanding the number of players analyzed would allow for greater understanding of how different players and positions accrue fouls.

In addition to looking at quick substitutions, it would be interesting to note other events which may reduce the effect of a “tilting” player, particularly other stoppages of play like timeouts or breaks in a period. We chose to look at substitutions shortly after a foul in the hopes of best capturing a direct coaching reaction to the foul. A timeout following shortly after a foul may also reflect a direct reaction to the foul and is a clear avenue for further analysis. Furthermore, while we only considered personal fouls in this study, it would be interesting to note how technical fouls play a role in “tilting” players. Technical fouls are especially interesting since they are rarely a part of strategy in the way a normal personal foul can be. Our overall aim is to examine players who are considered by many to be emotional, so how these players, or their teammates, accumulate technical fouls may have an impact on their foul rates and overall “tilt.”

Additionally, we only adjusted for time and score, but there are many other factors that could be included, such as the player being guarded (a player may be more likely to “tilt” against players who tend to play more aggressively or are known trash talkers) or the rate at which the player of interest draws fouls (players may become more upset if they feel they are not receiving foul calls on their behalf).

Finally, we did not do any causal inference. Any effects we see are just associations. Proper causal inference analysis is a clear area for further research.

Conclusion

In this analysis, we used a survival model for fouls to show that fouling rates are not always independent of the number of fouls a player has accumulated. Emotional players, such as DeMarcus Cousins, often “tilt”, increasing the likelihood of committing another foul as they accrue more fouls. Our analysis also indicates that quickly substituting a player could influence an emotional player’s foul rate, reducing the likelihood of them picking up another foul.

We cannot say for certain precisely why a quick substitution has an effect. It could be that taking a player out of the game gives him time to calm down and become level-headed. However, it may also be related to the common strategy of attacking a player who is in “foul trouble”, often defined as approaching 3 fouls by halftime or 6 by the end of the game. Before the player is substituted, he may be in “foul trouble”, causing the opposing team to attempt to draw a foul against him. After a QS, when the player returns to the game, there is less incentive to attack him since, due to the passage of game time, he is no longer in “foul trouble”. It may well be that a QS is simply a good indicator that a player will not be attacked. This hypothesis certainly merits further investigation.

While the scope of this paper is somewhat limited, we hope it will encourage others to explore the process by which players accrue fouls. We believe that further research in this area will reveal new insights into how players can remain effective throughout the game, especially if something as simple as a coach making a quick substitution can have such a significant impact. It may not be easy to stop “tilting” entirely, but there are ways to mitigate the effects.


NBA Fouls – Survival Analysis

This is Part 3 of my series on DeMarcus Cousins and how NBA players accrue personal fouls.
Part 2 can be found here.
Part 1 can be found here.
I strongly recommend reading at least Part 2 before continuing as I reference it.

Survival Analysis

To provide more statistical rigor, we analyze our players using a conditional risk set model for ordered events. This model, first proposed by Prentice, Williams, and Peterson, models the hazard at each foul event time as a function of the current number of fouls accumulated and time since the last foul. The model is flexible and can include other covariates as needed. For this paper, our covariates include the lead or deficit in the score of the player’s team, game time in minutes, and an interaction between the two. We chose these covariates, as we believe that a closer game can have an impact on a player’s fouling rates. We include actual game time in minutes to reflect how close the game is to ending, and to account for potential overtime periods.

Let X_{ki} and C_{ki} be the foul and censoring times for the kth foul (k=1, 2, …, 6) in the ith game, and let Z_{ki} be the vector of covariates for the ith game with respect to the kth foul. We assume X_{ki} and C_{ki} are independent given Z_{ki}. We then define T_{ki}=\min(X_{ki},C_{ki}) and let \beta be a vector of unknown regression coefficients. Under the proportional hazards assumption, the hazard function for the kth foul in the ith game is:

\lambda_{k}(t,Z_{ki})=\lambda_{0}\left(t\right)e^{\beta Z_{ki}}
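For what it’s worth, here is a minimal sketch of fitting something like this in Python with lifelines. The frame `fouls` and its columns are hypothetical, and the paper’s exact specification (e.g. indicator covariates versus strata for the foul number) may differ.

```python
from lifelines import CoxPHFitter
import pandas as pd

def fit_risk_set_model(fouls):
    """Gap-time Cox model with foul-number indicators; one reading of the setup.

    `fouls` is a hypothetical frame with one row per (game, foul number):
    `gap` = player time since the previous foul, `observed` = 0 if the foul
    was censored by the end of the game, plus `margin` and `game_min`.
    """
    df = fouls[["gap", "observed", "foul_num", "margin", "game_min"]].copy()
    df["margin_x_time"] = df["margin"] * df["game_min"]
    # Indicators for the current foul count (1st foul = baseline), so each
    # coefficient is a log hazard ratio relative to the first foul.
    df = pd.get_dummies(df, columns=["foul_num"], drop_first=True, dtype=float)
    cph = CoxPHFitter()
    cph.fit(df, duration_col="gap", event_col="observed")
    cph.print_summary()  # the exp(coef) column gives hazard ratios
    return cph
```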

From Table 2, we can see that the difference in score has a minimal impact on fouling rates for Cousins, Horford, and Lopez, even after adjusting for game time. Closer games do not seem to cause more fouls to be committed. However, the total game time that has been played does have an impact: as time goes on, it appears that players are less likely to foul. This trend holds for our three players of interest and for all centers pooled, which is surprising considering the common view that players are more likely to foul later in the game. Taken together, the model says that a player who has already fouled becomes more likely to foul again as he accrues fouls, while a player who has not yet fouled becomes less likely to foul as time passes, since game time has a negative relationship with the hazard of fouling. This trend holds for all centers we analyzed. These results are in line with what we saw in Figure 1, and are similarly likely driven by the selection bias that precludes us from seeing every foul in every game.

Table 2: Conditional risk set model for ordered events for Cousins, Horford, Lopez, and all centers pooled. Foul rates increase and are significant. Game time decreases foul rates overall.

As before, we can limit our analysis to games where the players had at least 5 fouls, and examine only the first four fouls. Table 3 displays the survival model output for Cousins, Horford and Lopez when we use the restricted dataset. For all players, fouls 2, 3, and 4 are committed significantly sooner than the prior foul. To find the hazard ratios associated with each foul, we exponentiate the difference in the coefficients, since each coefficient is with respect to the baseline of the 1st foul. For example, when Cousins has 3 fouls he is 405% more likely to commit a foul at any given time than when he only has 2 fouls. Cousins is 303% more likely to commit a foul when he has four fouls compared to when he only has three. Although the hazard ratios increase dramatically with each foul, it is important to keep in mind that the initial probability of fouling at any given moment is low, as the first foul takes nearly 500 seconds (over 8 minutes) of playing time to occur on average for DeMarcus Cousins.
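Explicitly, since every coefficient shares the 1st-foul baseline, the foul-3-versus-foul-2 comparison exponentiates the difference of the two fitted coefficients:

\text{HR}_{3\text{ vs }2}=e^{\hat{\beta}_{3}}/e^{\hat{\beta}_{2}}=e^{\hat{\beta}_{3}-\hat{\beta}_{2}}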

It is interesting to note that the opposite effect happens with game time. As each minute passes in the game, Cousins is only 90% as likely to commit a foul as the previous minute. This trend holds for all players.

Table 3: Conditional risk set model for ordered events for Cousins, Horford, Lopez and all centers pooled when the player commits a minimum of 5 fouls. Foul rates increase and are significant.

From the table, we can see that although all players seem to have this “tilting” behavior, DeMarcus Cousins has a higher likelihood of committing a foul than other players as he accrues fouls. Cousins seems to “tilt” more than the other centers in our analysis. Part of this behavior may be explained by teams attacking players who already have many fouls, attempting to get them in foul trouble. However, we believe that no one factor can tell the complete story.

Part 4 of this series can be found here 

Catch & Shoot – EFG%

This is Part 2 of my series on Catch & Shoot jumpers
Part 1 can be found here

Last time, we ended by looking at a basic logistic regression predicting success of a shot, conditioned on whether a shot was: a catch & shoot (C&S), a three point attempt, and open. This time we will start considering effective field goal percentage (EFG%), which gives an additional bonus to three point shots.

For anybody unaware of the difference between FG% and EFG%, here is the brief but informative definition from basketball reference:

“Effective Field Goal Percentage; the formula is (FG + 0.5 * 3P) / FGA. This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal. For example, suppose Player A goes 4 for 10 with 2 threes, while Player B goes 5 for 10 with 0 threes. Each player would have 10 points from field goals, and thus would have the same effective field goal percentage (50%).”

Let’s start our investigation into EFG% by comparing EFG% to FG% for C&S vs pull-up jumpers split out between all shots and just 3 point shots.

FG%        All shots   3 point   2 point
C&S        39.3%       37.4%     43.3%
Pull-Up    34.9%       29.9%     37.0%

EFG%       All shots   3 point   2 point
C&S        52.0%       56.1%     43.3%
Pull-Up    39.2%       44.8%     37.0%

By using EFG% instead of FG%, it becomes much clearer that a C&S is a better shot than a pull-up jumper.

We could also split the data by whether or not these shots were open (as we first saw in part 1).

EFG%       All shots   Open     Defended
C&S        52.0%       53.3%    45.9%
Pull-Up    39.2%       41.4%    36.6%

We see that, of course, open shots are better than defended shots. However we can also see that using EFG% shows that a defended C&S is better than an open pull-up. Even without seeing the raw numbers, we suspect these results come from a large number of C&S shots being 3-point attempts.

So we could stratify further and look at C&S vs 3-point vs openness. And while it would be easy to make a number of stratified 2×2 tables, at a certain point it makes more sense to just use a model and account for as many variables as possible that could affect FG% or EFG%. Which is not to say that examining raw percentages is a bad idea. After all, tables are a simple way to compare different kinds of shots, and since we have a large number of shots, we won’t really run into any sparsity problems.

But I don’t want to spend too long just looking at basic statistics. So, let’s continue down our previous path of looking at a simple regression to predict shot success, and see how we can improve it. However, we quickly run into two potential problems.

The first problem is one we touched on previously – looking at confounders. We want to understand variables that affect whether a shot is successful and that affect a player’s decision to take a specific type of shot. Last time we looked at defender distance as a potential confounder. This time we will also consider the shot clock. If there are only a few seconds left on the clock, a player may not have time to drive to the basket, and will have to just shoot. For future analyses, I’d want to explore other variables that are potential confounders, such as game time remaining, the score, and who the closest defender is. But for now, let’s keep things relatively simple.

The second problem we will face is more complicated – how do we model EFG%? Modeling FG% is easy because our outcome is binary: a shot is successful or not. Logistic regression requires a binary outcome, so we can’t just give successful 3 point shots an outcome of 1.5. Most statistics software will allow us to use weights in a quasibinomial framework, but I can’t think of a good way to use weights to get at EFG%. Weights are used to create pseudo-populations that up-weight or down-weight certain shots depending on how representative they are. The problem with giving a successful 3 point shot a weight of 1.5 is that it doesn’t make the outcome 1.5; rather, it increases the representation of the characteristics of that shot.

If anybody has a way to examine EFG% using a weighted regression, please let me know. I only spent a few days thinking about this, and while I have a workaround, I would love to be able to show this analysis using a simple regression framework. But I cannot, for the life of me, think of a way to do it. I tried for a while to reframe the problem by using functionals instead of trying to target a regression parameter, but I still don’t think it works.

So what is my workaround? Don’t look at EFG%. Instead, split out 3 point shots and 2 point shots and examine them separately. 3 point shots and 2 point shots are different enough that trying to pool them into a single population will obscure the differences and lead to analytical problems, especially since it may be naive to assume a constant treatment effect of a C&S for both 2-point and 3-point shots. We could also split out the two kinds of shots and instead look at the expected number of points per shot. Stephen Shea has touched on this, which makes me think it is a good avenue for further investigation.

On a more philosophical level, there always seems to be this strong desire to collapse everything down to a single number. We see this a lot when we try to invent statistics that fully capture how good a player is with one number. And while I agree there is value in a single statistic, I also think there is value in nuance and increased granularity. My goal is to examine C&S shots, and there is no harm in splitting that out by shot value.

But I freely admit that I may be missing something obvious and there is an easy way to use EFG%. Again, if you have any ideas, please let me know.

Next time in this series, we will dive into causal effects of catch & shoot vs pull-up jumpers.

Part 3 can be found here

NBA Fouls – Data, basic stats and visualizations

This is part 2 of my series on DeMarcus Cousins and how NBA players accrue personal fouls.
Part 1 can be found here

I’ll be pulling edited sections from the paper I wrote with Udam Saini for the 2017 Sloan Sports Analytics Conference research paper competition. A full, finalized version of the paper will be available at a later date.

The goal of this project is to examine how NBA players accrue fouls and if it is possible to mitigate their foul tendencies through simple coaching decisions. Let’s start with getting some data and looking at basic foul rates.

Data

We examine play-by-play data and box-score data from the NBA for the 2011-2012, 2012-2013, 2013-2014, 2014-2015, and 2015-2016 seasons. This data is publicly available from http://www.nba.com. The play-by-play contains rich event data for each game. The box-score includes data for which players started the game, and which players were on the court at the start of a quarter. Data, in csv format, can be found here.

Using the box-score data and the substitutions in the play-by-play for each game, we can determine the amount of time any given player has actively played in the current game at each event in the play-by-play data. We look only at active player time, rather than game time, to accurately determine how often a player commits a foul. Throughout, any discussion of time refers to actual playing time; that is, individual person-time for each player. Using player play time should control for substitution patterns, as a player in foul trouble will likely not play until later in the game; if we used game time, it would artificially increase the time between fouls. Additionally, censoring times for each player in each game were generated. For example, if a player only committed 3 fouls in a game, an entry was generated for his 4th foul, with foul time equal to his maximum player time and an indicator that the foul did not occur. This is important, as we need to account for censored fouls in our analysis.
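A toy sketch of that censoring-row construction (the function and column names here are illustrative, not the actual pipeline):

```python
import pandas as pd

def with_censoring_row(foul_times, max_time, max_fouls=6):
    """Build one player-game's foul records, adding a censored entry if needed.

    `foul_times` is a list of player-time stamps (seconds) of observed fouls;
    `max_time` is that player's total player time in the game.
    """
    rows = [{"foul_num": k + 1, "time": t, "observed": 1}
            for k, t in enumerate(foul_times)]
    # If he fouled out there is nothing to censor; otherwise add one entry
    # for the never-observed next foul at his maximum player time.
    if len(foul_times) < max_fouls:
        rows.append({"foul_num": len(foul_times) + 1,
                     "time": max_time, "observed": 0})
    return pd.DataFrame(rows)

# Example: 3 observed fouls, so the 4th is censored at 1835 seconds.
print(with_censoring_row([212.0, 655.5, 1130.0], max_time=1835.0))
```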

For now, let us consider only centers in our analysis, to minimize the effects of differing fouling patterns between NBA positions. Overall, we will further limit ourselves to Al Horford, Andrew Bogut, Brook Lopez, DeMarcus Cousins, Dwight Howard, Marc Gasol, Robin Lopez, and Tyson Chandler.
In our analysis, we will focus on DeMarcus Cousins, Al Horford, and Robin Lopez as these three centers exhibit three distinct trends that we see in other centers that we analyzed. All centers considered share many of the same characteristics in our analysis as well.

Summary Statistics

Even simple analysis and statistics can give us some insight into how NBA players accrue up to 6 personal fouls over the course of a game. Table 1 gives a few summary statistics. Table 1a gives basic statistics for most of DeMarcus Cousins’ fouls from the 2011-2016 seasons. We can see that on average, Cousins commits his 1st personal foul after about 500 seconds (or about 8 minutes and 20 seconds) of personal playing time. By contrast, he commits his 4th foul about 300 seconds (or about 5 minutes) of personal playing time after committing his 3rd foul. Table 1b gives the same statistics for Al Horford, where we see that his 1st foul comes after an average of about 823 seconds, while he commits his 4th foul an average of about 311 seconds after his 3rd. From these numbers, it might appear that Horford is more “tilted”, given that his time between fouls shrinks more than Cousins’.

Table 1: Summary statistics for Cousins, Horford, Lopez and all centers pooled. Gives the average time to each foul by number, the number of games with exactly that many fouls, and the number of games with at least that many fouls. Cousins had 11 games with only 1 foul, but 115 with 5 or more.

However, the tables also show that Horford had 80 games in which he only recorded a single foul and only 37 games where he recorded 4 or more fouls. By contrast, Cousins had only 11 games with a single foul and 193 games with 4 or more. Because games often end before a player commits all six fouls, many of the foul times are right censored by the end of the game. These foul times are not included in simple summary statistics and therefore merely examining the average time to foul does not accurately reflect all the differences between players or how those players individually accrue fouls.

Visualization – Survival Curves

Next, we visualize foul rates for each player by using Kaplan-Meier survival curves. A survival curve, in general, is used to map the length of time that elapses before an event occurs. Here, they give the probability that a player has “survived” to a certain time without committing a particular foul (by number). These curves are useful for understanding how a player accrues fouls while accounting for the total length of time during which a player is followed, and allow us to compare how the different fouls are accrued.

Figure 1: Survival curves for Cousins, Horford and Lopez. Displays the probability that a player has “survived” to a certain time without committing a particular foul by number. These curves include fouls that are censored by the game ending, which obscures some of the patterns.

Figure 1a gives the overall survival curves for Cousins. From the graph, there appears to be some evidence his time to foul decreases as he accrues fouls because there is layering between the fouls. While the trend may seem small, it is much starker than that for other centers, as we can see in Figures 1b for Al Horford and 1c for Robin Lopez. Their curves appear much more random. The survival curve for Al Horford’s 6th foul seems abnormal, along with Robin Lopez to a smaller extent. This abnormality is likely explained by the small sample sizes for 6 fouls as seen in Table 1.

I’d like to note here that it is important to use survival curves in this scenario as it accounts for censoring. If we were to just look at the densities of fouls for a given player, we might falsely see a very different trend. Figure 2 shows raw foul densities for DeMarcus Cousins, and there is a clear ordering for the fouls.

Figure 2: Foul densities for DeMarcus Cousins. From this graph it would appear that there is a big difference in foul times. However, these graphs do not take censoring into account.

As mentioned above, if games were infinitely long, and players continued to play, we would observe every player until he committed his 6th personal foul and was removed from the game. As games are of finite length, many fouls are censored due to the end of follow-up time. Therefore, it makes sense that the 5th and 6th fouls would be subject to sampling bias. For example, if the 5th foul is committed with 4 minutes left in the game, we will never observe a 6th foul that comes 5 minutes later. To help adjust for this censoring, we considered limiting the analysis to only games where all 6 fouls were committed. However, this limitation severely restricts the sample sizes for all players. Instead, we will examine games for each player where he committed a minimum of five fouls and limit our analysis to the first four fouls. This foul restriction gives us a larger sample size, though it keeps us from learning how players accrue their 5th and 6th fouls.

Figure 3: Survival curves for the first 4 fouls for Cousins, Horford, and Lopez when the player commits a minimum of 5 fouls. Cousins has a clear ordering to how he commits these fouls given that his time to foul decreases as he accrues fouls. Horford does not display this trend.

Figures 3a, 3b, and 3c show the 5 foul minimum survival curves for Cousins, Horford, and Lopez. Cousins displays much clearer ordering, where the more fouls he accrues, the more likely he is to foul. However, Horford and Lopez show much less distinction between the fouls. Lopez shows some ordering, especially for his 4th foul, but Horford’s curves are fairly random.

Of course we are not controlling for nearly enough variables, and the sample size is sadly limited. Areas for further research will be discussed later. For now, however, we have a nice way to model and visualize fouls so we can understand them better moving forward.

Part 3 of this series can be found here 

Catch & Shoot – Basics

I’ve been interested in catch & shoot (C&S) jump shots for a while now, pretty much ever since I read this article by Stephen Shea. There is this idea that a C&S is better than a pull-up shot. And based on everything I’ve read and analyzed, this holds true. This led me to ask “what is it about a C&S that makes it such a superior shot?”

To answer this question, I started with the NBA definition of catch & shoot, namely “Any jump shot outside of 10 feet where a player possessed the ball for 2 seconds or less and took no dribbles.” Let’s break this definition down a little.
– A shot past 10 feet is further from the basket, which I would think decreases the likelihood of success. However, we can assume that when comparing to pull-up jumpers, the comparison group is past 10 feet as well. When the time comes to analyze data, we will just have to restrict to shots past 10 feet. No problem.
– Taking no dribbles means a player doesn’t need to take time to collect the ball or fight the momentum of movement. It makes sense that not having to dribble would increase the chance of making a basket.
– On the surface, possessing the ball for 2 seconds or less seems like it would give a player less time to set himself and shoot. Therefore I would assume short possession time would decrease the probability of success. However, a shorter possession means that the defense has less time to get in position as well. And this is where I get irked about the definition of catch & shoot.

When I began to look at C&S vs pull-up jumpers, I hypothesized that the effect of the defense was confounding the effect of a C&S. I still thought a C&S shot would be better than a pull-up, but without controlling for defense, I was skeptical of how much better.

Let’s take a step back and define some terms. If we think of a C&S as a “treatment” (my education and training are in public health, so I often default to medical terms) compared to a “control” shot being a pull-up jumper, and shot success as the outcome, then our goal is to examine the treatment effect of a C&S on the outcome. We can also say that the type of shot is the independent variable, and success of the shot is the dependent variable. But shots are not randomized to be C&S or pull-up, so just looking at raw numbers won’t necessarily give us the full picture. A “confounder” is any variable that affects both the treatment and the outcome. Defense is very likely a confounder for any shot, as a player is more likely to take, and more likely to make, a wide open shot. Similarly, he is less likely to take and less likely to make a highly contested shot.

The NBA has a really great statistics site, http://stats.nba.com/. However, it doesn’t get to the granularity that I want. Thankfully www.nbasavant.com does. I went and pulled shots from the 2014-2015 season. I chose that season because starting in 2016, defender data isn’t available. I also restricted to players who took at least 100 jump shots. I ended up with 50,000 shots (I assume this size is preset), which I then further restricted to 38,384 jump shots that were over 10 feet.

Of these 38,384 shots, 21,917 had zero dribbles and a possession time of 2 seconds or less, and thus were defined as catch & shoot. The remaining 16,467 shots were labeled as pull-up shots.

One more definition, for the purposes of this analysis (and future analyses), I consider any shot where the closest defender is more than 4 feet away to be an “open” shot. All other shots are considered “defended.”

Here are some basic statistics:

Counts     Defended   Open
C&S        4049       17868
Pull-Up    7468       8999

Percent    Defended   Open
C&S        10.54%     46.55%
Pull-Up    19.46%     23.44%

Most shots are open C&S. The majority of C&S shots are open, and the majority of open shots are C&S.

I could split this out further and look at 2pt shots vs 3 pt shots, but let’s skip that for now and put it into a simple logistic regression modeling the probability a shot is successful. We can include an indicator for whether or not a shot was a 3pt attempt:

             Estimate   Std. Error   z value   Pr(>|z|)
Intercept    -0.5467     0.0175      -31.23     0.0000
C&S           0.2955     0.0233       12.70     0.0000
3pt          -0.2727     0.0230      -11.88     0.0000

Now let’s control for whether a shot was open or not:

             Estimate   Std. Error   z value   Pr(>|z|)
Intercept    -0.6305     0.0217      -29.06     0.0000
C&S           0.2597     0.0239       10.88     0.0000
open          0.1626     0.0245        6.62     0.0000
3pt          -0.2927     0.0232      -12.64     0.0000

Side note: I try not to pay too much attention to p-values, preferring to focus on effect sizes.

We see that C&S does increase the chance of success, though the effect is mitigated when we also control for whether or not a shot was open. I also fit the models with various interactions, but they had little impact (in fact the interaction between C&S and open was negative, though small).
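For reproducibility, here is a sketch of the two fits above. The frame `shots` and its 0/1 columns (`made`, `cands`, `open_shot`, `three`) are hypothetical stand-ins for the pulled data.

```python
import statsmodels.formula.api as smf

def fit_shot_models(shots):
    """Refit the two logistic regressions above.

    `shots` is a hypothetical DataFrame of jumpers over 10 feet with 0/1
    columns: made, cands (C&S), open_shot, three (3pt attempt).
    """
    m1 = smf.logit("made ~ cands + three", data=shots).fit()
    m2 = smf.logit("made ~ cands + open_shot + three", data=shots).fit()
    # Coefficients are log odds ratios; exponentiate to interpret, e.g.
    # exp(0.2597) ~= 1.30, i.e. roughly 30% higher odds of success on a C&S.
    return m1, m2
```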

This is why I am always a little irked by the definition of a catch & shoot – it doesn’t account for how open the player is. A player with a high C&S FG% who mostly takes open shots should be evaluated differently than a player with a similarly high C&S FG% who takes mostly defended shots. Yet raw C&S FG% or C&S EFG% obscures this difference.

We are just getting started here. After all, so far I’ve only presented simple regressions with some (confusing and hard to interpret) log odds ratio coefficients. In future posts I will get into how we can estimate the causal effect of a catch & shoot, not only for raw field goal percentage, but also for effective field goal percentage, which gives a bonus to three point shots.

Part 2 of this series can be found here