# CausalKathy

## Super Bowl Reactions

## NBA Fouls – Data, basic stats and visualizations

## Catch & Shoot – Basics

## NBA Fouls – I Love DeMarcus Cousins

## Welcome

Causal inference and other statistics in the world of sports

Inner statistician reaction:

Oh God, Sloan is going to be *insufferable* this year, isn’t it?

Inner sports fan reaction:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHHHHHHHHH WOW WOW WOW WOW WHAT A GAME!!!!!!

This is part 2 of my series on DeMarcus Cousins and how NBA players accrue personal fouls.

Part 1 can be found here

I’ll be pulling edited sections from the paper I wrote with Udam Saini for the 2017 Sloan Sports Analytics Conference research paper competition. A full, finalized version of the paper will be available at a later date.

The goal of this project is to examine how NBA players accrue fouls and whether it is possible to mitigate their foul tendencies through simple coaching decisions. Let’s start with getting some data and looking at basic foul rates.

**Data**

We examine play-by-play data and box-score data from the NBA for the 2011-2012, 2012-2013, 2013-2014, 2014-2015, and 2015-2016 seasons. This data is publicly available from http://www.nba.com. The play-by-play contains rich event data for each game. The box-score includes data for which players started the game, and which players were on the court at the start of a quarter. Data, in csv format, can be found here.

Using the box-score data and the substitutions in the play-by-play for each game, we can determine how much time any given player has actively played in the current game at each event in the play-by-play data. We look only at active player time, rather than game time, to accurately determine how often a player commits a foul. Unless noted otherwise, “time” throughout refers to this actual play time; that is, individual person-time for each player. Using player play time should control for substitution patterns, as a player in foul trouble will likely not play until later in the game. If we used game time, the time he spends on the bench would artificially increase the time between fouls. Additionally, censoring times for each player in each game were generated. For example, if a player committed only 3 fouls in a game, an entry was generated for his 4th foul, with foul time equal to the max player time and an indicator that the foul did not occur. This is important as we need to account for censored fouls in our analysis.
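To make the censoring step concrete, here is a minimal sketch of how such entries could be generated for one player in one game. The function name, the field names, and the extra `gap` column are my own assumptions for illustration, not the structure of the actual dataset.

```python
MAX_FOULS = 6  # a player is disqualified on his 6th personal foul


def foul_records(foul_times, total_play_time):
    """Build one row per observed foul plus one censored row for the next foul.

    foul_times: the player's own time on court (seconds) at each foul, in order.
    total_play_time: the player's total time on court (seconds) in the game.
    """
    rows = []
    previous = 0
    for number, t in enumerate(foul_times, start=1):
        rows.append({"foul": number, "time": t, "gap": t - previous, "observed": True})
        previous = t
    # The next foul never happened: record it as censored at the max player time.
    if len(foul_times) < MAX_FOULS:
        rows.append({
            "foul": len(foul_times) + 1,
            "time": total_play_time,
            "gap": total_play_time - previous,
            "observed": False,
        })
    return rows


# Hypothetical game: 3 fouls at 300s, 900s and 1500s of play time, 2100s played.
for row in foul_records([300, 900, 1500], 2100):
    print(row)
```

In this made-up game the 4th foul shows up as a censored row at 2100 seconds, which is the kind of entry the survival analysis needs.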

For now, let us consider only centers in our analysis, to minimize the effects of differing fouling patterns across NBA positions. We will further limit ourselves to Al Horford, Andrew Bogut, Brook Lopez, DeMarcus Cousins, Dwight Howard, Marc Gasol, Robin Lopez, and Tyson Chandler.

In our analysis, we will focus on DeMarcus Cousins, Al Horford, and Robin Lopez as these three centers exhibit three distinct trends that we see in other centers that we analyzed. All centers considered share many of the same characteristics in our analysis as well.

**Summary Statistics**

Even simple analysis and statistics can give us some insight into how NBA players accrue up to 6 personal fouls over the course of a game. Table 1 gives a few summary statistics. Table 1a gives basic statistics for most of DeMarcus Cousins’ fouls from the 2011-2016 seasons. We can see that, on average, Cousins commits his 1st personal foul after about 500 seconds (or about 8 minutes and 20 seconds) of personal playing time. By contrast, he commits his 4th foul about 300 seconds (or about 5 minutes) of personal playing time after committing his 3rd foul. Table 1b gives the same statistics for Al Horford, where we see that his 1st foul comes after an average of about 823 seconds while he commits his 4th foul an average of about 311 seconds after his 3rd. From these numbers, it might appear that Horford is more “tilted,” given that his time between fouls shrinks more than Cousins’ does.

However, the tables also show that Horford had 80 games in which he only recorded a single foul and only 37 games where he recorded 4 or more fouls. By contrast, Cousins had only 11 games with a single foul and 193 games with 4 or more. Because games often end before a player commits all six fouls, many of the foul times are right censored by the end of the game. These foul times are not included in simple summary statistics and therefore merely examining the average time to foul does not accurately reflect all the differences between players or how those players individually accrue fouls.

**Visualization – Survival Curves**

Next, we visualize foul rates for each player by using Kaplan-Meier survival curves. A survival curve, in general, is used to map the length of time that elapses before an event occurs. Here, the curves give the probability that a player has “survived” to a certain time without committing a particular numbered foul. These curves are useful for understanding how a player accrues fouls while accounting for the total length of time during which a player is followed, and they allow us to compare how the different fouls are accrued.
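For readers who have not seen the estimator before, here is a minimal pure-Python sketch of the Kaplan-Meier calculation. It is meant to show the idea, not to reproduce the code behind the figures; the function name and the toy numbers are made up.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where at least one foul occurred.

    durations: time until the foul, or until censoring (e.g. the end of the game).
    observed: True if the foul occurred, False if the time is censored.
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Group every record that shares this time.
        while i < len(data) and data[i][0] == t:
            events += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if events > 0:
            # Multiply by the share of the risk set that did not foul at time t.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Fouled and censored players both leave the risk set after time t.
        at_risk -= leaving
    return curve


# Toy data (seconds of play time); two of the five times are censored.
times = [100, 200, 200, 300, 400]
fouled = [True, False, True, True, False]
for t, s in kaplan_meier(times, fouled):
    print(t, round(s, 3))
```

Censored times never pull the curve down, but they do shrink the risk set for later times, which is how the estimator uses the information that a player went at least that long without fouling.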

Figure 1a gives the overall survival curves for Cousins. From the graph, there appears to be some evidence that his time to foul decreases as he accrues fouls, because there is layering between the fouls. While the trend may seem small, it is much starker than that for other centers, as we can see in Figure 1b for Al Horford and Figure 1c for Robin Lopez. Their curves appear much more random. The survival curve for Al Horford’s 6th foul seems abnormal, as does Robin Lopez’s to a smaller extent. This abnormality is likely explained by the small sample sizes for 6 fouls, as seen in Table 1.

I’d like to note here that it is important to use survival curves in this scenario, as they account for censoring. If we were to just look at the densities of fouls for a given player, we might falsely see a very different trend. Figure 2 shows raw foul densities for DeMarcus Cousins, and there is a clear ordering for the fouls.

As mentioned above, if games were infinitely long, and players continued to play, we would observe every player until he committed his 6th personal foul and was removed from the game. As games are of finite length, many fouls are censored due to the end of follow-up time. Therefore, it makes sense that the 5th and 6th fouls would be subject to sampling bias. For example, if the 5th foul is committed with 4 minutes left in the game, we will never observe a 6th foul that comes 5 minutes later. To help adjust for this censoring, we considered limiting analysis to only games where all 6 fouls were committed. However, this limitation severely restricts the sample sizes for all players. Instead, we will examine games for each player where they committed a minimum of five fouls and limit our analysis to the first four fouls. This foul restriction gives us a larger sample size, though it restricts us from gaining understanding about how players accrue their 5th and 6th fouls.
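The restriction described above could be sketched like this, assuming each game is stored as an ordered list of the player's foul times. The dictionary layout and the game ids are hypothetical.

```python
def first_four_in_five_foul_games(games, min_fouls=5, keep=4):
    """Keep games with at least `min_fouls` fouls, and only the first `keep` fouls.

    games: maps a game id to the ordered play-time (seconds) of each foul.
    """
    return {
        game_id: times[:keep]
        for game_id, times in games.items()
        if len(times) >= min_fouls
    }


# Made-up games for one player: only g1 and g3 reach five fouls.
games = {
    "g1": [100, 400, 700, 900, 1200],
    "g2": [200, 800],
    "g3": [50, 300, 600, 800, 1000, 1300],
}
print(first_four_in_five_foul_games(games))
```

Because every retained game has a 5th foul, none of the first four foul times in the restricted sample is censored by the end of the game.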

Figures 3a, 3b, and 3c show the 5 foul minimum survival curves for Cousins, Horford, and Lopez. Cousins displays much clearer ordering, where the more fouls he accrues, the more likely he is to foul. However, Horford and Lopez show much less distinction between the fouls. Lopez shows some ordering, especially for his 4th foul, but Horford’s curves are fairly random.

Of course, we are not controlling for nearly enough variables and the sample size is sadly limited. A full discussion of areas for further research will follow later. However, for now we have a nice way to model and visualize fouls so we can understand them better moving forward.

I’ve been interested in catch & shoot (C&S) jump shots for a while now, pretty much ever since I read this article by Stephen Shea. There is this idea that a C&S is better than a pull-up shot. And based on everything I’ve read and analyzed, this holds true. This led me to ask “what is it about a C&S that makes it such a superior shot?”

To answer this question, I started with the NBA definition of catch & shoot, namely “Any jump shot outside of 10 feet where a player possessed the ball for 2 seconds or less and took no dribbles.” Let’s break this definition down a little.

– A shot past 10 feet is further from the basket, which I would think decreases the likelihood of success. However, we can assume that when comparing to pull-up jumpers, the comparison group is past 10 feet as well. When the time comes to analyze data, we will just have to restrict to shots past 10 feet. No problem.

– Taking no dribbles means a player doesn’t need to take time to collect the ball or fight the momentum of movement. It makes sense that not having to dribble would increase the chance of making a basket.

– On the surface, possessing the ball for 2 seconds or less seems like it would give a player less time to set himself and shoot. Therefore I would assume short possession time would decrease the probability of success. However, a shorter possession means that the defense has less time to get in position as well. And this is where I get irked about the definition of catch & shoot.

When I began to look at C&S vs pull-up jumpers, I hypothesized that the effect of the defense was confounding the effect of a C&S. I still thought a C&S shot would be better than a pull-up, but without controlling for defense, I was skeptical of how much better.

Let’s take a step back and define some terms. If we think of a C&S as a “treatment” (my education and training are in public health, so I often default to medical terms) compared to a “control” shot being a pull-up jumper, and shot success as the outcome, then our goal is to examine the treatment effect of a C&S on the outcome. We can also say that the type of shot is the independent variable, and success of the shot is the dependent variable. But shots are not randomized to be C&S or pull-up, so just looking at raw numbers won’t necessarily give us the full picture. A “confounder” is any variable that affects both the treatment and the outcome. Defense is very likely a confounder for any shot, as a player is more likely to take and more likely to make a wide open shot. Similarly, he is less likely to take and less likely to make a highly contested shot.

The NBA has a really great statistics site, http://stats.nba.com/. However, it doesn’t get to the granularity that I want. Thankfully www.nbasavant.com does. I went and pulled 50,000 shots from the 2014-2015 season. I chose that season because starting in 2016, defender data isn’t available. I also restricted to players who took at least 100 jump shots. I ended up with 50,000 shots (I assume this size is preset), which I then further restricted to 38,384 jump shots that were over 10 feet.

Of these 38,384 shots 21,917 had zero dribbles and a possession of 2 seconds or less and thus were defined as a catch & shoot. The remaining 16,467 shots were labeled as pull-up shots.

One more definition, for the purposes of this analysis (and future analyses), I consider any shot where the closest defender is more than 4 feet away to be an “open” shot. All other shots are considered “defended.”
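Putting the definitions above together, a shot could be labeled along these lines. This is a minimal sketch; the function and argument names are my own and are not the column names used by nbasavant.

```python
def label_shot(distance_ft, dribbles, touch_time_s, defender_distance_ft):
    """Return (shot_type, coverage) for a jump shot, or None if it is too close.

    Thresholds follow the definitions in this post: shots over 10 feet only,
    C&S means zero dribbles and 2 seconds or less of possession, and "open"
    means the closest defender is more than 4 feet away.
    """
    if distance_ft <= 10:
        return None  # excluded from the analysis
    shot_type = "C&S" if dribbles == 0 and touch_time_s <= 2 else "Pull-Up"
    coverage = "Open" if defender_distance_ft > 4 else "Defended"
    return shot_type, coverage


# Hypothetical shots
print(label_shot(24.0, 0, 0.8, 6.5))  # ('C&S', 'Open')
print(label_shot(15.0, 3, 4.2, 2.0))  # ('Pull-Up', 'Defended')
print(label_shot(8.0, 0, 1.0, 5.0))   # None
```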

Here are some basic statistics:

| Counts | Defended | Open |
|---|---|---|
| C&S | 4049 | 17868 |
| Pull-Up | 7468 | 8999 |

| Percent of all shots | Defended | Open |
|---|---|---|
| C&S | 10.54% | 46.55% |
| Pull-Up | 19.46% | 23.44% |

Open C&S is the largest single category, at nearly half of all shots. The majority of C&S shots are open, and the majority of open shots are C&S.
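The conditional shares behind those statements can be checked directly from the counts in the table. A quick sketch, with variable names of my own choosing:

```python
# Counts copied from the table above: (shot type, coverage) -> number of shots
counts = {
    ("C&S", "Defended"): 4049,
    ("C&S", "Open"): 17868,
    ("Pull-Up", "Defended"): 7468,
    ("Pull-Up", "Open"): 8999,
}

total = sum(counts.values())
cs_total = counts[("C&S", "Defended")] + counts[("C&S", "Open")]
pullup_total = counts[("Pull-Up", "Defended")] + counts[("Pull-Up", "Open")]
open_total = counts[("C&S", "Open")] + counts[("Pull-Up", "Open")]

open_given_cs = counts[("C&S", "Open")] / cs_total
open_given_pullup = counts[("Pull-Up", "Open")] / pullup_total
cs_given_open = counts[("C&S", "Open")] / open_total

print(f"total shots:               {total}")
print(f"share of C&S that is open: {open_given_cs:.1%}")
print(f"share of pull-ups open:    {open_given_pullup:.1%}")
print(f"share of open that is C&S: {cs_given_open:.1%}")
```

Roughly four in five C&S shots are open, compared with a little over half of pull-ups, which is exactly the kind of imbalance that makes defense a plausible confounder.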

I could split this out further and look at 2pt shots vs 3 pt shots, but let’s skip that for now and put it into a simple logistic regression modeling the probability a shot is successful. We can include an indicator for whether or not a shot was a 3pt attempt:

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| Intercept | -0.5467 | 0.0175 | -31.23 | 0.0000 |
| C&S | 0.2955 | 0.0233 | 12.70 | 0.0000 |
| 3pt | -0.2727 | 0.0230 | -11.88 | 0.0000 |

Now let’s control for whether a shot was open or not:

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| Intercept | -0.6305 | 0.0217 | -29.06 | 0.0000 |
| C&S | 0.2597 | 0.0239 | 10.88 | 0.0000 |
| open | 0.1626 | 0.0245 | 6.62 | 0.0000 |
| 3pt | -0.2927 | 0.0232 | -12.64 | 0.0000 |

Side note: I try not to pay too much attention to p-values, preferring to focus on effect sizes.

We see that C&S does increase the chance of success, though the estimated effect shrinks (from 0.2955 to 0.2597 on the log-odds scale) when we also control for whether or not a shot was open. I also fit the models with various interactions, but they had little impact (in fact the interaction between C&S and open was negative, though small).
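Since log-odds coefficients are hard to read, here is a small sketch that turns the estimates above into odds ratios and predicted make probabilities. The coefficients are copied from the two tables; the function and variable names are mine, and the probabilities are simply what the fitted model implies, not observed shooting percentages.

```python
import math


def inv_logit(x):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-x))


# Coefficients from the model that controls for whether the shot was open
coef = {"intercept": -0.6305, "cs": 0.2597, "open": 0.1626, "three": -0.2927}


def predicted_probability(cs, is_open, three):
    """Model-implied make probability; each argument is 0 or 1."""
    log_odds = (
        coef["intercept"]
        + coef["cs"] * cs
        + coef["open"] * is_open
        + coef["three"] * three
    )
    return inv_logit(log_odds)


# Odds ratio for C&S vs pull-up, before and after controlling for openness
or_unadjusted = math.exp(0.2955)
or_adjusted = math.exp(0.2597)
print(f"C&S odds ratio, unadjusted: {or_unadjusted:.2f}")
print(f"C&S odds ratio, adjusted:   {or_adjusted:.2f}")

# A defended pull-up two vs an open C&S two
print(f"defended pull-up 2pt: {predicted_probability(0, 0, 0):.3f}")
print(f"open C&S 2pt:         {predicted_probability(1, 1, 0):.3f}")
```

On the odds scale, the C&S advantage drops from roughly 1.34 to roughly 1.30 once openness is in the model.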

This is why I am always a little irked by the definition of a catch & shoot – it doesn’t account for how open the player is. A player with a high C&S FG% who mostly takes open shots should be evaluated differently than a player with a similarly high C&S FG% who takes mostly defended shots. Yet raw C&S FG% or C&S EFG% obscures this difference.

We are just getting started here. After all, so far I’ve only presented simple regressions with some (confusing and hard to interpret) log odds ratio coefficients. In future posts I will get into how we can estimate the causal effect of a catch & shoot not only for raw field goal percentage, but also for effective field goal percentage, which gives a bonus to three point shots.

This is part 1 of my series on DeMarcus Cousins and how NBA players accrue personal fouls.

If you’ve ever talked about the NBA with me for any significant amount of time, you will know that one of my favorite players is DeMarcus Cousins (my favorite player being, of course, Shaun Livingston). I’ve always liked Boogie for the usual reasons related to his abilities on the court, of course, but also because he has been the focus and inspiration of much of my analytics work for the past year. Before I get into all the details and numbers, I’d like to share the story of how it all came to pass.

I am from Berkeley, California. As such, my home team is the Golden State Warriors. Last school year, during my Christmas break I was able to attend the December 28th 2015 matchup between the Warriors and Sacramento Kings at Oracle Arena. Many people remember that game because the end of the first half featured a three-point shoot out between Stephen Curry and Omri Casspi:

It was incredibly exciting and made for a close game.

A few months later, at the 2016 Sloan Sports Analytics Conference, that sequence came up in a conversation with my friend who was also in attendance. I mentioned that I was at that game, but that the end of the first half wasn’t what I remembered most about the game. What I remember is this:

It was my first time seeing Cousins get ejected and even from my seat in the upper bowl, I could feel how frustrated and upset he was about the whole ordeal.

My friend commented that “if only Boogie could be, like, 15% less angry – he would be the most dominant player in the game.” Which of course got me thinking – how *would* you quantify how mad DeMarcus Cousins is at any given time?

A potential answer presented itself several months later at the 2016 Joint Statistical Meetings where I attended a session titled “For the Love of the Game: Applications of Statistics in Sports.” In that session, Douglas VanDerwerken presented “Does the Threat of Suspension Curb Dangerous Behavior in Soccer? A Case Study from the Premier League.” This paper (which can be found here for interested readers) showed that as EPL players approach the yellow card limit, and thus face suspension, they are less likely to foul.

Thinking back to that December 28th game, and many additional Kings games I have watched, it seemed to me that Cousins would get heated and “tilted” and play more aggressively and therefore foul more often. I hypothesized that the more Cousins fouled, the more likely he was to foul again.

He does. But he’s not the only one.

I’ll get into the math/stats in a later post. But here is a general idea of how we can think about this problem. Given there is a fixed amount of time that a given player is on the court, we might expect fouls to follow a Poisson arrival process with inter-arrival times following an Exponential distribution, where each foul is independent of the previous fouls. We can consider a survival model and look at the “failure time” for each foul, in other words the time it takes a player to commit his 1st foul, 2nd foul, etc. If, for example, the time between the 2nd and 3rd foul is significantly longer than the time between the 4th and 5th foul, we would have evidence of some sort of “tilt.” We can model foul rates using a conditional risk set model for ordered events and do some analysis with a stratified Cox model. From there we can try to identify if there are any actions a coach/team can take in order to mitigate increased fouling rates.
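For the curious, the two ideas above can be written down roughly as follows. The notation is my own shorthand here, and the second line is the textbook gap-time form of a conditional risk set model; the exact specification used in the paper may differ.

```latex
% No-tilt baseline: the gap T_k between foul k-1 and foul k (in play time)
% is exponential with the same rate \lambda for every foul number k
T_k \sim \mathrm{Exponential}(\lambda), \qquad
S(t) = \Pr(T_k > t) = e^{-\lambda t}, \qquad
\mathbb{E}[T_k] = \frac{1}{\lambda}

% Conditional risk set (stratified Cox) form: one baseline hazard per foul number,
% and a player is at risk for foul k only after committing foul k-1
h_k(t \mid x) = h_{0k}(t)\, \exp\!\left(\beta^{\top} x\right), \qquad k = 1, \dots, 6
```

Here $x$ stands for whatever covariates are included and $\beta$ for their coefficients. Under the no-tilt baseline every $h_{0k}$ would be the same constant $\lambda$; "tilt" would show up as baseline hazards that rise with the foul number $k$.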

I’ll save the details for later.

Part 2 of this series can be found here

Every time I attend a talk about sports statistics, or read an interview, or talk to people in the field, inevitably the question of “how do I break into the industry” arises. And inevitably, the answer includes “put your stuff out there.”

Per that advice, this site will be my chance to share the work I have done in sports analytics.

Though to be fair, I was sorely tempted to purposefully *not* make a blog/site in order to act as a control of sorts. But I really did want to share some of the fun things I’ve been working on and solicit ideas for how to improve.

This site will evolve over time – depending on how focused on my dissertation I am at any given time. It will focus mostly on sports statistics/analytics, but really anything is game.

Enjoy!