Catch & Shoot – EFG%

This is Part 2 of my series on Catch & Shoot jumpers.
Part 1 can be found here

Last time, we ended by looking at a basic logistic regression predicting success of a shot, conditioned on whether a shot was: a catch & shoot (C&S), a three point attempt, and open. This time we will start considering effective field goal percentage (EFG%), which gives an additional bonus to three point shots.

For anybody unaware of the difference between FG% and EFG%, here is the brief but informative definition from basketball reference:

“Effective Field Goal Percentage; the formula is (FG + 0.5 * 3P) / FGA. This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal. For example, suppose Player A goes 4 for 10 with 2 threes, while Player B goes 5 for 10 with 0 threes. Each player would have 10 points from field goals, and thus would have the same effective field goal percentage (50%).”
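The formula is simple enough to sanity-check in a few lines of Python, using the Player A / Player B example from the quote:

```python
def efg(fgm, fg3m, fga):
    """Effective field goal percentage: (FG + 0.5 * 3P) / FGA."""
    return (fgm + 0.5 * fg3m) / fga

# Player A goes 4-for-10 with 2 threes; Player B goes 5-for-10 with 0 threes
player_a = efg(4, 2, 10)  # (4 + 0.5 * 2) / 10 = 0.50
player_b = efg(5, 0, 10)  # (5 + 0.5 * 0) / 10 = 0.50
```

Both players land at 50%, exactly as the definition promises, despite their different raw FG%.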

Let’s start our investigation into EFG% by comparing EFG% to FG% for C&S vs pull-up jumpers, split out across all shots, 3-point shots, and 2-point shots.

FG%       All shots   3 point   2 point
C&S       39.3%       37.4%     43.3%
Pull-Up   34.9%       29.9%     37.0%

EFG%      All shots   3 point   2 point
C&S       52.0%       56.1%     43.3%
Pull-Up   39.2%       44.8%     37.0%

By using EFG% instead of FG%, it becomes much clearer that a C&S is a better shot than a pull-up jumper.

We could also split the data by whether or not these shots were open (as we first saw in part 1).

EFG%      All shots   Open    Defended
C&S       52.0%       53.3%   45.9%
Pull-Up   39.2%       41.4%   36.6%

We see that, of course, open shots are better than defended shots. However, we can also see that using EFG% shows that a defended C&S is better than an open pull-up. Even without seeing the raw numbers, we can suspect these results come from a large share of C&S shots being 3-point attempts.

So we could stratify further and look at C&S vs 3-point vs openness. And while it would be easy to make a number of stratified 2×2 tables, at a certain point it makes more sense to just use a model and account for as many variables as possible that could affect FG% or EFG%. Which is not to say that examining raw percentages is a bad idea. After all, tables are a simple way to compare different kinds of shots, and since we have a large number of shots, we won’t really run into any sparsity problems.

But I don’t want to spend too long just looking at basic statistics. So, let’s continue down our previous path of looking at a simple regression to predict shot success, and see how we can improve it. However, we quickly run into two potential problems.

The first problem is one we touched on previously – confounders. We want to understand variables that affect whether a shot is successful and that also affect a player’s decision to take a specific type of shot. Last time we looked at defender distance as a potential confounder. This time we will also consider the shot clock. If there are only a few seconds left on the clock, a player may not have time to drive to the basket, and will have to just shoot. For future analyses, I’d want to explore other potential confounders such as game time remaining, the score, and who the closest defender is. But for now, let’s keep things relatively simple.

The second problem we will face is more complicated – how do we model EFG%? Modeling FG% is easy because our outcome is binary: a shot is successful or not. Logistic regression requires a binary outcome, so we can’t just give successful 3-point shots an outcome of 1.5. Most statistics software will allow us to use weights in a quasibinomial framework, but I can’t think of a good way to use weights to get at EFG%. Weights are used to create pseudo-populations that up-weight or down-weight certain shots depending on how representative they are. The problem with giving a successful 3-point shot a weight of 1.5 is that it doesn’t make the outcome 1.5; rather, it increases the representation of the characteristics of that shot.

If anybody has a way to examine EFG% using a weighted regression, please let me know. I only spent a few days thinking about this, and while I have a workaround, I would love to be able to show this analysis using a simple regression framework. But I cannot, for the life of me, think of a way to do it. I tried for a while to reframe the problem using functionals instead of trying to target a regression parameter, but I still don’t think it works.

So what is my workaround? Don’t look at EFG%. Instead, split out 3-point shots and 2-point shots and examine them separately. 3-point shots and 2-point shots are different enough that trying to pool them into a single population will obscure the differences and lead to analytical problems, especially since it may be naive to assume a constant treatment effect of C&S for both 2-point and 3-point shots. We could also split out the two kinds of shots and instead look at the expected number of points per shot. Stephen Shea has touched on this, which makes me think it is a good avenue for further investigation.
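As a rough illustration of the points-per-shot idea, here is a minimal Python sketch using the FG% values from the tables above, with each shot type handled separately rather than pooled into one EFG%:

```python
def points_per_shot(fg_pct, shot_value):
    """Expected points per attempt for a single shot type."""
    return fg_pct * shot_value

# FG% values taken from the C&S vs pull-up tables above
pps = {
    ("C&S", 3):     points_per_shot(0.374, 3),  # ~1.12 points per attempt
    ("C&S", 2):     points_per_shot(0.433, 2),  # ~0.87
    ("Pull-Up", 3): points_per_shot(0.299, 3),  # ~0.90
    ("Pull-Up", 2): points_per_shot(0.370, 2),  # ~0.74
}
```

Splitting by shot value makes the comparison explicit: a C&S beats a pull-up within each shot type, and we never need the awkward 1.5 weighting.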

On a more philosophical level, there always seems to be this strong desire to collapse everything down to a single number. We see this a lot when we try to invent statistics that fully capture how good a player is with one number. And while I agree there is value in a single statistic, I also think there is value in nuance and increased granularity. My goal is to examine C&S shots, and there is no harm in splitting that out by shot value.

But I freely admit that I may be missing something obvious and there is an easy way to use EFG%. Again, if you have any ideas, please let me know.

Next time in this series, we will dive into causal effects of catch & shoot vs pull-up jumpers.

Part 3 can be found here

NBA Fouls – Data, basic stats and visualizations

This is part 2 of my series on DeMarcus Cousins and how NBA players accrue personal fouls.
Part 1 can be found here

I’ll be pulling edited sections from the paper I wrote with Udam Saini for the 2017 Sloan Sports Analytics Conference research paper competition. A full, finalized version of the paper will be available at a later date.

The goal of this project is to examine how NBA players accrue fouls and if it is possible to mitigate their foul tendencies through simple coaching decisions. Let’s start with getting some data and looking at basic foul rates.


We examine play-by-play data and box-score data from the NBA for the 2011-2012, 2012-2013, 2013-2014, 2014-2015, and 2015-2016 seasons. This data is publicly available. The play-by-play contains rich event data for each game. The box score includes data on which players started the game and which players were on the court at the start of each quarter. Data, in csv format, can be found here.

Using the box-score data and the substitutions in the play-by-play for each game, we can determine the amount of time any given player has actively played in the current game at each event in the play-by-play data. We look only at active player time, rather than game time, to accurately determine how often a player commits a foul. Most discussion of time throughout refers to actual play time; that is, individual person-time for each player. Using player play time should control for substitution patterns, as a player in foul trouble will likely not play until later in the game; if we used game time, this would artificially inflate the time between fouls. Additionally, we generated censoring times for each player in each game. For example, if a player committed only 3 fouls in a game, an entry was generated for his 4th foul, with foul time equal to his maximum play time and an indicator that the foul did not occur. This is important, as we need to account for censored fouls in our analysis.
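Here is a minimal sketch of that censoring step (the record layout is my own for illustration, not the paper's):

```python
def add_censoring_row(foul_times, max_play_time):
    """Build per-foul records for one player-game.

    foul_times:    seconds of personal play time at each observed foul
    max_play_time: total seconds the player was on the court
    Returns (foul_number, time, observed) tuples; observed=False marks
    the right-censored entry for the foul that never happened.
    """
    rows = [(i + 1, t, True) for i, t in enumerate(foul_times)]
    # e.g. a 3-foul game gets a censored 4th-foul entry at max play time
    rows.append((len(foul_times) + 1, max_play_time, False))
    return rows

# A game where the player committed 3 fouls and played 1,800 seconds
rows = add_censoring_row([412.0, 901.5, 1502.0], 1800.0)
```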

For now, let us consider only centers in our analysis, to minimize the effects of differing fouling patterns across NBA positions. Overall, we will further limit ourselves to Al Horford, Andrew Bogut, Brook Lopez, DeMarcus Cousins, Dwight Howard, Marc Gasol, Robin Lopez, and Tyson Chandler.
In our analysis, we will focus on DeMarcus Cousins, Al Horford, and Robin Lopez, as these three centers exhibit three distinct trends that we also see in the other centers we analyzed. All centers considered share many of the same characteristics in our analysis as well.

Summary Statistics

Even simple analysis and statistics can give us some insight into how NBA players accrue up to 6 personal fouls over the course of a game. Table 1 gives a few summary statistics. Table 1a gives basic statistics for most of DeMarcus Cousins’ fouls from the 2011-2016 seasons. We can see that, on average, Cousins commits his 1st personal foul after about 500 seconds (or about 8 minutes and 20 seconds) of personal playing time. By contrast, he commits his 4th foul about 300 seconds (or about 5 minutes) of personal playing time after committing his 3rd. Table 1b gives the same statistics for Al Horford, where we see that his 1st foul comes after an average of about 823 seconds, while he commits his 4th foul an average of about 311 seconds after his 3rd. From these numbers, it might appear that Horford is more “tilted,” given that his time between fouls shrinks more than Cousins’s.

Table 1: Summary statistics for Cousins, Horford, Lopez and all centers pooled. Gives the average time to each foul by number, the number of games with exactly that many fouls, and the number of games with at least that many fouls. Cousins had 11 games with only 1 foul, but 115 with 5 or more.

However, the tables also show that Horford had 80 games in which he only recorded a single foul and only 37 games where he recorded 4 or more fouls. By contrast, Cousins had only 11 games with a single foul and 193 games with 4 or more. Because games often end before a player commits all six fouls, many of the foul times are right censored by the end of the game. These foul times are not included in simple summary statistics and therefore merely examining the average time to foul does not accurately reflect all the differences between players or how those players individually accrue fouls.

Visualization – Survival Curves

Next, we visualize foul rates for each player using Kaplan-Meier survival curves. A survival curve, in general, maps the length of time that elapses before an event occurs. Here, the curves give the probability that a player has “survived” to a certain time without committing a particular foul number. These curves are useful for understanding how a player accrues fouls while accounting for the total length of time during which a player is followed, and they allow us to compare how the different fouls are accrued.
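For readers unfamiliar with the estimator, here is a minimal pure-Python sketch of the Kaplan-Meier calculation (in practice you would use a statistical package, e.g. the survival package in R or lifelines in Python):

```python
def kaplan_meier(times, observed):
    """Kaplan-Meier survival estimate.

    times:    time to foul (or to censoring) for each player-game
    observed: True if the foul occurred, False if censored by game end
    Returns (t, S(t)) pairs at each time where a foul occurred.
    """
    data = sorted(zip(times, observed))
    n_at_risk = len(data)
    survival, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        # fouls observed at this time, and everyone leaving the risk set
        events = sum(1 for tt, obs in data if tt == t and obs)
        at_t = sum(1 for tt, _ in data if tt == t)
        if events:
            survival *= 1 - events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= at_t
        i += at_t
    return curve

# Three games: fouls at 300s and 900s, one game censored at 600s
curve = kaplan_meier([300, 600, 900], [True, False, True])
```

Note how the censored game still contributes to the risk set before 600 seconds; that is exactly the information a raw density plot throws away.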

Figure 1: Survival curves for Cousins, Horford, and Lopez. Displays the probability that a player has “survived” to a certain time without committing a particular foul by number. These curves include fouls that are censored by the game ending, which obscures some of the patterns.

Figure 1a gives the overall survival curves for Cousins. From the graph, there appears to be some evidence that his time to foul decreases as he accrues fouls, because there is layering between the fouls. While the trend may seem small, it is much starker than that for other centers, as we can see in Figures 1b for Al Horford and 1c for Robin Lopez. Their curves appear much more random. The survival curve for Al Horford’s 6th foul seems abnormal, along with Robin Lopez’s to a lesser extent. This abnormality is likely explained by the small sample sizes for 6 fouls, as seen in Table 1.

I’d like to note here that it is important to use survival curves in this scenario as it accounts for censoring. If we were to just look at the densities of fouls for a given player, we might falsely see a very different trend. Figure 2 shows raw foul densities for DeMarcus Cousins, and there is a clear ordering for the fouls.

Figure 2: Foul densities for DeMarcus Cousins. From this graph it would appear that there is a big difference in foul times. However, these graphs do not take censoring into account.

As mentioned above, if games were infinitely long, and players continued to play, we would observe every player until he committed his 6th personal foul and was removed from the game. As games are of finite length, many fouls are censored due to the end of follow-up time. Therefore, it makes sense that the 5th and 6th fouls would be subject to sampling bias. For example, if the 5th foul is committed with 4 minutes left in the game, we will never observe a 6th foul that comes 5 minutes later. To help adjust for this censoring, we considered limiting the analysis to only games where all 6 fouls were committed. However, this limitation severely restricts the sample sizes for all players. Instead, we will examine games where each player committed a minimum of five fouls and limit our analysis to the first four fouls. This restriction gives us a larger sample size, though it prevents us from learning how players accrue their 5th and 6th fouls.

Figure 3: Survival curves for the first 4 fouls for Cousins, Horford, and Lopez when the player commits a minimum of 5 fouls. Cousins has a clear ordering to how he commits these fouls given that his time to foul decreases as he accrues fouls. Horford does not display this trend.

Figures 3a, 3b, and 3c show the 5 foul minimum survival curves for Cousins, Horford, and Lopez. Cousins displays much clearer ordering, where the more fouls he accrues, the more likely he is to foul. However, Horford and Lopez show much less distinction between the fouls. Lopez shows some ordering, especially for his 4th foul, but Horford’s curves are fairly random.

Of course, we are not controlling for nearly enough variables, and the sample size is sadly limited. Areas for further research will be discussed in full later. However, we now have a nice way to model and visualize fouls so we can understand them better moving forward.

Part 3 of this series can be found here 

Catch & Shoot – Basics

I’ve been interested in catch & shoot (C&S) jump shots for a while now, pretty much ever since I read this article by Stephen Shea. There is this idea that a C&S is a better shot than a pull-up. And based on everything I’ve read and analyzed, this holds true. This led me to ask: what is it about a C&S that makes it such a superior shot?

To answer this question, I started with the NBA definition of catch & shoot, namely “Any jump shot outside of 10 feet where a player possessed the ball for 2 seconds or less and took no dribbles.” Let’s break this definition down a little.
– A shot past 10 feet is further from the basket, which I would think decreases the likelihood of success. However, we can assume that when comparing to pull-up jumpers, the comparison group is past 10 feet as well. When the time comes to analyze data, we will just have to restrict to shots past 10 feet. No problem.
– Taking no dribbles means a player doesn’t need to take time to collect the ball or fight the momentum of movement. It makes sense that not having to dribble would increase the chance of making a basket.
– On the surface, possessing the ball for 2 seconds or less seems like it would give a player less time to set himself and shoot. Therefore I would assume a short possession time would decrease the probability of success. However, a shorter possession means that the defense has less time to get in position as well. And this is where I get irked about the definition of catch & shoot.

When I began to look at C&S vs pull-up jumpers, I hypothesized that the effect of the defense was confounding the effect of a C&S. I still thought a C&S shot would be better than a pull-up, but without controlling for defense, I was skeptical of how much better.

Let’s take a step back and define some terms. If we think of a C&S as a “treatment” (my education and training are in public health, so I often default to medical terms), the “control” being a pull-up jumper, and shot success as the outcome, then our goal is to examine the treatment effect of a C&S on the outcome. We can also say that the type of shot is the independent variable and success of the shot is the dependent variable. But shots are not randomized to be C&S or pull-up, so just looking at raw numbers won’t necessarily give us the full picture. A “confounder” is any variable that affects both the treatment and the outcome. Defense is very likely a confounder for any shot, as a player is more likely to take and more likely to make a wide-open shot. Similarly, he is less likely to take and less likely to make a highly contested shot.

The NBA has a really great statistics site; however, it doesn’t get to the granularity that I want. Thankfully, another site does. I pulled 50,000 shots from the 2014-2015 season. I chose that season because, starting in 2016, defender data isn’t available. I also restricted to players who took at least 100 jump shots. I ended up with 50,000 shots (I assume this size is preset), which I then further restricted to the 38,384 jump shots that were taken from beyond 10 feet.

Of these 38,384 shots, 21,917 had zero dribbles and a possession time of 2 seconds or less, and thus were defined as catch & shoot. The remaining 16,467 shots were labeled pull-up shots.

One more definition: for the purposes of this analysis (and future analyses), I consider any shot where the closest defender is more than 4 feet away to be an “open” shot. All other shots are considered “defended.”
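Putting the definitions together, labeling a shot is just a pair of threshold checks. A quick sketch (the field names here are hypothetical, not the raw data's column names):

```python
def classify_shot(dribbles, touch_time, defender_dist, shot_dist):
    """Label a jump shot using the definitions above.

    Returns (shot_type, openness), or None for shots inside 10 feet,
    which are excluded from the analysis.
    """
    if shot_dist <= 10:
        return None
    # NBA catch & shoot: no dribbles, possessed 2 seconds or less
    shot_type = "C&S" if dribbles == 0 and touch_time <= 2 else "Pull-Up"
    # my definition: open means closest defender more than 4 feet away
    openness = "Open" if defender_dist > 4 else "Defended"
    return shot_type, openness
```

For example, `classify_shot(0, 1.5, 6.1, 23.8)` gives an open C&S, while a contested drive-and-pull-up like `classify_shot(3, 4.0, 2.2, 18.0)` comes back defended.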

Here are some basic statistics:

Counts    Defended   Open
C&S       4049       17868
Pull-Up   7468       8999

Percent   Defended   Open
C&S       10.54%     46.55%
Pull-Up   19.46%     23.44%

Most shots are open C&S. The majority of C&S shots are open, and the majority of open shots are C&S.

I could split this out further and look at 2pt shots vs 3 pt shots, but let’s skip that for now and put it into a simple logistic regression modeling the probability a shot is successful. We can include an indicator for whether or not a shot was a 3pt attempt:

            Estimate   Std. Error   z value   Pr(>|z|)
Intercept   -0.5467    0.0175       -31.23    0.0000
C&S          0.2955    0.0233        12.70    0.0000
3pt         -0.2727    0.0230       -11.88    0.0000
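Since these coefficients are log odds, they are easier to digest once converted back to predicted probabilities. A quick sketch using the coefficients above:

```python
import math

def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1 / (1 + math.exp(-x))

# Coefficients from the model above
intercept, b_cs, b_3pt = -0.5467, 0.2955, -0.2727

pull_up_2pt = inv_logit(intercept)                 # ~36.7% predicted FG%
cs_2pt      = inv_logit(intercept + b_cs)          # ~43.8%
cs_3pt      = inv_logit(intercept + b_cs + b_3pt)  # ~37.2%
```

The predicted probabilities line up closely with the raw FG% splits, which is what we'd expect from a model with only these two indicators.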

Now let’s control for whether a shot was open or not:

            Estimate   Std. Error   z value   Pr(>|z|)
Intercept   -0.6305    0.0217       -29.06    0.0000
C&S          0.2597    0.0239        10.88    0.0000
open         0.1626    0.0245         6.62    0.0000
3pt         -0.2927    0.0232       -12.64    0.0000

Side note: I try not to pay too much attention to p-values, preferring to focus on effect sizes.

We see that C&S does increase the chance of success, though the effect is mitigated when we also control for whether or not a shot was open. I also fit the models with various interactions, but they had little impact (in fact the interaction between C&S and open was negative, though small).

This is why I am always a little irked by the definition of a catch & shoot – it doesn’t account for how open the player is. A player with a high C&S FG% who mostly takes open shots should be evaluated differently than a player with a similarly high C&S FG% who takes mostly defended shots. Yet raw C&S FG% or C&S EFG% obscures this difference.

We are just getting started here. After all, so far I’ve only presented simple regressions with some (confusing and hard to interpret) log odds ratio coefficients. In future posts I will get into how we can estimate the causal effect of a catch & shoot not only for raw field goal percentage, but also for effective field goal percentage, which gives a bonus to three point shots.

Part 2 of this series can be found here

NBA Fouls – I Love DeMarcus Cousins

This is part 1 of my series on DeMarcus Cousins and how NBA players accrue personal fouls.


If you’ve ever talked about the NBA with me for any significant amount of time, you will know that one of my favorite players is DeMarcus Cousins (my favorite player being, of course, Shaun Livingston). I’ve always liked Boogie for the usual reasons related to his abilities on the court, of course, but also because he has been the focus and inspiration of much of my analytics work for the past year. Before I get into all the details and numbers, I’d like to share the story of how it all came to pass.

I am from Berkeley, California. As such, my home team is the Golden State Warriors. Last school year, during my Christmas break I was able to attend the December 28th 2015 matchup between the Warriors and Sacramento Kings at Oracle Arena. Many people remember that game because the end of the first half featured a three-point shoot out between Stephen Curry and Omri Casspi:

It was incredibly exciting and made for a close game.

A few months later, at the 2016 Sloan Sports Analytics Conference, that sequence came up in a conversation with my friend who was also in attendance.  I mentioned that I was at that game, but that the end of the first half wasn’t what I remembered most about the game. What I remember is this:

It was my first time seeing Cousins get ejected and even from my seat in the upper bowl, I could feel how frustrated and upset he was about the whole ordeal.

My friend commented that “if only Boogie could be, like, 15% less angry – he would be the most dominant player in the game.” Which of course got me thinking – how *would* you quantify how mad DeMarcus Cousins is at any given time?

A potential answer presented itself several months later at the 2017 Joint Statistical Meetings where I attended a session titled “For the Love of the Game: Applications of Statistics in Sports.” In that session, Douglas VanDerwerken presented “Does the Threat of Suspension Curb Dangerous Behavior in Soccer? A Case Study from the Premier League.” This paper (which can be found here for interested readers) showed that as EPL players approach the yellow card limit, and thus face suspension, they are less likely to foul.

Thinking back to that December 28th game, and many additional Kings games I have watched, it seemed to me that Cousins would get heated and “tilted” and play more aggressively, and therefore foul more often. I hypothesized that the more fouls Cousins committed, the more likely he was to commit another.

He does. But he’s not the only one.

I’ll get into the math/stats in a later post, but here is a general idea of how we can think about this problem. Given there is a fixed amount of time that a given player is on the court, we might expect fouls to follow a Poisson arrival process, with inter-arrival times following an Exponential distribution and each foul independent of the previous fouls. We can consider a survival model and look at the “failure time” for each foul – in other words, the time it takes a player to commit his 1st foul, 2nd foul, etc. If, for example, the time between the 2nd and 3rd fouls is significantly longer than the time between the 4th and 5th, we would have evidence of some sort of “tilt.” We can model foul rates using a conditional risk set model for ordered events and do some analysis with a stratified Cox model. From there, we can try to identify whether there are any actions a coach/team can take in order to mitigate increased fouling rates.
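To make the Poisson/Exponential setup concrete, here is a minimal simulation sketch. The rates are made up for illustration; under the "tilt" hypothesis, the rates would increase with foul count rather than stay constant:

```python
import random

def simulate_foul_times(rates, horizon=2880.0, seed=0):
    """Simulate one player-game of foul times.

    rates[k] is the fouling rate (fouls per second of play time) after
    the k-th foul. Inter-arrival times are Exponential; any foul that
    would land past the game horizon (48 min = 2880 s) is censored.
    """
    rng = random.Random(seed)
    t, fouls = 0.0, []
    for rate in rates:
        t += rng.expovariate(rate)  # exponential inter-arrival time
        if t > horizon:
            break  # this foul (and all later ones) never observed
        fouls.append(t)
    return fouls

# An even-keeled player: the same rate (1 foul per 500 s) for all 6 fouls
fouls = simulate_foul_times([1 / 500] * 6, seed=42)
```

Comparing simulated inter-arrival times under constant rates against the observed ones is exactly the kind of check the stratified Cox model formalizes.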

I’ll save the details for later.

Part 2 of this series can be found here 




Every time I attend a talk about sports statistics, or read an interview, or talk to people in the field, inevitably the question of “how do I break into the industry” arises. And inevitably, the answer includes “put your stuff out there.”
Per that advice, this site will be my chance to share the work I have done in sports analytics.

Though to be fair, I was sorely tempted to purposefully *not* make a blog/site in order to act as a control of sorts. But I really did want to share some of the fun things I’ve been working on and solicit ideas for how to improve.

This site will evolve over time – depending on how focused on my dissertation I am at any given time. It will focus mostly on sports statistics/analytics, but really anything is fair game.