Kathy Explains all of Statistics in 30 Seconds and “How to Succeed in Sports Analytics” in 30 Seconds

I spent the weekend of October 19-21 in Pittsburgh at the 2018 CMU Sports Analytics Conference. One of the highlights of the weekend was Sam Ventura asking me to explain causal inference in 15 seconds. I couldn’t quite do it, but it morphed into trying to explain all of statistics in 30 seconds. Which I then had to repeat a few times over the weekend. Figured I’d post it so people can stop asking. I’m expanding slightly.

Kathy Explains all of Statistics in 30 Seconds

Broadly speaking, statistics can be broken up into three categories: description, prediction, and inference.

  • Description
    • Summaries
    • Visualizations
  • Prediction
    • Mapping inputs to outputs
    • Predicting outcomes and distributions
  • Inference/Causal Inference
    • Prediction if the world had been different
    • Counterfactual/potential outcome prediction

I’ll give an example in the sports analytics world, specifically basketball (this part is what I will say if I only have 30 seconds):

  • Description
    • Slicing your data to look at the distribution of points per game (or per 100 possessions or whatever) scored by different lineups
  • Prediction
    • Predicting the number of points your team will score in a game given your planned lineups
  • Inference/Causal Inference
    • Prediction of change in points per game if you ran totally new lineups versus the normal lineups

My day job is working for a tech healthcare company, and the following are the examples I normally use in that world:

  • Description
    • Distributions of patient information for emergency department admissions stratified by length of stay
  • Prediction
    • Predicting length of stay based on patient information present on admission
  • Inference/Causal Inference
    • Prediction of change in length of stay if chest pain patient had stress test vs having cardiac catheterization

So, it’s not *all* of statistics. But I think it’s important to understand the different parts of statistics. They have different uses and different interpretations.

More thoughts from the conference

Any time I am at a sports conference there is always the question of “how does one succeed in/break into the field?” Many others have written about this topic, but I’ve started to see a lot of common themes. So….

How to Succeed in Sports Analytics in 30 Seconds

Success in sports analytics/statistics seems to require these 4 abilities:

  • Domain expertise
  • Communication
  • Statistics
  • Coding/programming/CS type skills

Imagine that each area has a max of 10 points. You gotta have at least 5/10 in every category and then like, at least 30 points overall. Yes I am speaking very vaguely. But the point is, you don’t have to be great at everything, but you do have to be great at something and decent at everything.

I don’t feel like I actually know that much about basketball or baseball, or any sport really. I didn’t play any sport in college, and generally when I watch games, I’m just enjoying the game. While watching the Red Sox in the playoffs I don’t really pay attention to the distribution of  David Price’s pitches, I just enjoy watching him pitch. Hell, I spend more time wondering what Fortnite skins Price has. I’ve been guessing Dark Voyager, but he also seems like the kind of guy to buy a new phone just to get the Galaxy skin. Anyway. I’m not an expert, but I do know enough to talk sensibly about sports and to help people with more expertise refine and sharpen their questions.

And I know statistics. And years of teaching during graduate school helped me get pretty damn good at explaining complicated statistical concepts in ways that most people can understand. Plus I can code (though not as well as others). Sports teams are chock full of sports experts, they need experts in other areas too.

These four skills are key to succeeding in any sort of analytical job. I’m not a medical expert, but I work with medical experts in my job and complement their skills with my own.

Concluding thoughts from the conference

Man, no matter what a talk is about, there’s always the questions/comments of “did you think about this other variable” (yes, but it wasn’t available in the data), “could you do it in this other sport…” (is there data available on that sport?), “what about this one example when the opposite happened?” (-_-), “you need to be clearer about how effects are mediated downstream, there’s no way this is a direct effect even if you’ve controlled for all the confounding” (ok that one’s usually me), etc.

Next time, we are going to make bingo cards.

 

Some Boston Marathon Numbers

I was enjoying the third quarter of the tight Raptors vs Wizards game on Sunday night when my coworker sent me this article and the accompanying comments on the Boston Marathon:

Oh my. This article makes me disappointed. So let’s skip Cavs/Pacers and Westworld and dig in.

Introduction

On the surface it feels like the article is going to have math to back up the claim that “men quit and women don’t.” It has *some:*

But finishing rates varied significantly by gender. For men, the dropout rate was up almost 80 percent from 2017; for women, it was up only about 12 percent. Overall, 5 percent of men dropped out, versus just 3.8 percent of women. The trend was true at the elite level, too.

And some attempt to examine more than just the 2018 race:

But at the same race in 2012, on an unusually hot 86-degree day, women also finished at higher rates than men, the only other occasion between 2012 and 2018 when they did. So are women somehow better able to withstand extreme conditions?

But that’s it. No more actual math or analyses. Just some anecdotes and attempts to explain biological or psychological reasons for the difference.

Let’s ignore those reasons (controversial as they may be) and just look at the numbers.

Analysis

The metrics used are ill-defined. There is mention of how the midrace dropout rate was up 50 percent overall from last year, but no split by gender. As quoted above, the finishing rates varied significantly by gender, but no numbers are given. Only the overall dropout rates are reported. What does overall dropout rate mean? I assume it is a combination of runners who dropped before the race began plus those who dropped midrace. And then the overall dropout rates are 3.8% for women and 5% for men. But the splashy number is that men dropped out 80% more than last year whereas women only dropped out 12% more. Is… is that right? I’ve already gone cross-eyed. The whole thing reeks of hacking and obscures the meaning.

There are a lot of numbers here. Some are combined across genders. Some are overall rates, some are midrace. Some are differences across years.

Frustrated with the lack of numbers in the article, I went looking for the actual numbers. I found the data on the official website. I wish it had been linked in the article itself…

2018

Runners:

CATEGORY   NUMBER ENTERED   NUMBER STARTED   NUMBER FINISHED   PERCENT FINISHED (of starters)
all        29,978           26,948           25,746            95.50%
male       16,587           14,885           14,142            95.00%
female     13,391           12,063           11,604            96.20%

Now we can do some proper statistics.

First, we can perform an actual two sample test and construct confidence intervals to see if there was a difference in finishing rates between genders.

For those who entered the race, the 95% confidence interval for the difference in percent finished between males and females was (-0.022, -0.006).

For those who started the race, the 95% confidence interval for the difference in percent finished between males and females was (-0.017, -0.007).

The difference is technically significant, but not at all interesting. And that is ignoring the fact that we shouldn’t really care about p-values to begin with.

But the article mentions dropout rate, not finishing rate, so let’s use that metric:

Of those who started the race, about 5% of males and 3.8% of females dropped out.

For those who started the race, the 95% confidence interval for the difference in percent dropout between males and females was (0.0069, 0.0168).
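For what it’s worth, here is a minimal R sketch of how that last interval can be reproduced from the 2018 table above. prop.test gives the two-sample test and a confidence interval for the difference in proportions (the exact endpoints shift slightly depending on whether the continuity correction is applied):

```r
# 2018 starters and finishers, from the table above
started  <- c(male = 14885, female = 12063)
finished <- c(male = 14142, female = 11604)
dropped  <- started - finished

# Two-sample test and 95% CI for the difference in dropout proportions
prop.test(dropped, started, correct = FALSE)

# Swapping in `finished` for `dropped` gives the finishing-rate comparison
```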

So yes, there is a significant difference. But with these kinds of sample sizes, it’s not surprising or interesting to see a tiny significant difference.

But what about 2017? What about the big change from 2017 to 2018? After all, the main splashy metric is the 80% increase in dropout for men.

2017 (numbers from here)

Runners:

CATEGORY   NUMBER ENTERED   NUMBER STARTED   NUMBER FINISHED   PERCENT FINISHED (of starters)
all        30,074           27,222           26,400            97.00%
male       16,376           14,842           14,431            97.20%
female     13,698           12,380           11,969            96.70%

In 2017, for those who entered the race, the 95% confidence interval for the difference in percent finished between males and females was ( -0.00006, 0.01497).

And in 2017, for those who started the race, the 95% confidence interval for the difference in percent finished between males and females was (0.0013, 0.0097).

Of those who started the race in 2017, about 2.8% of males and 3.3% of females dropped out.

For those who started the race in 2017, the 95% confidence interval for the difference in percent dropout between males and females was ( -0.0097, -0.0013).

So in 2017 women actually dropped out at a slightly higher rate than men, the reverse of what happened in 2018. But the difference is so tiny that… whatever. This isn’t interesting. But at least now there are actual statistics to back up the claim.

But really, there’s not a lot going on here.

And FINALLY, we can look at the differences from 2017 to 2018.

The dropout rate for females increased from ~3.3% to ~3.8% which (using the exact numbers) was an increase of about 14.6% (not the 12% reported in the NYT article). The dropout rate for males increased from ~2.8% to ~5.0% which (using the exact numbers) was an increase of about 80% as reported.
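As a quick sanity check, the year-over-year arithmetic in R (counts taken from the two tables above):

```r
# Dropout rates by year, computed from starters and finishers in the tables above
drop_2017 <- c(male = 14842 - 14431, female = 12380 - 11969) / c(14842, 12380)
drop_2018 <- c(male = 14885 - 14142, female = 12063 - 11604) / c(14885, 12063)

# Relative (year-over-year) increase in the dropout rate
(drop_2018 - drop_2017) / drop_2017
# roughly 0.80 for men and 0.146 for women
```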

At least now I understand where these numbers are coming from.

I still don’t buy it. Using dropout numbers instead of finishing numbers makes ratios much larger. An 80% increase in dropout sounds a lot more impressive than a 2% drop in finishing.

And that’s all before we try to compare to other years that might have also had extreme weather. If I had more time or interest I might look at the temperature, humidity, wind speed, wind direction etc for the past 20+ marathons. And then look at differences in dropout/finishing rate for men and women while controlling for weather conditions. That sort of analysis still probably wouldn’t convince me, but it would get closer.
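If I did, the comparison would probably boil down to something like the logistic regression below. To be clear, race_data and its columns are hypothetical names for data I have not actually pulled, so this is a sketch of the adjustment, not an analysis:

```r
# Hypothetical data frame race_data: one row per starter across many years,
# with dropped (0/1), gender, and per-year weather covariates.
fit <- glm(dropped ~ gender * temp + humidity + wind_speed + factor(year),
           family = binomial, data = race_data)
summary(fit)

# The gender:temp interaction is the term of interest here:
# does the gender gap in dropout widen as conditions get more extreme?
```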

Conclusion

This article is really frustrating. There are just enough scraps of carefully chosen numbers to make differences seem bigger than they really are. Comparing dropout rates to finishing rates is a bit hacky, and then comparing just two years (as opposed to many) gets even hackier. There’s an interesting hypothesis buried in the article and the data. And if we were to pull data on many marathons, we might get closer to actually being able to test if dropout rates vary by gender according to conditions. But the way the data is presented in the article obscures any actual differences and invites controversy. Audiences are eager for guidance with statistics and math. Tossing around a few numbers without explaining them (or giving a link to the source…) is such poor practice.

That Other Site I Work On

This site has been sparse lately and it is because I’ve been busy with two other projects.

The first is my actual day job. I finished my PhD in May of 2017 and began working at Verily Life Sciences in August of 2017. Did I turn down some jobs with pro teams? Yes. Yes I did. Why? That’s a story for another day. I like what I do at Verily. I get to have fun, with people I like, working on cool healthcare projects. Plus we work out of the Google offices in Cambridge which are very nice and full of free food and fun toys.

The second project I’ve been working on is the visualizations section of Udam Saini’s EightThirtyFour.

http://eightthirtyfour.com/visualizations

Udam and I worked together on this site’s NBA foul project, which started as an attempt to quantify how mad DeMarcus Cousins gets in games. We built survival models and visualizations to examine how players accrue fouls. But these models can just as easily be applied to assists, blocks etc. In fact, I took the ideas and examined how Russell Westbrook accrued assists in his historic triple-double season. By using survival models, we can see how the time between assists increased significantly after he reached 10 assists in a game. This could be seen as evidence in favor of stat padding.

The tool we’ve built on the site linked above allows you to look at survival visualizations and models for pretty much any player in seasons between 2011 and 2017. The stats primer linked in the first line has more explanation and some suggestions for players and stats to look at.

Survival analysis models and visualizations are not always the easiest to explain, but I think there is value in having other ways to analyze and examine data. Survival analysis can help us better understand things like fatigue and stat padding. And can help add some math to intangible things like “tilt.”

This project was also a lesson in working on a problem with a proper software engineer. I am a statistician and I’m used to a certain amount of data wrangling and cleaning, but I largely prefer to get data in a nice data frame and go from there. And I certainly don’t have the prowess to create a cool interactive tool on a website that blends SQL and R and any number of other engineer-y things. Well. I’d like to think I could, but it would take ages and look much uglier. And be slower. Conversely, my partner in crime Udam probably can’t sort through all the statistics and R code as fast as I can. My background isn’t even in survival analysis, but I still understand it better than a SWE. So this part of his site was a chance for us to combine powers and see what we could come up with. In between our actual Alphabet jobs, of course.

I think in the world of sports analytics, it’s hard to find somebody who has it all: excellent software engineering skills, deep theoretical knowledge of statistics, and deep knowledge of the sport (be it basketball or another sport). People like that exist, to be sure, but they likely already work for teams or are in other fields. I once tried to be an expert in all three areas and it was very stressful and a lot of work. Once I realized that I couldn’t do it all by myself and started looking for collaborations, I found that I was able to really shine in my expert areas and have way more fun with the work I do.

The same is true in any field. I wasn’t hired by Verily to be a baller software engineer *and* an expert statistician *and* have a deep understanding of a specific health care area. I work with awesome healthcare experts and engineers and get to focus just on my area of expertise.

In both my job and my side sports projects my goal is always to have fun working on cool problems with people I like. It’s more fun to be part of a team.

Anyway, have fun playing with the site, and if you have any suggestions, let us know :]

Russell Westbrook and Assists

Introduction

I was going to flesh this idea out and refine it for a proper paper/poster for NESSIS, but since I have to be in a wedding that weekend (sigh), here are my current raw thoughts on Russell Westbrook. I figured it was best to get these ideas out now …  before I become all consumed by The Finals.

I’ve been thinking a lot about Russell Westbrook and his historic triple-double season. Partially I’ve been thinking about how arbitrary the number 10 is, and how setting 10 to be a significant cutoff is similar to setting 0.05 as a p-value cutoff. But also I have been thinking about stat padding. It’s been pretty clear that Westbrook’s teammates would let him get rebounds, but there’s also been a bit of a debate about how he accrues assists. The idea being that once he gets to 10, he stops trying to get assists. Now this could mean that he passes less, or his teammates don’t shoot as much, or whatever. I’m not concerned with the mechanism, just the timing. For now.

I’ll be examining play-by-play data and box-score data from the NBA for the 2016-2017 season. This data is publicly available from http://www.nba.com. The play-by-play contains rich event data for each game. The box-score includes data for which players started the game, and which players were on the court at the start of a quarter. Data, in csv format, can be found here.

Let’s look at the time to assist for every assist Westbrook gets and see if it significantly changes for assists 1-10 vs 11+. I thought about looking at every assist by number and doing a survival analysis, but soon ran into problems with sparsity and granularity. Westbrook had games with up to 22 assists, so trying to look at them individually got cumbersome. Instead I decided to group assists as follows: 1-7, 8-10 and 11+. I reasoned that Westbrook’s accrual rate for the first several assists would follow one pattern, which would then increase as he approached 10, and then taper off for assists 11+.

I freely admit that may not be the best strategy and am open to suggestions.

I also split out which games I would examine into 3 groups: all games, games where he got at least 11 assists, and games where he got between 11 and 17 assists. This was to try to account for right censoring from the end of the game. In other words, when we look at all games, we include games where he only got, say, 7 assists, and therefore we cannot hope to observe the difference in time to assist 8 vs assist 12. Choosing to cut at 17 assists was arbitrary and I am open to changing it to fewer or more.

Basic Stats

Our main metric of interest is the time between assists, i.e. how many seconds of player time (so time when Westbrook is on the floor) occur between assists.

First, let us take a look at some basic statistics, where we examine the mean, median, and standard deviation for the time to assist broken down by group and by the different sets of games. Again, this is in seconds of player time.
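Something like the dplyr snippet below would produce that kind of summary. To be clear, assist_gaps and its columns are hypothetical names I’m using for illustration, not fields from the raw play-by-play:

```r
library(dplyr)

# Hypothetical data frame: one row per assist, with gap_seconds = on-floor
# seconds since Westbrook's previous assist, assist_num = the assist's number
# within the game, game_id, and game_assists = his total in that game.
assist_gaps <- assist_gaps %>%
  mutate(group = cut(assist_num, breaks = c(0, 7, 10, Inf),
                     labels = c("1-7", "8-10", "11+")))

# Basic summaries, e.g. for subset (1b): games with at least 11 assists
assist_gaps %>%
  filter(game_assists >= 11) %>%
  group_by(group) %>%
  summarise(mean   = mean(gap_seconds),
            median = median(gap_seconds),
            sd     = sd(gap_seconds))
```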

[Table 1: mean, median, and standard deviation of seconds between assists, by assist group (1-7, 8-10, 11+) and game subset (1a: all games, 1b: games with 11+ assists, 1c: games with 11-17 assists)]

We can see that if we look at all games, it appears that the time between assists goes down on average once Westbrook gets past 10 assists. However this sample of games includes games where he got upwards of 22 assists, which, given the finite length of games, means assists would tend to happen more frequently. Limiting ourselves to games with at least 11 assists, or games with 11-17 assists gives a view of a more typical game with many assists. We see in (1b) and (1c) that time to assist increases on average once Westbrook got his 10th assist.

However, these basic statistics only account for assists that Westbrook actually recorded; they do not account for any right censoring. That is, say Westbrook gets 9 assists in the first half alone and doesn’t record another assist all game despite playing, say, 20 minutes in the second half. If the game were to go on indefinitely, Westbrook would eventually record that 10th assist, say after 22 minutes. But since we never observe that hypothetical 10th assist, that contribution of 22 minutes isn’t included. Nor is even the 20 minutes of assist-less play. This basic censoring problem is why we use survival models.

Visualization

Next we can plot Kaplan-Meier survival curves for Westbrook’s assists broken down by group and by the different sets of games. I used similar curves when looking at how players accrue personal fouls – and I’ll borrow my language from there:

A survival curve, in general, is used to map the length of time that elapses before an event occurs. Here, the curves give the probability that a player has “survived” to a certain time without recording an assist (grouped as explained above). These curves are useful for understanding how a player accrues assists while accounting for the total length of time during which a player is followed, and allow us to compare how different assists are accrued.
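A minimal sketch of those curves with the survival package, reusing the hypothetical assist_gaps frame from the earlier snippet, now with an assumed observed indicator (0 marks a censored gap, i.e. the game ended or Westbrook sat before the next assist came):

```r
library(survival)

# Kaplan-Meier curves for time between assists, by assist group
km <- survfit(Surv(gap_seconds, observed) ~ group, data = assist_gaps)
plot(km, col = 1:3, lty = 1,
     xlab = "Seconds of player time since last assist",
     ylab = "Probability of not yet recording the next assist")
legend("topright", legend = c("1-7", "8-10", "11+"), col = 1:3, lty = 1)
```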

[Figure: Kaplan-Meier survival curves for time between assists, by assist group (1-7, 8-10, 11+) and game subset]

Here it is very easy to see that the time between assists increases significantly once Westbrook has 10 assists. This difference is apparent regardless of which subset of games we look at, though the increase is more pronounced when we ignore games with fewer than 11 assists. We can also see that the time between assists doesn’t differ significantly between the first 7 assists and assists 8 through 10.

Survival Models

Finally, we can put the data into a conditional risk set model for ordered events. I’m not sure this is the best model to use for this data structure, given that I grouped the assists, but it will do for now. I recommend not looking at the actual numbers and just noticing that yes, there is a significant difference between the baseline and the group of 11+ assists.

[Table 2: conditional risk set model estimates for assist groups 8-10 and 11+ versus the 1-7 baseline, by game subset]

If interested, we can find the hazard ratios associated with each assist group. To do so we exponentiate the coefficients, since each coefficient is the log comparison with respect to the baseline of the 1st through 7th assists. For example, looking at the final column, we see that, in games where Westbrook had between 11 and 17 assists, he was 63% less likely to record an assist greater than 10 than he was to record one of his first 7 assists (the baseline group). Interpreting coefficients is very annoying at times. The takeaway here is that yes, there is a statistically significant difference.
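For reference, a rough version of that fit with the survival package, again using the hypothetical assist_gaps frame. The grouping makes the setup a bit nonstandard, as noted above, so treat this as one plausible way to code it rather than the definitive model:

```r
library(survival)

# Gap-time Cox model with the assist group as the covariate (1-7 is the
# baseline) and games as clusters, fit on games with 11-17 assists
fit <- coxph(Surv(gap_seconds, observed) ~ group + cluster(game_id),
             data = subset(assist_gaps, game_assists >= 11 & game_assists <= 17))
summary(fit)

# Hazard ratios: exponentiate the log-hazard coefficients. A ratio around
# 0.37 for the 11+ group is the "63% less likely" reading in the text.
exp(coef(fit))
```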

Discussion

Based on some simple analysis, it appears that the time between Russell Westbrook’s assists increased once he reached 10 assists. This may contribute to the narrative that he stopped trying to get assists after he reached 10. Perhaps this is because he stopped passing, or perhaps it’s because his teammates just shot less effectively on would-be-assisted shots after 10. Additionally, there are many other factors that could contribute to the slowdown in assist accrual. Perhaps there is general game fatigue, and assist rates drop off for all players. Maybe those games were particularly close in score and therefore Westbrook chose to take jump shots himself or drive to the basket.

What’s great is that a lot of these ideas can be explored using the data. We could look at play by play data and see if Russ was passing at the same rates before and after assist number 10. We could test if assist rates decline overall in the NBA as games progress. I’m not sure which potential confounding explanations are worth running down at the moment. Please, please, please, let me know in the comments, via email, or on Twitter if you have any suggestions or ideas.

REMINDER: The above analysis is something I threw together in the days between my graduation celebrations and The Finals starting and isn’t as robust or detailed as I might like. Take with a handful of salt.

Catch & Shoot – Causal Effects

This is Part 3 of my series on Catch & Shoot jumpers
Part 1 can be found here
Part 2 can be found here

“You know, for having the name ‘Causal Kathy’ you don’t seem to post that much causal inference” – my beloved mentor giving me some solid life advice.

So alright, let’s talk about causal inference. First of all, what is causal inference? It’s a subfield of statistics that focuses on causation instead of just correlations. Think of this classic xkcd comic:

[xkcd comic: “Correlation”]

To give a more concrete example, let’s think about a clinical trial for some new drug and introduce some terminology along the way. Say we develop a drug that is supposed to lower cholesterol levels. In a perfect world, we could observe what happens to a patient’s cholesterol levels when we give him the drug and when we don’t. But we can never observe both outcomes in a world without split timelines and/or time machines. We call the two outcomes, under treatment and no treatment, “potential outcomes” or “counterfactuals.” The idea is that both outcomes have the potential to be true, but we can only ever observe one of them, so the other would be “counter to the fact of the truth.” Therefore, true, individual causal effects can never be calculated.

However, we can get at average causal effects across larger populations. Thinking of our drug example, we could enroll 1,000 people in a study and measure their baseline cholesterol levels. Then we could randomly assign 500 of those subjects to take the drug and the remaining 500 to a placebo. After a given follow-up period, we would measure everyone’s cholesterol levels again and compute each subject’s change from baseline. Then we could calculate the average change in each group and compare them. And since we randomized treatment assignment, we would have a good idea of the causal effect of the drug treatment on the outcome of cholesterol.
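In code, that comparison is nothing more than a difference in group means. A toy sketch with simulated numbers (made up purely for illustration):

```r
set.seed(1)
n <- 1000
treated <- sample(rep(c(1, 0), each = n / 2))   # randomize 500 drug / 500 placebo
# Simulated change in cholesterol from baseline; here the drug lowers it by 10
change  <- rnorm(n, mean = -10 * treated, sd = 15)

# Estimated average causal effect of the drug on the change in cholesterol
mean(change[treated == 1]) - mean(change[treated == 0])
t.test(change ~ treated)
```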

However, we can’t always randomize. In fact, it’s rare that we can. There are many reasons why a well-controlled randomized trial may be impossible. For example, if we wanted to test whether or not smoking causes cancer, it would be unethical to randomly assign 500 subjects to smoke for the rest of their lives and 500 subjects not to smoke and see who develops cancer. Similarly, if we want to know whether or not a catch & shoot (C&S) shot in basketball has a positive effect on FG%, we can’t randomly assign shots to be C&S vs pull-up.

In general, basic statistics and parametric models only give correlations. When you fit a basic regression, the coefficients do not have causal interpretations. This is not to say that regressions and correlations aren’t interesting and useful – if you’ve been following my series on NBA fouls, you’ll note that I don’t do causal inference there. Associations are interesting in their own right. But today, let’s start looking at causal effects for C&S shots in the NBA.

Side note: If you’d like to read more about causal inference in the world of sports statistics, this recent post from Mike Lopez is great.

Analysis

We’re going to be fitting models, so the first thing we’ll do is remove outlier shots. In this case I’m going to subset and only look at shots that were less than 30 feet (so between 10 and 30 feet). All the previous analysis was basic and non-parametric, so including rare shots (like buzzer beater half court heaves) wouldn’t have a large impact. However, now we are using models and making parametric assumptions, so outliers and influential points will have a bigger impact. It makes sense to remove them at this point.

Statistics side note: Part of the reason we need to remove outliers is because our analysis method will involve fitting “propensity scores”: the probability of treatment given the non-outcome variables. In this case we will model the probability that a shot is a C&S given shot distance, defender distance, whether the shot was open, and the shot clock. For unusual outlier shots, the distance will often be abnormally long and it will be rare that the shot is successful. Thus if we left those shots in the data set, we would run into positivity problems.
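Concretely, the propensity model here is just a logistic regression of the treatment (C&S or not) on those variables. A sketch, where shots and its column names are hypothetical stand-ins for my shot-level data frame:

```r
# Hypothetical shot-level data frame: one row per jumper, with catch_shoot (0/1),
# shot_dist, def_dist, open (0/1), shot_clock, and made (0/1)
ps_fit <- glm(catch_shoot ~ shot_dist + def_dist + open + shot_clock,
              family = binomial, data = shots)
shots$pscore <- predict(ps_fit, type = "response")

# Positivity check: shots whose estimated P(C&S) is essentially 0 or 1 are the
# problem cases that trimming the outliers is meant to avoid
summary(shots$pscore)
```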

Also, we aren’t going to look at the average causal effect of a C&S. That estimate assumes that all shots could be a C&S or a pull-up, and the player chose one or the other based on some variables. Most pull-up shots couldn’t have been a C&S, even if the player wanted them to be. But most C&S shots could have been pull-up shots. The player could have elected to hold the ball for longer and dribble a few times before taking the shot. Or even driven to the basket. Of course things are more nuanced. For example, some players (like my beloved Shaun Livingston) will almost never take a 3 point shot, C&S or pull-up. While others (like his teammate Klay Thompson) are loath to pass up a chance to take a C&S. Therefore looking at an average causal effect is not a great strategy. Instead we will look at the effect of a C&S on shots that were C&S shots. In the literature this is called the effect of treatment on the treated (ETT) or sometimes the average treatment effect on the treated (ATT). I use ETT, partially because that is what I was taught and partially because my advisor’s initials are ETT and it makes me giggle.

For those interested, here is a quick primer on the math behind ETT.
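For completeness, the target quantity in potential-outcome notation, with A = 1 for a C&S attempt, Y for whether the shot goes in, and X for the confounders listed below. The second equality is the standard identification under consistency, no unmeasured confounding given X, and positivity; it isn’t specific to the primer linked above:

```latex
\mathrm{ETT} \;=\; E\bigl[\,Y(1) - Y(0) \,\bigm|\, A = 1\,\bigr]
            \;=\; E\bigl[\,Y \,\bigm|\, A = 1\,\bigr]
              \;-\; E\Bigl[\, E\bigl[\,Y \,\bigm|\, A = 0,\, X\,\bigr] \,\Bigm|\, A = 1 \Bigr]
```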

Results

Below we see the effect of C&S on FG% on C&S shots. This effect is calculated for threes and twos. The estimates are computed with functionals that are functions of the observed data; they are not coefficients in a regression. As such, I calculated means, standard errors, and confidence intervals by repeatedly bootstrapping the data (250 times) and using weights randomly drawn from an exponential distribution with mean 1. This is a way to bootstrap the data without having to draw shots with replacement and uses every shot in every resample.

All estimates control for the following potential confounders: shot distance, defender distance, an indicator for whether a shot was open, and the time left on the shot clock. I do not think this is a rich enough set of variables to fully control for confounding, but it’s a decent start.
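Roughly, each bootstrap replicate then looks like the sketch below, continuing with the hypothetical shots frame: draw exponential weights, refit the propensity model, and compare the (weighted) observed FG% on C&S shots to a propensity-weighted estimate of what those same shots would have done as pull-ups. This is one way to compute a plug-in ETT; I’m not claiming it reproduces my exact pipeline.

```r
set.seed(42)
B <- 250
ett_boot <- replicate(B, {
  d <- shots
  d$w <- rexp(nrow(d), rate = 1)               # exponential weights, mean 1

  # Weighted propensity model: P(C&S | X)
  # (quasibinomial just suppresses the non-integer-weights warning)
  ps <- glm(catch_shoot ~ shot_dist + def_dist + open + shot_clock,
            family = quasibinomial, data = d, weights = w)
  d$pscore <- predict(ps, type = "response")

  cs <- d$catch_shoot == 1
  # E[Y | A = 1] minus an estimate of E[Y(0) | A = 1]: pull-ups are re-weighted
  # by pscore / (1 - pscore) so that they resemble the C&S shots
  weighted.mean(d$made[cs], d$w[cs]) -
    weighted.mean(d$made[!cs],
                  d$w[!cs] * d$pscore[!cs] / (1 - d$pscore[!cs]))
})

mean(ett_boot)
quantile(ett_boot, c(0.025, 0.975))
```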

[Table: estimated effect of treatment on the treated (ETT) of catch & shoot on FG%, for 2-point and 3-point shots, with bootstrap means, standard errors, and confidence intervals]

We can see that not dribbling and possessing the ball for less than 2 seconds (the catch & shoot definition) does affect FG%, for both 2 and 3 point shots. The effect size is small but positive, about 0.04. This means that a C&S attempt is about 4 percentage points more likely to be successful than if that same shot were taken as a pull-up. This is a very small causal effect.

To me, this shows that the effect of a C&S isn’t as big as the raw numbers suggest. I calculated a few other measures of causal effects, both for ETT and ATE, but found nothing significant. I’m certain that the modeling assumptions required are not fully met, which may be biasing results towards the null. Were I to move forward on this project, I would dig deeper into the models I’m using and try to get a better understanding of the best way to model and predict both why a player elects to take a certain type of shot and what makes a shot more likely to be successful.

When I set out on this project, I was mostly just upset about the definition of a catch & shoot, since it didn’t take openness into account. I like to think I’ve made my case. If I had to make a change, I’d want the NBA to track open C&S shots as a separate statistic. Maybe even split it out further into twos and threes, or at least emphasize EFG% over FG%. The actual causal effect of a C&S isn’t that big – I’d rather keep track of a stat that does have a big effect.

I’ll probably let this project go for a while. A lot of other people are looking at it and doing a good job of it. I can add the causal component, but I’d rather look at under-examined areas.