I will expand this talk into a larger blog post in the near future.
I was going to flesh this idea out and refine it for a proper paper/poster for NESSIS, but since I have to be in a wedding that weekend (sigh), here are my current raw thoughts on Russell Westbrook. I figured it was best to get these ideas out now … before I become all consumed by The Finals.
I’ve been thinking a lot about Russell Westbrook and his historic triple-double season. Partially I’ve been thinking about how arbitrary the number 10 is, and how setting 10 to be a significant cutoff is similar to setting 0.05 as a p-value cutoff. But also I have been thinking about stat padding. It’s been pretty clear that Westbrook’s teammates would let him get rebounds, but there’s also been a bit of a debate about how he accrues assists. The idea being that once he gets to 10, he stops trying to get assists. Now this could mean that he passes less, or his teammates don’t shoot as much, or whatever. I’m not concerned with the mechanism, just the timing. For now.
I’ll be examining play-by-play data and box-score data from the NBA for the 2016-2017 season. This data is publicly available from http://www.nba.com. The play-by-play contains rich event data for each game. The box-score includes data for which players started the game, and which players were on the court at the start of a quarter. Data, in csv format, can be found here.
Let’s look at the time to assist for every assist Westbrook gets and see if it significantly changes for assists 1-10 vs 11+. I thought about looking at every assist by number and doing a survival analysis, but soon ran into problems with sparsity and granularity. Westbrook had games with up to 22 assists, so trying to look at them individually got cumbersome. Instead I decided to group assists as follows: 1-7, 8-10 and 11+. I reasoned that Westbrook’s accrual rate for the first several assists would follow one pattern, which would then increase as he approached 10, and then taper off for assists 11+.
I freely admit that may not be the best strategy and am open to suggestions.
I also split out which games I would examine into 3 groups: all games, games where he got at least 11 assists, and games where he got between 11 and 17 assists. This was to try to account for right censoring from the end of the game. In other words, when we look at all games, we include games where he only got, say, 7 assists, and therefore we cannot hope to observe the difference in time to assist 8 vs assist 12. Choosing to cut at 17 assists was arbitrary and I am open to changing it to fewer or more.
Our main metric of interest is the time between assists, i.e. how many seconds of player time (so time when Westbrook is on the floor) occur between assists.
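As a rough sketch of how this metric could be computed, assume we already have, for one game, the timestamps of each assist expressed in seconds of Westbrook's on-court time. The function names and the sample timestamps below are hypothetical and are not taken from the actual data.

```python
def assist_group(assist_number):
    """Map an assist's within-game number to the groups used in this post."""
    if assist_number <= 7:
        return "1-7"
    if assist_number <= 10:
        return "8-10"
    return "11+"


def time_between_assists(assist_times):
    """Return (assist_number, group, gap_seconds) for each assist in one game.

    assist_times: seconds of player time at which each assist occurred.
    The first gap is measured from the start of the player's time on court.
    """
    rows = []
    previous = 0.0
    for number, t in enumerate(sorted(assist_times), start=1):
        rows.append((number, assist_group(number), t - previous))
        previous = t
    return rows


# Hypothetical game: assists at 120s, 300s and 330s of player time.
for row in time_between_assists([120.0, 300.0, 330.0]):
    print(row)
```

Measuring in player time rather than game clock means that minutes spent on the bench do not count toward the gap.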
First, let us take a look at some basic statistics, where we examine the mean, median, and standard deviation for the time to assist broken down by group and by the different sets of games. Again, this is in seconds of player time.
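For anyone who wants to reproduce this kind of summary, here is a minimal sketch using only the Python standard library. It assumes the gaps have already been computed and labelled with their assist group; the toy numbers are made up for illustration and are not Westbrook's.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def summarize_by_group(rows):
    """rows: iterable of (group, gap_seconds) pairs.

    Returns {group: (mean, median, standard deviation)} in seconds.
    The standard deviation is NaN for groups with a single observation.
    """
    by_group = defaultdict(list)
    for group, gap in rows:
        by_group[group].append(gap)

    summary = {}
    for group, gaps in by_group.items():
        sd = stdev(gaps) if len(gaps) > 1 else float("nan")
        summary[group] = (mean(gaps), median(gaps), sd)
    return summary


# Toy data, not real gaps.
toy_rows = [("1-7", 100.0), ("1-7", 200.0), ("1-7", 300.0),
            ("11+", 400.0), ("11+", 600.0)]
for group, stats in summarize_by_group(toy_rows).items():
    print(group, stats)
```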
We can see that if we look at all games, it appears that the time between assists goes down on average once Westbrook gets past 10 assists. However this sample of games includes games where he got upwards of 22 assists, which, given the finite length of games, means assists would tend to happen more frequently. Limiting ourselves to games with at least 11 assists, or games with 11-17 assists gives a view of a more typical game with many assists. We see in (1b) and (1c) that time to assist increases on average once Westbrook got his 10th assist.
However, these basic statistics only account for assists that Westbrook actually achieved; they do not account for any right censoring. That is, say Westbrook gets 9 assists in a game in the first half alone, and doesn’t record another assist all game despite playing, say, 20 minutes in the second half. If the game were to go on indefinitely, Westbrook eventually would record that 10th assist, say after 22 minutes. But since we never observe that hypothetical 10th assist, that contribution of 22 minutes isn’t included. Nor is even the 20 minutes of assist-less play. This basic censoring problem is why we use survival models.
Next we can plot Kaplan-Meier survival curves for Westbrook’s assists broken down by group and by the different sets of games. I used similar curves when looking at how players accrue personal fouls – and I’ll borrow my language from there:
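To make the censoring idea concrete, here is a small sketch of how the trailing assist-less stretch of a game could be recorded as a censored observation. The function name and the timestamps are hypothetical; the example mirrors the 9-assist scenario described above.

```python
def trailing_censored_gap(assist_times, total_player_seconds):
    """Return (next_assist_number, gap_seconds, observed) for the end of a game.

    The gap runs from the last recorded assist (or from 0 if there were none)
    to the end of the player's time on court. observed is False because the
    next assist never happened before the game ended.
    """
    last = max(assist_times) if assist_times else 0.0
    next_number = len(assist_times) + 1
    return (next_number, total_player_seconds - last, False)


# Hypothetical game: 9 assists, the last at 1000s of player time,
# followed by 1200s (20 minutes) of play without another assist.
nine_assists = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0, 1000.0]
print(trailing_censored_gap(nine_assists, total_player_seconds=2200.0))
```

A survival model can use this row: it says the 10th assist took at least 20 minutes, even though its true time was never seen.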
A survival curve, in general, is used to map the length of time that elapses before an event occurs. Here, they give the probability that a player has “survived” to a certain time without recording an assist (grouped as explained above). These curves are useful for understanding how a player accrues assists while accounting for the total length of time during which a player is followed, and allows us to compare how different assists are accrued.
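For readers who have not seen how such a curve is built, below is a minimal, dependency-free sketch of the Kaplan-Meier estimator. It is an illustration rather than the code behind the plots, and the durations are invented. Each observation is a gap in seconds plus a flag that is True if the assist was actually recorded and False if the gap was censored by the end of the game.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: gap lengths in seconds.
    observed:  True if the assist happened, False if the gap was censored.
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Collect everything that ends at time t (events and censorings).
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                events += 1
            leaving += 1
            i += 1
        if events:
            # Multiply by the share of those still at risk who "survive" t.
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Invented gaps: three observed assists and two censored stretches.
for time, prob in kaplan_meier([60, 120, 120, 200, 300],
                               [True, True, False, True, False]):
    print(time, round(prob, 3))
```

Censored gaps never cause a drop in the curve, but they stay in the at-risk count until the moment they end, which is how the estimator uses the partial information they carry.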
Here it is very easy to see that the time between assists increases significantly once Westbrook has 10 assists. This difference is apparent regardless of which subset of games we look at, though the increase is more pronounced when we ignore games with fewer than 11 assists. We can also see that the time between assists doesn’t differ significantly between the first 7 assists and assists 8 through 10.
Finally we could put the data into a conditional risk set model for ordered events. I’m not sure this is the best model to use for this data structure, given that I grouped the assists, but it will do for now. I recommend not looking at the actual numbers and just noticing that yes, there is a significant difference between the baseline and the group of 11+ assists.
If interested we can find the hazard ratios associated with each assist group. To do so we exponentiate the coefficients since each coefficient is the log comparison with respect to the baseline of the 1st through 7th assists. For example, looking at the final column, we see that, in games where Westbrook had between 11 and 17 assists, he was 63% less likely to record an assist greater than 10 versus how likely he was to record one of his first 7 assists (the baseline group). Interpreting coefficients is very annoying at times. The take away here is yes, there is a statistically significant difference.
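The arithmetic of going from a coefficient to a "percent more or less likely" statement can be sketched as follows. The coefficient used here is hypothetical: it is back-computed from the 63% figure quoted above, not read from the model output.

```python
import math


def hazard_ratio(coefficient):
    """Exponentiate a log-hazard coefficient to get a hazard ratio."""
    return math.exp(coefficient)


def percent_change_in_hazard(coefficient):
    """Express a hazard ratio as a percent change relative to the baseline."""
    return (math.exp(coefficient) - 1.0) * 100.0


# Hypothetical coefficient chosen so that the hazard ratio is 0.37,
# which reads as "63% less likely than the baseline group".
hypothetical_coef = math.log(0.37)
print(round(hazard_ratio(hypothetical_coef), 2))
print(round(percent_change_in_hazard(hypothetical_coef), 1))
```

A negative coefficient gives a hazard ratio below 1, meaning assists arrive more slowly than in the baseline group; a positive one gives a ratio above 1.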
Based on some simple analysis, it appears that the time between Russell Westbrook’s assists increased once he reached 10 assists. This may contribute to the narrative that he stopped trying to get assists after he reached 10. Perhaps this is because he stopped passing, or perhaps it’s because his teammates just shot less effectively on would-be-assisted shots after 10. Additionally, there are many other factors that could contribute to the increase in time between assists. Perhaps there is general game fatigue, and assist rates drop off for all players. Maybe those games were particularly close in score and therefore Westbrook chose to take jump shots himself or drive to the basket.
What’s great is that a lot of these ideas can be explored using the data. We could look at play by play data and see if Russ was passing at the same rates before and after assist number 10. We could test if assist rates decline overall in the NBA as games progress. I’m not sure which potential confounding explanations are worth running down at the moment. Please, please, please, let me know in the comments, via email, or on Twitter if you have any suggestions or ideas.
REMINDER: The above analysis is something I threw together in the days between my graduation celebrations and The Finals starting and isn’t as robust or detailed as I might like. Take with a handful of salt.
The weekend has come and gone and so has the 2017 Sloan Sports Analytics Conference. This was the third time I attended the conference and easily the most enjoyable experience I have had to date.
Many others have recapped a lot of the compelling analytics content, so I don’t feel compelled to repeat much of that. Moreover, I don’t have the journalistic abilities yet to condense everything I learned into a nice blog entry. AND I have a proper dissertation committee meeting this week, followed by the ENAR biometrics conference next week. Between the two, I haven’t been burdened with an abundance of time. So here are some thoughts on the conference, which will inevitably spiral into larger thoughts on the field as a whole.
My experience at SSAC this year was a weird mix of trying to see famous people speak, trying to hear interesting analytics/statistics talks, and trying to meet as many people as possible. In previous years I didn’t know anyone and wasn’t thinking seriously about a career in this field, so I prioritized panels with famous speakers. This was great for maximizing entertainment value. But now that I am making a proper attempt to pursue sports analytics as a career, it was clear I needed to actually understand where the field is and where it’s going… while still taking time to see big names where possible. Because who can resist Nate Silver and Mark Cuban or Nate Silver and Adam Silver? It’s clear that experiences at SSAC will vary greatly depending on interests and goals.
It’s also interesting to be at the conference while in a position of actively looking for a job. During almost every conversation I had, I was trying to maintain a balance between a number of potentially conflicting motivations. Mostly, I just wanted to nerd out and talk about sports stats with like-minded people. But I also wanted to make sure the work I am doing is in the right direction and get advice on how to be better. How can I improve my work not just to be better intrinsically, but also to have a bigger impact? And then at a certain point, especially if I was talking to somebody working for a team, I’d think “is this person on a team that is hiring? Would they want to hire me?” I’m better at networking than I used to be, but at the end of the day, I am still a somewhat awkward stats nerd. One big takeaway from the conference for me was that I need to be more aggressive and confident in general. It’s easy to have imposter syndrome. I eventually felt generally okay with the other stats folks, but at a conference with a lot of MBAs, it can be intimidating to talk to new people. Especially since I was in the minority at SSAC.
Yep, I’m going to talk about diversity for a minute. There are a lot of men at Sloan. A lot of white men. And of the women who are there, few are statisticians. I was lucky enough to meet Diana Ma who does analytics for the Indiana Pacers. We hugged out of sheer joy of finding another woman in sports stats. Diana is the first woman I have met in person who works for a team in any sport. I’ve been in STEM for most of my life, and I’m used to being in scenarios that are majority white male, but SSAC takes the cake. Conference attendees are, for the most part, aware of the demographic disparities. Not just about the lack of women, but the lack of any other minorities. And there are always conversations about how to increase the diversity of the conference and the field overall. I don’t have a good answer, but I’m glad people (including Daryl Morey) are talking about it.
Side note to the jerks on twitter, and elsewhere, questioning why diversity is important – this is for you. Even if you want to argue that diversity adds nothing to the end product, equality is important. Not everyone who had interest in the conference had access. And not everyone who might have had interest had access to resources to foster that growing interest.
Moreover, I distinctly remember being at SSAC in 2015 and hearing somebody say that women shouldn’t bother with this field, because it is such a man’s world. I can’t remember if it was on a panel or a conversation I overheard, but it struck a huge chord with me and was a large part of why I eschewed the field for so long. Fortunately, I am lucky enough to have incredibly supportive friends, family, and mentors.
Which brings me to a final, big takeaway from the conference this year. Success in sports analytics has a large component of luck. From the family into which you were born, to the school you attended and the TA you happen to have for a class, to who re-tweets you, to who you randomly happen to be sitting next to at a panel. Don’t get me wrong, you also need skill. You need to be good enough that when you are lucky enough to make a connection or have your blog post re-tweeted, people find value in it and pay attention.
Our entire careers are about quantifying uncertainty and randomness in the data we examine; we should acknowledge the randomness in our lives.
Anyway. I met a lot of really awesome people. I’m going to avoid trying to name everyone, because I’m sure I’ll forget somebody and then feel bad. But needless to say, everyone was friendly, smart, and incredibly welcoming. I wish the conference were a few days longer so things wouldn’t be so rushed, interesting panels wouldn’t overlap, and I’d have more time to chat with everyone. It’s all well and good to talk over email or the phone, but in person conversations are ideal. Maybe next year I just won’t sleep.
I hope everyone makes it out to NESSIS in September.
- Does specializing in a sport early really increase the risk of injury later in life? I think so. So do a lot of other people. But I also spoke to some folks this weekend who don’t buy it. Awesome. Let’s run the numbers. And then do it a few more times to make sure we have reproducible work.
- I was on the Hot Takedown live recording. I may or may not have totally whiffed a question about the Warriors. Caught the tail end of John Urschel’s segment. He’s great.
- Highlight of the weekend was a ~20 minute 1-on-1 conversation with John Urschel. We talked about causal inference, Voronoi diagrams, and Super Bowl win probability models. He is a giant nerd and an incredibly warm person.
- Mark Cuban does not like Donald Trump.
- I love when athletes are on panels. Especially random additions. It’s nice to see Sue Bird and Shane Battier every year, but they are used to it by now. Luis Scola was a last minute addition to a few panels, and he gave thoughtful insights into how players use analytics.
- Luis Scola is a very tall man. So were a lot of the men at this conference. I am 5’4″.
- I know I want to pursue a career in sports analytics/statistics, but I have no idea of the best avenue. Should I try to join a team? The NBA? An independent company? Should I go get a regular job that will pay more and/or be less time intensive and pursue my own projects on the side? Which of these paths makes the most impact from the diversity side of things?
- I wonder if Bob Myers would be my agent when negotiating a job offer.
- Zach Lowe’s voice is as enjoyable in person as it is on his podcast.
- I still have yet to meet/introduce myself to Mike Zarren. Which is insane given the number of events I have been at with him, my love for the Celtics, and the fact that I personally know another member of the Celtics’ analytics team. At this point, I almost want to see if I can meet Brad Stevens and Danny Ainge before meeting Mike.
- The name tags this year were not conducive to reading names. I wonder if that was intentional.
- Hynes >>>>>> BCEC
- Were we supposed to get two drink tickets? I only got one. But I feel like I got two last year.
- Years ago I did a project on optimal strategy for penalty kicks in the World Cup. I should update that.
- I have so many ideas for projects. So many. But this pesky PhD thing is going to get in the way for a few months. I’ll be able to put some stuff out, but school is going to take priority for a while.
- I love sports analytics so much.
This is Part 3 of my series on DeMarcus Cousins and how NBA players accrue personal fouls.
Part 2 can be found here.
Part 1 can be found here.
I strongly recommend reading at least Part 2 before continuing as I reference it.
To provide more statistical rigor, we analyze our players using a conditional risk set model for ordered events. This model, first proposed by Prentice, Williams, and Peterson, models the hazard at each foul event time as a function of the current number of fouls accumulated and time since the last foul. The model is flexible and can include other covariates as needed. For this paper, our covariates include the lead or deficit in the score of the player’s team, game time in minutes, and an interaction between the two. We chose these covariates, as we believe that a closer game can have an impact on a player’s fouling rates. We include actual game time in minutes to reflect how close the game is to ending, and to account for potential overtime periods.
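Let $T_{ik}$ and $C_{ik}$ be the foul and censoring time for the kth foul (k=1, 2, …, 6) in the ith game and let $Z_{ik}$ be the vector of covariates for the ith game and with respect to the kth foul. We assume $T_{ik}$ and $C_{ik}$ are independent given $Z_{ik}$. We then define $X_{ik} = \min(T_{ik}, C_{ik})$ and let $\beta$ be a vector of unknown regression coefficients. Under the proportional hazard assumption, the hazard function of the ith game and for the kth foul is:
$$\lambda_{ik}(t \mid Z_{ik}) = \lambda_{0k}(t - t_{i,k-1}) \exp(\beta^\top Z_{ik})$$
where $\lambda_{0k}$ is the baseline hazard for the kth foul and $t_{i,k-1}$ is the time of the previous foul in the ith game (with $t_{i,0} = 0$).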
From Table 2, we can see that the difference in score has minimal impact on player fouling rates for Cousins, Horford, and Lopez, even after adjusting for game time. Closer games do not seem to cause more fouls to be committed. However, the total game time that has been played has an impact: as time goes on, it appears that players are less likely to foul. This trend holds true for our three players of interest and for all players when pooled together, which is surprising considering that players are more likely to foul later in the game. This analysis shows that, as the game goes on, players are more likely to foul if they have already fouled. If a player has not yet fouled in the game, they are less likely to foul, since time has a negative relationship with the likelihood of fouling. This trend holds true for all centers we analyzed. These results are in line with what we saw in Figure 1. Moreover, these results are similarly likely due to the selection bias we have that precludes us from seeing every foul in every game.
As before, we can limit our analysis to games where the players had at least 5 fouls, and analyze the first four fouls. Table 3 displays the survival model output for Cousins, Horford and Lopez when we use the restricted dataset. For all players, fouls 2, 3, and 4 are committed significantly sooner than the prior foul. To find the hazard ratios associated with each foul, we exponentiate the difference in the coefficients since each coefficient is with respect to the baseline of the 1st foul. For example, when Cousins has 3 fouls he is 405% more likely to commit a foul at any given time than when he only has 2 fouls. Cousins is 303% more likely to commit a foul when he has four fouls compared to when he only has three. Although the hazard ratios increase dramatically with each foul, it is important to keep in mind that the initial probability of fouling at any given moment is low, as the initial foul takes nearly 500 seconds (over 8 minutes) to take place on average for DeMarcus Cousins.
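The step of exponentiating a difference of coefficients can be sketched in a few lines. The two coefficients below are hypothetical placeholders, chosen only so that the resulting ratio matches the 405% figure quoted above; they are not values from Table 3.

```python
import math


def hazard_ratio_between(coef_later, coef_earlier):
    """Hazard ratio of one foul level versus another.

    Both coefficients are log hazard ratios relative to the same baseline
    (the 1st foul), so the baseline cancels when we take their difference.
    """
    return math.exp(coef_later - coef_earlier)


def percent_more_likely(ratio):
    """Convert a hazard ratio into a 'percent more likely' figure."""
    return (ratio - 1.0) * 100.0


# Hypothetical coefficients (NOT from Table 3).
coef_two_fouls = 1.0
coef_three_fouls = 1.0 + math.log(5.05)

ratio = hazard_ratio_between(coef_three_fouls, coef_two_fouls)
print(round(ratio, 2), round(percent_more_likely(ratio)))
```

Note that "405% more likely" corresponds to a hazard ratio of 5.05, not 4.05, since a ratio of 1 means no change.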
It is interesting to note that the opposite effect happens with game time. As each minute passes in the game, Cousins is only 90% as likely to commit a foul as the previous minute. This trend holds for all players.
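To get a feel for how a per-minute effect adds up, here is a small sketch that compounds the 90% figure over several minutes. It assumes the effect is the same for every minute and ignores the interaction with score, so it is a simplification for illustration only.

```python
def relative_hazard_after(minutes, per_minute_ratio=0.9):
    """Hazard relative to minute 0 after a number of minutes has elapsed.

    Under proportional hazards with game time entered linearly, each extra
    minute multiplies the hazard by the same ratio, so the effect compounds.
    """
    return per_minute_ratio ** minutes


for m in (1, 5, 10, 24):
    print(m, round(relative_hazard_after(m), 3))
```

Under that assumption, ten minutes of game time cuts the hazard to roughly a third of its starting level, all else held equal.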
From the table, we can see that although all players seem to have this “tilting” behavior, DeMarcus Cousins has a higher likelihood of committing a foul than other players as he accrues fouls. Cousins seems to “tilt” more than other centers in our analysis. Part of this behavior may be explained by teams attacking players who already have many fouls, attempting to get them in foul trouble. However, we believe that no one factor can tell the complete story.
Inner statistician reaction:
Oh God, Sloan is going to be *insufferable* this year, isn’t it?
Inner sports fan reaction:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHHHHHHHHH WOW WOW WOW WOW WHAT A GAME!!!!!!
Every time I attend a talk about sports statistics, or read an interview, or talk to people in the field, inevitably the question of “how do I break into the industry” arises. And inevitably, the answer includes “put your stuff out there.”
Per that advice, this site will be my chance to share the work I have done in sports analytics.
Though to be fair, I was sorely tempted to purposefully *not* make a blog/site in order to act as a control of sorts. But I really did want to share some of the fun things I’ve been working on and solicit ideas for how to improve.
This site will evolve over time – depending on how focused on my dissertation I am at any given time. It will focus mostly on sports statistics/analytics, but really anything is game.