What follows is some info on the work Sameer Deshpande and I did for the 2019 NFL Big Data Bowl. We will be presenting this work at the BDB Finals at the NFL Combine in Indianapolis on February 27th.
We are in the process of putting together a series of blog posts that will explain our method in, hopefully, an easily digestible way. Until then, we wanted to share a copy of the paper as it was submitted to the contest.
We note that there are a few caveats:
1. This is very much a proof of concept. EHCP (expected hypothetical completion probability) is a modular framework with many pieces. We have put the pieces together, but none of them is optimized yet.
2. There are many technical and conceptual details to discuss. We’re going to dive into many of these details in the coming blog posts. Additionally, we’re happy to discuss the paper with particularly interested parties.
3. Please be patient! We’re posting the paper as we submitted it to the Big Data Bowl contest. We recognize that the write-up is somewhat technical and terse when it comes to the finer details of our methodology. Over the last few weeks, we’ve received great feedback and questions from friends and colleagues in sports and academia. Our plan for the next several posts is to respond to this feedback and address many of the initial questions. If you send us a bunch of burning questions and we don’t respond, it’s not entirely because we’re avoiding you.
That being said, here is a quick FAQ:
Q: Did you think about including other variables, such as QB pressure, time from snap to throw, defensive schemes, player information, etc.?
A: We considered many variables, but had to limit scope due to time constraints. Incorporating additional variables is a clear opportunity for further work.
Q: Why BART?
A: Across a growing range of problems, BART (Bayesian Additive Regression Trees) has shown strong predictive performance with minimal hyperparameter tuning and without the need to pre-specify a functional relationship between inputs and outputs. Although we didn’t do so in our analysis, BART can also be adapted for feature selection. That said, it’s entirely plausible that another regression technique would work well for this problem.
Q: Did you consider other outcomes like YAC or expected yards gained?
A: We did. Ultimately, we may want to maximize the expected value of a play given the input variables x, E[value | x], which we can decompose as:
E[value | x] = E[value | catch, x] * P(catch | x) + E[value | no catch, x] * P(no catch | x)
We focused on the P(catch | x) part.
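To make the decomposition concrete, here is a minimal sketch in Python. The function name and the numbers are hypothetical and chosen only for illustration; they are not estimates from our paper.

```python
def expected_play_value(p_catch, value_if_catch, value_if_no_catch):
    """Combine the two conditional expectations via the law of total expectation.

    p_catch           -- P(catch | x)
    value_if_catch    -- E[value | catch, x]
    value_if_no_catch -- E[value | no catch, x]
    """
    if not 0.0 <= p_catch <= 1.0:
        raise ValueError("p_catch must be a probability in [0, 1]")
    # P(no catch | x) is simply 1 - P(catch | x).
    return value_if_catch * p_catch + value_if_no_catch * (1.0 - p_catch)


# Hypothetical inputs: a 50% catch probability, a value of 2.0 on a catch,
# and a value of -1.0 on an incompletion.
example_value = expected_play_value(
    p_catch=0.5, value_if_catch=2.0, value_if_no_catch=-1.0
)
```

In this sketch each term can come from a separate model, which is why we could work on P(catch | x) on its own.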
Q: Wait a minute! You need to do a better job of modeling the conditional distribution of the unobserved variables given the observed ones. There is no way they are independent, especially since they may change as the route develops.
A: That’s not really a question, but we agree. Handling the missing variables is one of the modular parts of the framework and can be optimized independently of the other parts. It is an interesting missing data question in its own right.
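One generic way to handle unobserved inputs is to average the catch probability over draws of the unobserved variables from their conditional distribution given the observed ones. The sketch below illustrates that idea only; it is not the implementation from our paper. Both `catch_prob` and `sample_unobserved` are hypothetical stand-ins with made-up coefficients, and either could be replaced by a better model without touching the other.

```python
import math
import random


def catch_prob(x_obs, x_unobs):
    """Hypothetical stand-in for a fitted model of P(catch | x_obs, x_unobs)."""
    z = 0.8 * x_obs - 0.5 * x_unobs  # made-up coefficients
    return 1.0 / (1.0 + math.exp(-z))


def sample_unobserved(x_obs, rng):
    """Hypothetical conditional draw of x_unobs given x_obs.

    The mean depends on x_obs, so the two are not treated as independent.
    """
    return rng.gauss(2.5 + 0.1 * x_obs, 0.5)


def marginal_catch_prob(x_obs, n_draws=2000, seed=0):
    """Monte Carlo estimate of P(catch | x_obs), averaging over x_unobs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x_unobs = sample_unobserved(x_obs, rng)
        total += catch_prob(x_obs, x_unobs)
    return total / n_draws
```

Swapping in a richer `sample_unobserved`, for example one that depends on how the route has developed so far, changes only that one function, which is what we mean by modular.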
Q: Who is the best QB/WR?
A: We didn’t have enough data to draw any strong conclusions. Jameis Winston looked great in the data we had available to us.
Q: Does this run on the block chain?
Q: Did y’all try deep lear–