I took some time out recently to read through some of the basketball prediction papers from this year's MIT Sloan Sports Analytics Conference. Here are some thoughts...
Insights from the LRMC Method for NCAA Tournament Prediction
Mark Brown, Paul Kvam, George Nemhauser, Joel Sokol
MIT Sloan Sports Analytics Conference 2012
The latest paper from the LRMC researchers compares the performance of LRMC to over 100 other ranking systems as reported by Massey here. The performance measure is the percentage of NCAA tournament games predicted correctly. LRMC outperforms all of the other rankings, getting 75.5% correct over 9 years; the next-best predictor got 73.5%. (I don't optimize my predictor on this metric, but it also gets about 73.5% correct.)
The LRMC work is always interesting and well done. A few notes that come to mind:
(1) The advantage LRMC has over the other models is not huge. LRMC gets 75.5% correct; the 20th-ranked model gets about 72% correct -- a difference of about 3 games per tournament. That's a real edge, but in a test set of only 600 games it may not be statistically meaningful: one very good year (or one very bad year) could move a rating substantially. It would be interesting to see the year-to-year performance of the ratings, but the authors don't provide that information.
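To put some rough numbers on that (my own back-of-the-envelope calculation, not anything from the paper), here is the sampling error on an accuracy estimate from roughly 600 games:

```python
import math

# Rough 95% confidence intervals for accuracy estimated from ~600 games.
n = 600                      # approximate size of the test set
p_lrmc, p_20th = 0.755, 0.72

def ci_halfwidth(p, n):
    """Half-width of a normal-approximation 95% CI for a proportion."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

print(f"LRMC:      {p_lrmc:.1%} +/- {ci_halfwidth(p_lrmc, n):.1%}")
print(f"20th-best: {p_20th:.1%} +/- {ci_halfwidth(p_20th, n):.1%}")
# Both intervals are about +/- 3.5% -- roughly the size of the gap
# itself -- so a single unusual year could plausibly reshuffle the list.
```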
(2) The authors assume there is no home court advantage (HCA) in the NCAA tournament and simply predict that the higher-rated team will win. In my testing, including an HCA for the higher-seeded team improves prediction performance. For example, this paper reports the performance of RPI as about 70% in predicting tournament games. In my testing, RPI with HCA predicted about 73% correctly. So the results may be skewed depending upon how much effect HCA has on each prediction model. (The authors don't use HCA for LRMC, so that model might do better as well.)
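To make the HCA adjustment concrete, here is a minimal sketch of the two decision rules being compared. The ratings are assumed to be on a point-margin scale, and the 2.5-point bonus for the higher seed is an illustrative value of mine, not a fitted one:

```python
def pick_no_hca(rating_a, rating_b):
    """Pick the higher-rated team, as the LRMC paper does."""
    return "A" if rating_a > rating_b else "B"

def pick_with_hca(rating_a, rating_b, seed_a, seed_b, hca=2.5):
    """Shift the predicted margin toward the higher-seeded team
    (lower seed number) before picking a winner."""
    margin = rating_a - rating_b
    if seed_a < seed_b:
        margin += hca
    elif seed_b < seed_a:
        margin -= hca
    return "A" if margin > 0 else "B"

# A slight rating underdog with the better seed flips under the HCA rule:
print(pick_no_hca(70.0, 72.0))           # B
print(pick_with_hca(70.0, 72.0, 3, 6))   # A
```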
(3) In this paper, the authors test against all the matchups that actually occurred in the tournament -- that is, they do not "fill out a bracket" and commit to game predictions at the beginning of the tournament. In 2011, LRMC was included in the March Madness Algorithm Challenge and finished quite poorly -- outscored by all but three of the other entrants. (A similar result can be seen here.) Looking at the LRMC bracket for 2012 (here), LRMC got 22 of the initial 36 games correct -- it got only one of the three play-in games right, missed all of the upsets, and predicted two upsets that did not occur. Eight of the entries in the Algorithm Challenge picked more first-round games correctly. In fact, LRMC's only correct predictions in the entire tournament were higher seeds over lower seeds, and once again it would have lost the Algorithm Challenge.
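The difference between the two tests matters more than it might seem. Here is a toy four-team example of my own showing how the same model can score differently depending on which test you run:

```python
# Toy 4-team bracket: B is the model's favorite, but D upsets B.
ratings = {"A": 8, "B": 10, "C": 5, "D": 2}
semis = [("A", "C"), ("B", "D")]
actual = {("A", "C"): "A", ("B", "D"): "D", ("A", "D"): "A"}

pick = lambda x, y: x if ratings[x] >= ratings[y] else y

# Test 1 (this paper): predict each game that actually occurred.
per_game = sum(pick(*g) == w for g, w in actual.items())          # 2 of 3

# Test 2 (a bracket pool): commit up front. The champion pick is B,
# who never even reaches the final once D pulls the upset.
champ = pick(pick(*semis[0]), pick(*semis[1]))
bracket = sum([pick(*semis[0]) == "A", pick(*semis[1]) == "D",
               champ == "A"])                                      # 1 of 3

print(per_game, "vs", bracket)   # same model, different test, different score
```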
(4) My own attempts to implement LRMC and use it to predict margin of victory (found here) have performed more poorly (around 72%) than the authors report in this paper. It may be that my implementation of LRMC was faulty, or that LRMC happened to perform slightly worse on my test data than on the tournament games used in this paper.
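For anyone curious what an implementation involves, here is the skeleton of the Markov-chain half of LRMC as I understand it: each team is a state, each game casts weighted "votes" between its two teams based on the home margin of victory, and the ranking is the stationary distribution of the resulting chain. The logistic coefficients below are illustrative placeholders, not the published values:

```python
import numpy as np

def lrmc_ratings(games, n_teams, a=0.05, b=0.1):
    """Sketch of the LRMC Markov chain (my understanding, not the
    authors' code).

    games: list of (home, away, home_margin) tuples, team ids in
    [0, n_teams). rx is the logistic estimate of the probability that
    the home team is the better team, given it won by `home_margin`
    at home; a and b stand in for the fitted logistic coefficients.
    Assumes every team has played at least one game.
    """
    T = np.zeros((n_teams, n_teams))
    for home, away, margin in games:
        rx = 1.0 / (1.0 + np.exp(-(a * margin + b)))
        # Each game contributes weight 1 to each participant's row:
        # credit flows toward whichever team is more likely the better one.
        T[home, home] += rx
        T[home, away] += 1.0 - rx
        T[away, home] += rx
        T[away, away] += 1.0 - rx
    T /= T.sum(axis=1, keepdims=True)   # row-normalize into transitions

    # Stationary distribution by power iteration; more mass = better team.
    pi = np.full(n_teams, 1.0 / n_teams)
    for _ in range(1000):
        pi = pi @ T
    return pi
```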
Moving on from the performance of LRMC, there are a few other interesting results in this paper. One is that home court advantage does not vary substantially from team to team, which confirms my own experiments. (I don't think I've reported on those experiments -- perhaps I'll write them up.) A second is that the natural variance in game outcomes is around 11 points, which closely matches what I've found. The last is that the authors found no validity in the cliche that "good teams win close games."
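One handy consequence of that ~11-point figure: if you treat the actual margin as the predicted margin plus normal noise with a standard deviation of about 11 points, you can convert a predicted margin of victory into a win probability. (This is my own usual trick, not something from the paper.)

```python
from math import erf, sqrt

def win_probability(predicted_mov, sigma=11.0):
    """P(win) if the actual margin is predicted_mov + Normal(0, sigma)."""
    return 0.5 * (1.0 + erf(predicted_mov / (sigma * sqrt(2.0))))

# e.g. a 4-point favorite wins about 64% of the time under this model
print(f"{win_probability(4.0):.1%}")
```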
Can Statistical Models Out-predict Human Judgment?: Comparing Statistical Models to the NCAA Selection Committee
Luke Stanke
MIT Sloan Sports Analytics Conference 2012
As with the LRMC paper, this paper looks at predicting tournament game outcomes. In this case, the author compares the NCAA committee seedings and RPI ratings to four different Bradley-Terry models. For more on Bradley-Terry models, see here.
Stanke reports the results of testing these models against (approximately) the same games used in the LRMC paper:
The highest performing models are the Bradley-Terry models using only win/loss data. These two models correctly predicted approximately 89% of games in the NCAA tournament games from the past eight seasons. The next group of models is the Bradley-Terry models using points as a method for ranking teams. These models predicted over 82% of games correctly. The third group is the alternative models: the Committee Model, the RPI Model, and the Winning Percentage Model. These models range from 69.1% to 72.9% of games correctly picked.
This is certainly an interesting result -- particularly in light of the claims of the LRMC paper. According to the LRMC authors, LRMC's 75.5% success rate out-performed over 100 other rankings from Massey's page, and the Vegas line's success rate of ~77% is an upper bound on performance.
So what explains this disparity? I didn't know, so I sent off an email to the author. Luke Stanke replied to say that the result was caused by a coding error, and that the actual performance was around 72%. (I know all about coding errors... :-) So his results are in line with the expected performance for Bradley-Terry-type rating systems. His conclusions remain unchanged -- that computer rating systems are better than the committee at selecting and seeding the tournament, and that Bradley-Terry would be better than RPI. I won't disagree with either conclusion! :-)
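For readers who haven't run into them, a Bradley-Terry model gives each team a strength theta_i and models P(i beats j) = theta_i / (theta_i + theta_j); the strengths can be fit from win/loss data with a standard iterative update. Here is a minimal sketch of the textbook algorithm, not anything specific to Stanke's paper:

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a matrix of pairwise wins.

    wins[i][j] = number of times team i beat team j. Uses the standard
    MM update theta_i = W_i / sum_j(n_ij / (theta_i + theta_j)).
    Assumes every team has at least one win; real data needs regularizing.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins + wins.T                  # games played between each pair
    w = wins.sum(axis=1)               # total wins per team
    theta = np.ones(len(wins))
    for _ in range(iters):
        denom = (n / (theta[:, None] + theta[None, :])).sum(axis=1)
        theta = w / denom
        theta /= theta.sum()           # normalize for identifiability
    return theta

# Toy example: A beat B twice, B beat C once, C beat A once.
wins = [[0, 2, 0],
        [0, 0, 1],
        [1, 0, 0]]
print(bradley_terry(wins))   # A comes out rated highest
```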
Using Cumulative Win Probabilities to Predict NCAA Basketball Performance
Mark Bashuk
MIT Sloan Sports Analytics Conference 2012
Bashuk lists his affiliation as "RaceTrac Petroleum," so like me he appears to be an interested amateur in game prediction. In this paper he describes a system that uses play-by-play data to create "Cumulative Win Probabilities" (CWP) for each team and, eventually, a rating. He uses this rating to predict game outcomes, and for the 2011-2012 season it correctly predicts 72.6% of games. In comparison, Pomeroy predicts 77.7% correctly and the Vegas opening line 75.2%.
It is unclear to me after reading the paper exactly how CWP and the ratings are calculated. However, unlike most authors, Bashuk has made his code available on the Web. (URLs are provided in Appendix 1 of the paper.) This is very welcome to anyone trying to reproduce his results. Unfortunately for me, Bashuk's code is in SQL, which I don't understand well. So poring through it and understanding his process may take some time.
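In the meantime, my best guess at the general flavor of the approach -- and it is only a guess, since the paper doesn't spell out the calculation -- is something like this: estimate an in-game win probability from the score margin and time remaining at each point in the play-by-play, then summarize a team's performance in a game as the average of those probabilities.

```python
from math import erf, sqrt

def in_game_win_prob(margin, seconds_left, sigma_per_game=11.0,
                     game_secs=2400):
    """In-game win probability for the team leading by `margin`.

    Treats the remaining scoring as normal noise whose variance scales
    with time left -- a common simplification, and only my guess at the
    kind of model behind CWP.
    """
    frac = max(seconds_left, 1) / game_secs
    sigma = sigma_per_game * sqrt(frac)
    return 0.5 * (1.0 + erf(margin / (sigma * sqrt(2.0))))

def cumulative_win_prob(states):
    """Average in-game win probability over (margin, seconds_left) states."""
    return sum(in_game_win_prob(m, t) for m, t in states) / len(states)

# A team that led by 5 at halftime and 8 late looks dominant overall:
print(f"{cumulative_win_prob([(5, 1200), (8, 120)]):.1%}")
```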