The Glicko rating system was developed by Mark Glickman (chairman of the US Chess Federation (USCF) ratings committee) as an improvement upon Elo. The Glicko rating system is very similar to Microsoft's TrueSkill rating system. Both are based on Bayesian reasoning and provide both a rating and an uncertainty. For two-player games that don't produce ties (e.g., basketball), the only significant difference is that Glicko uses a logistic distribution of performance ratings rather than the Gaussian distribution used by TrueSkill. It seems unlikely that this small difference would result in a significant difference in predictive performance. (The Glicko-2 system adds a "volatility" factor, and this might be worth investigating at some point.)
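To make the logistic-versus-Gaussian distinction concrete, here is a minimal sketch of the two expected-score curves. The scale parameters (400 for the logistic form, as in Elo/Glicko; a `beta` of 200 for the Gaussian form) are illustrative choices, not values taken from either system's specification:

```python
import math

def logistic_expectation(r_a, r_b, scale=400.0):
    """Elo/Glicko-style win probability: logistic CDF of the rating gap."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

def gaussian_expectation(r_a, r_b, beta=200.0):
    """TrueSkill-style win probability: Gaussian CDF of the rating gap."""
    return 0.5 * (1.0 + math.erf((r_a - r_b) / (math.sqrt(2.0) * beta)))

# Both curves agree at even ratings and differ mainly in the tails:
# the logistic gives the underdog slightly more probability in lopsided matchups.
```

For ratings within a typical range the two curves are close, which is why one would expect little difference in predictive performance between them.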
The Prediction Tracker tracks the accuracy of various NCAA basketball rating systems, both straight-up (win-loss) and against the Las Vegas line. The highest-rated system as of the end of the 2010-2011 season belonged to Jon Dokter. Dokter's ratings are intended to be predictive (he sells wagering advice for $10.99/week) but are not well explained. This page provides a general overview of his methods, but there is not enough detail to replicate his rating system for testing.
Roy Bethel proposed a ranking system based upon "Maximum Likelihood Estimation" specifically to address the problem of sports with unequal strengths of schedule, i.e., where teams do not play a complete round-robin. (His paper is available in the papers archive.) While Bethel's rating system appears interesting, it cannot handle winless or lossless teams. Since both happen with some regularity in college basketball (and are certain to occur for a significant portion of the season), using this rating is problematic.
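The winless/lossless problem is easy to demonstrate with a standard maximum-likelihood rating model. The sketch below uses the Bradley-Terry model with the classic Zermelo/Ford iterative update; this is a common MLE-based rating, not necessarily Bethel's exact formulation, but it exhibits the same degeneracy: a winless team's estimated strength collapses to zero (and symmetrically, a lossless team's diverges):

```python
from collections import Counter, defaultdict

def bradley_terry(games, n_teams, iters=100):
    """Iterative MLE for Bradley-Terry strengths.

    games: list of (winner, loser) index pairs.
    Returns a list of strength estimates, normalized to sum to n_teams.
    """
    wins = Counter(w for w, _ in games)
    opponents = defaultdict(list)      # one entry per game played
    for w, l in games:
        opponents[w].append(l)
        opponents[l].append(w)

    p = [1.0] * n_teams
    for _ in range(iters):
        new_p = []
        for i in range(n_teams):
            denom = sum(1.0 / (p[i] + p[j]) for j in opponents[i])
            # MM update: p_i = W_i / sum_over_games 1/(p_i + p_opp)
            new_p.append(wins[i] / denom if denom else p[i])
        total = sum(new_p)
        p = [x * n_teams / total for x in new_p]
    return p

# Teams 0 and 1 split their games; team 2 loses everything.
ratings = bradley_terry([(0, 1), (1, 0), (0, 2), (1, 2)], 3)
# Team 2's estimate is driven to exactly zero -- no finite
# strength maximizes the likelihood for a winless team.
```

Early in a season, when many teams are still winless or undefeated, every such team gets a degenerate estimate, which is why this class of rating is hard to use for regular-season prediction.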
NCAA Tournament-Specific Ratings
A number of people have created rating systems specific to predicting the NCAA tournament, e.g., Bradley West and Nate Silver. Most of these rely on seeding, human polls, or other information that exists only for the tournament, which makes them unsuitable for predicting regular-season games.
I'm interested in any pointers to other rating systems that I should investigate, particularly if they have a fundamentally different approach than the ones I've covered already. Send me an email (firstname.lastname@example.org) or leave a comment.
Summary of Results
In total, I tested about 110 algorithm variants. (Some multiple times as I uncovered errors in my code!) The following table summarizes the best performances for each algorithm:
|Predictor|% Correct|MOV Error|
There are a couple of interesting points to be gathered from this.
First, the best performer (TrueSkill) represents about a 15% improvement over always picking the home team, but only about a 2% improvement over the standard RPI. On MOV Error, it fares somewhat better, being a 22% improvement over picking the home team, and about 5% over RPI. But given the complexity of TrueSkill compared to the 1-Bit algorithm (or even standard RPI), that isn't as much improvement as we might have hoped to see.
Second, we note that the performance of our implementation of Elo is very close to the tracked performance of Sagarin Elo at the Prediction Tracker. That gives us some confidence in these results. (On the MOV Error side, there seems to be about a 1.5 point bias between my measurements and those on the Prediction Tracker. I don't know why that would be.)
Third, if we compare TrueSkill to the rating systems tracked at the Prediction Tracker, we see it would have beaten all systems except Jon Dokter's -- and his system makes use of MOV. So even without making use of MOV, we have a system that is competitive with the best systems available.