
Monday, June 20, 2011

MOV-Based Bradley-Terry

Today I'm going to look at adding MOV to Bradley-Terry style ratings.

Recall that in Bradley-Terry style ratings, we're trying to select a rating R for every team such that the odds of Team I winning a matchup with Team J turn out to be equal to:

Odds_i = R_i / (R_i + R_j)
So how do we determine R?  The trick is to look at our set of historical outcomes and find the values for R that maximize the likelihood that what actually happened would have happened.  (This is called maximum likelihood estimation, or MLE.)  Various mathematically gifted folks figured out that you could iteratively determine the proper values for R using this equation:
R_i = W_i * [ Sum over j of 1/(R_i + R_j) ]^-1
where W_i is the number of wins by Team I, and the sum is over all the games played by Team I.
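
To make the iteration concrete, here's a minimal sketch in Python of fitting these ratings to a handful of made-up game results (the game list, starting values, and rescaling to an average rating of 100 are all just illustrative choices):

    import collections

    # Made-up (winner, loser) results -- purely illustrative data.
    games = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("B", "A")]

    teams = {t for g in games for t in g}
    ratings = {t: 1.0 for t in teams}           # start everyone at the same rating
    wins = collections.Counter(w for w, _ in games)
    opponents = collections.defaultdict(list)   # one entry per game played
    for w, l in games:
        opponents[w].append(l)
        opponents[l].append(w)

    for _ in range(100):                        # iterate until the ratings settle
        new = {}
        for t in teams:
            denom = sum(1.0 / (ratings[t] + ratings[o]) for o in opponents[t])
            new[t] = wins[t] / denom            # R_i = W_i * [ Sum 1/(R_i+R_j) ]^-1
        mean = sum(new.values()) / len(new)
        ratings = {t: 100.0 * r / mean for t, r in new.items()}   # fix the scale

    print(ratings)

The rescaling at the end of each pass is there only because the ratings are determined up to a constant factor; any convenient normalization will do.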

Those of you who have been following along carefully from home will recognize this as the update equation from the KRACH rating, and indeed, KRACH is just Bradley-Terry for (originally) hockey teams.

Spartan Dan over at "The Only Colors" blog suggests (in this posting) a rating system that awards a variable amount of credit for each win, based upon the MOV:
W(mov) = 1 / (1 + e^(-mov/k))
In this equation, "k" is a scaling constant.  The effect of this equation is to award a 1/2 win for an MOV of zero, about 3/4 of a win for an MOV of "k", and then asymptotically to 1 win for greater MOVs.  The idea here is to limit the impact of "blowouts" on the rating system.
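
Here's what that curve looks like numerically; a quick sketch, with the sample margins chosen by me and k=5 as in the next paragraph:

    import math

    def mov_win_credit(mov, k=5.0):
        # Win credit awarded for a margin of victory of `mov` points.
        return 1.0 / (1.0 + math.exp(-mov / k))

    for mov in (0, 5, 10, 20, 40):
        print(mov, round(mov_win_credit(mov), 3))
    # 0 -> 0.5, 5 -> 0.731, 10 -> 0.881, 20 -> 0.982, 40 -> 1.0 (to 3 places)

In the Bradley-Terry update, W_i then becomes the sum of these fractional win credits over Team I's games rather than a simple count of wins (presumably with the complement, 1 - W(mov), credited to the loser, since an MOV of zero splits the game evenly).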

It's fairly easy to plug this equation for W into our implementation of KRACH.  With k=5 (as suggested by Spartan Dan), it gives this performance:

  Predictor            % Correct    MOV Error
  Govan (best)           73.5%        10.80
  KRACH (original)       71.5%        11.50
  KRACH + MOV (k=5)      72.0%        11.36
  KRACH + MOV (k=10)     72.2%        11.34

As you can see, it provides a small improvement over KRACH without MOV.  A little experimentation shows that performance peaks around k=10, though the gain over k=5 is small.

As I noted in the original posting on KRACH, since the Bradley-Terry method is aimed at producing odds, it isn't particularly suited for predicting MOV.  (Although it's reasonable to think that a team with greater odds of victory is likely to win by more points.)  Even so, it's a little disappointing that adding MOV doesn't provide more improvement in the Bradley-Terry model.

Tuesday, May 24, 2011

RPI-Like Summary

I've re-implemented the algorithms that had errors, and the updated results are included below.   But before summarizing the results so far, I'll briefly mention a few others that I did not implement.

Glicko

The Glicko Rating system was developed by Mark Glickman (chairman of the US Chess Federation (USCF) ratings committee) as an improvement upon ELO, and it is very similar to Microsoft's TrueSkill rating system.  Both are based on Bayesian reasoning and provide both a rating and an uncertainty.  For two-player games that don't produce ties (e.g., basketball), the only significant difference is that Glicko uses a logistic distribution of performance ratings rather than the Gaussian distribution used by TrueSkill.  It seems unlikely that this small difference would result in a significant difference in predictive performance.  (The Glicko-2 system adds a "volatility" factor, and this might be worth investigating at some point.)
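
To give a feel for how similar the two curves are, here's a rough sketch comparing the expected score each assigns to a given rating difference (the 400-point logistic scale is the conventional Elo/Glicko choice; the Gaussian beta of 200 is simply my assumption, picked to put the two curves on a comparable scale):

    import math

    def logistic_expectation(diff, scale=400.0):
        # Elo/Glicko-style logistic curve
        return 1.0 / (1.0 + 10.0 ** (-diff / scale))

    def gaussian_expectation(diff, beta=200.0):
        # TrueSkill-style normal CDF: Phi(diff / (sqrt(2) * beta))
        return 0.5 * (1.0 + math.erf(diff / (2.0 * beta)))

    for diff in (0, 100, 200, 400):
        print(diff,
              round(logistic_expectation(diff), 3),
              round(gaussian_expectation(diff), 3))
    # The two curves stay within a couple of percentage points of each other
    # over this range, which is why I doubt the choice matters much in practice.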

Jon Dokter

The Prediction Tracker tracks the accuracy of various NCAA basketball rating systems both for won-loss and against the Las Vegas line.  The highest rated system as of the end of the 2010-2011 season belonged to Jon Dokter.  Dokter's ratings are intended to be predictive (he sells wagering advice for $10.99/week) but are not well-explained.  This page provides a general overview of his methods, but there is not enough detail to replicate his rating system for testing.

Bethel Rank

Roy Bethel proposed a ranking system based upon "Maximum Likelihood Estimation" specifically to address the problem of sports with unequal strength of schedules, i.e., where teams do not play a complete round-robin.  (His paper is available in the papers archive.) While Bethel's rating system appears interesting, it cannot handle winless or lossless teams.  Since both happen with some regularity in college basketball (and are certain to occur for a significant portion of the season) using this rating is problematic.

NCAA Tournament-Specific Ratings

A number of people have created rating systems specific to predicting the NCAA tournament, e.g., Bradley West and Nate Silver.  Most of these rely on seeding, human polls, or other information that exists only for the tournament, which makes them unsuitable for predicting regular-season games.

Other Systems

I'm interested in any pointers to other rating systems that I should investigate, particularly if they have a fundamentally different approach than the ones I've covered already.  Send me an email (srt19170@gmail.com) or leave a comment.

Summary of Results

In total, I tested about 110 algorithm variants.  (Some multiple times as I uncovered errors in my code!)  The following table summarizes the best performances for each algorithm:

  Predictor            % Correct    MOV Error
  Naïve                  50.0%        14.50
  1-Bit                  62.6%        14.17
  Random Walkers         71.0%        11.72
  RPI                    71.2%        11.62
  ELO                    71.8%        11.59
  KRACH                  71.5%        11.50
  Colley                 71.8%        11.33
  Wilson                 71.9%        11.32
  ISR                    71.9%        11.32
  Improved RPI           72.1%        11.30
  TrueSkill              72.8%        11.09

There are a couple of interesting points to be gathered from this.

First, the best performer (TrueSkill) represents about a 15% improvement over always picking the home team, but only about a 2% improvement over the standard RPI.  On MOV Error, it fares somewhat better, being a 22% improvement over picking the home team, and about 5% over RPI.  But given the complexity of TrueSkill compared to the 1-Bit algorithm (or even standard RPI), that isn't as much improvement as we might have hoped to see.

Second, we note that the performance of our implementation of ELO is very close to the tracked performance of Sagarin ELO at Prediction Tracker.  That gives us some confidence in these results.  (On the MOV Error side, there seems to be about a 1.5 point bias in MOV Error between my measurements and those on Prediction Tracker.  I don't know why that would be.)

Third, if we compare TrueSkill to the rating systems tracked at Prediction Tracker, we see it would have beaten all systems except Jon Dokter's -- and his system makes use of MOV.  So even without making use of MOV, we have a system that is competitive with the best systems available.

Friday, May 20, 2011

KRACH

College hockey seems to be a hotbed for rating systems -- possibly because (like college football) it has many teams and few games.  One of the most popular rating systems for college hockey is Ken's Ratings for American College Hockey (KRACH).  KRACH is based upon Bradley-Terry rankings, a probabilistic model for pairwise comparison -- that is, it gives a method for ranking a set of entities using only pairwise comparisons between them.  Since games are "pairwise comparisons" between teams, there's an obvious application to ranking sports teams.

Unlike some of the rating systems we've looked at, KRACH gives the ratings a specific meaning: an odds scale.  This means that if Team A has a rating of 200 and Team B has a rating of 100, then Team A is a 2:1 favorite to beat Team B.  If Team A and Team B play an infinitely long series of games, then Team A is expected to win 2/3 of those games and Team B to win 1/3.  If Team A and Team B play just a single game, then Team A is expected to win 0.67 of that game and Team B is expected to win 0.33 of that game.

(I'm reminded of the old joke that a statistician is a person who can put his head in an oven, his feet in a freezer, and on average be quite comfortable.)

So if we're given a team and its schedule (along with the KRACH ratings), we can calculate the expected number of wins by summing up the expected wins for each game.  For example, if Team X, with a KRACH rating of 100, plays Teams A, B, C, and D with ratings of (respectively) 150, 100, 75 and 50, then the expected number of wins is:
X vs A:  0.40 
X vs B:  0.50 
X vs C:  0.57 
X vs D:  0.67 
        ------
 Total:  2.14 
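
That calculation is a one-liner in code; here's a quick sketch that reproduces the example:

    ratings = {"X": 100.0, "A": 150.0, "B": 100.0, "C": 75.0, "D": 50.0}
    schedule = ["A", "B", "C", "D"]        # team X's opponents

    expected_wins = sum(ratings["X"] / (ratings["X"] + ratings[opp])
                        for opp in schedule)
    print(round(expected_wins, 2))         # 2.14, matching the hand calculation
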
What the KRACH system does is try to pick ratings for all the teams (simultaneously) so that the expected number of wins for each team matches the actual number of wins.  As you might imagine, this requires an iterative approach.  The update equation for each iteration is given by this formula:
K_i = V_i / [ Sum over j of N_ij/(K_i + K_j) ]
where K_i is the KRACH rating for team i, V_i is the number of wins for team i, and N_ij is the number of times that teams i and j have played each other.  To avoid problems with undefeated and winless teams, we also need to add in a "tie" game with a fictitious team with an arbitrary rating.
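
Here's a minimal sketch of the whole fit on some made-up results.  The fictitious opponent's fixed rating of 100 and the half-win credit for the "tie" are my particular choices; the fictitious game is just there to anchor undefeated and winless teams:

    from collections import defaultdict

    # Made-up (winner, loser) results -- illustrative only.  Note that A is
    # undefeated and D is winless, which the fictitious tie takes care of.
    games = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("A", "D"), ("B", "D")]

    teams = sorted({t for g in games for t in g})
    FICTITIOUS_RATING = 100.0              # fixed rating of the fictitious opponent

    wins = defaultdict(float)
    n_games = defaultdict(lambda: defaultdict(int))   # N_ij: games between i and j
    for w, l in games:
        wins[w] += 1.0
        n_games[w][l] += 1
        n_games[l][w] += 1
    for t in teams:
        wins[t] += 0.5                     # the "tie" against the fictitious team

    krach = {t: 100.0 for t in teams}
    for _ in range(200):                   # K_i = V_i / [ Sum over j of N_ij/(K_i+K_j) ]
        new = {}
        for i in teams:
            denom = sum(n / (krach[i] + krach[j]) for j, n in n_games[i].items())
            denom += 1.0 / (krach[i] + FICTITIOUS_RATING)   # the fictitious game
            new[i] = wins[i] / denom
        krach = new

    for t in teams:
        print(t, round(krach[t], 1))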

There's one other adjustment necessary for using the KRACH ratings for prediction.  Since the KRACH ratings represent odds, when we compare two ratings to predict an outcome, we need to use a ratio of the ratings:
K_i / (K_i + K_j)
rather than use Ki and Kj directly in our linear regression.
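
So the input handed to the regression for a game between teams i and j is just this single number (a tiny sketch; the regression itself is whatever you're already using):

    def krach_feature(k_i, k_j):
        # probability-style input for the regression, instead of the raw ratings
        return k_i / (k_i + k_j)

    print(round(krach_feature(200.0, 100.0), 3))   # 0.667 -- the 2:1 favorite from earlier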

Implementing the KRACH rating and testing with our usual methodology yields this performance:

  Predictor            % Correct    MOV Error
  Wilson                 77.7%        10.33
  KRACH                  71.5%        11.50

Once again, this rating does not provide an improvement on our best rating so far.  Because the KRACH rating is an odds rating, we might expect the MOV Error to be higher, but even on predicting outcome, KRACH is not an improvement on Wilson.