
Saturday, August 27, 2011

Correlation Between Predictors

Danny Tarlow was kind enough to give me some comments on a paper I'm writing about the work reported in this blog, and one of his suggestions was to look at whether the predictors I've tested are picking up on the same signals.  This is a significant question: if the predictors are picking up on different signals, they can be combined into an ensemble predictor that performs better than any of the individual predictors.  (Dietterich 2000) showed that
"...a necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is if the classifiers are accurate and diverse."
A classifier is accurate if it is better than random guessing.  Two predictors are diverse if they make different errors.  Intuitively, an ensemble will perform better than its base predictors if the errors of the base predictors are uncorrelated and tend to cancel each other out.  Our predictors are all obviously accurate, but are they diverse?

To test this we can measure the correlation between the errors made by the different predictors.  If they are uncorrelated, then it is likely that we can construct an ensemble with improved performance.  I don't have the time and energy to test all combinations of the predictors I've implemented, but here are the correlations between the top two won-loss based predictors (Wilson, iRPI) and the top two MOV-based predictors (TrueSkill+MOV, Govan):


                   Wilson    iRPI    TrueSkill+MOV
  iRPI              0.99
  TrueSkill+MOV     0.93     0.93
  Govan             0.95     0.95        0.98

Not surprisingly, the highest correlations are between the two won-loss predictors and between the two MOV-based predictors.  But all of the predictors are highly correlated.  The least correlated (by a hair) are Wilson and TrueSkill+MOV.  Putting those two predictors into a combined linear regression or an averaging ensemble results in performance worse than TrueSkill+MOV alone.
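To make the comparison concrete, here is a minimal sketch of computing the error correlation between two predictors and then averaging them into a simple ensemble.  The game data and predictions below are made up purely for illustration; only the idea -- correlate the error vectors, then average the predictions -- corresponds to what was actually done.

    import numpy as np

    # Toy data: actual margins of victory for a handful of games, and the
    # margins predicted for those games by two different predictors.
    actual    = np.array([ 7, -3, 12,  5, -8,  2])
    wilson    = np.array([ 5, -1, 10,  8, -4,  4])   # stand-in for Wilson's predictions
    trueskill = np.array([ 9, -5,  8,  3, -6,  1])   # stand-in for TrueSkill+MOV's predictions

    # Error vectors for each predictor.
    err_wilson    = wilson - actual
    err_trueskill = trueskill - actual

    # Pearson correlation between the error vectors: values near 1.0 mean the
    # two predictors are making essentially the same mistakes, so combining
    # them cannot cancel much error out.
    correlation = np.corrcoef(err_wilson, err_trueskill)[0, 1]

    # The simplest ensemble: average the two predictions and measure the error.
    ensemble = (wilson + trueskill) / 2.0
    ensemble_error = np.abs(ensemble - actual).mean()

    print(correlation, ensemble_error)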

On the other hand, perhaps using the best predictors is the wrong course.  Perhaps it's more likely that the worst predictors are uncorrelated with the best predictors, and a combination of one of the worst with one of the best would be fruitful.


                        Wilson    iRPI    TrueSkill+MOV    Govan
  1-Bit                  0.83     0.83        0.80          0.79
  Winning Percentage     0.97     0.98        0.92          0.93

As this shows, even the 1-Bit predictor ("the home team wins by 4.5") is highly correlated with the better predictors, and using just the winning percentage shoots the correlation to 0.92+.  Adding these predictors to an ensemble with the better predictors also results in worse performance.

Of course, it's always possible that some combination of predictors will improve performance.  There's been some interesting work in this area -- see (Caruana 2004) in Papers.  But for right now I don't have the infrastructure to search all the possible combinations.
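As I understand it, the approach in (Caruana 2004) is a greedy forward selection rather than an exhaustive search: repeatedly add (with replacement) whichever predictor most reduces the ensemble's error on a hold-out set.  A minimal sketch of that idea -- the function and data layout are my own, not anything from the paper -- looks like this:

    import numpy as np

    def greedy_ensemble(predictions, actual, rounds=20):
        """Greedy forward ensemble selection with replacement.
        predictions: dict of name -> np.array of predicted MOVs;
        actual: np.array of actual MOVs.  Returns the list of chosen names."""
        chosen = []
        current_sum = np.zeros_like(actual, dtype=float)
        for _ in range(rounds):
            best_name, best_err = None, None
            for name, preds in predictions.items():
                # Error of the ensemble if this predictor were added next.
                candidate = (current_sum + preds) / (len(chosen) + 1)
                err = np.abs(candidate - actual).mean()
                if best_err is None or err < best_err:
                    best_name, best_err = name, err
            chosen.append(best_name)
            current_sum += predictions[best_name]
        return chosen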

Tuesday, May 24, 2011

RPI-Like Summary

I've re-implemented the algorithms that had errors, and the results are below.   But before summarizing the results so far, I'll briefly mention a few other rating systems that I did not implement.

Glicko

The Glicko Rating system was developed by Mark Glickman (chairman of the US Chess Federation (USCF) ratings committee) as an improvement upon ELO.  The Glicko Rating system is very similar to Microsoft's TrueSkill rating system.  Both are based on Bayesian reasoning and provide both a rating and an uncertainty.  For two-player games that don't produce ties (e.g., basketball) the only significant difference is that Glicko uses a logistic distribution of performance ratings rather than the Gaussian distribution used by TrueSkill.  It seems unlikely that this small difference would result in a significant difference in predictive performance.  (The Glicko-2 system adds a "volatility" factor, and this might be worth investigating at some point.)
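To see how small the difference is, here is a sketch of the two win-probability curves -- the logistic expected score from Glickman's published description of Glicko, and the usual Gaussian approximation for TrueSkill.  Treat the constants and parameter values as illustrative rather than definitive:

    import math

    def glicko_expected(r1, r2, rd2):
        """Glicko expected score for a player rated r1 against an opponent
        rated r2 with rating deviation rd2 (a logistic curve)."""
        q = math.log(10) / 400.0
        g = 1.0 / math.sqrt(1.0 + 3.0 * (q * rd2) ** 2 / math.pi ** 2)
        return 1.0 / (1.0 + 10.0 ** (-g * (r1 - r2) / 400.0))

    def trueskill_win_prob(mu1, sigma1, mu2, sigma2, beta=25.0 / 6.0):
        """Approximate TrueSkill win probability (a Gaussian curve): the
        chance that player 1's performance draw exceeds player 2's."""
        denom = math.sqrt(2.0 * beta ** 2 + sigma1 ** 2 + sigma2 ** 2)
        return 0.5 * (1.0 + math.erf((mu1 - mu2) / (denom * math.sqrt(2.0))))

    # Two teams separated by a modest rating gap, in each system's own scale.
    print(glicko_expected(1600, 1500, 50))
    print(trueskill_win_prob(27.0, 3.0, 25.0, 3.0))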

Jon Dokter

The Prediction Tracker tracks the accuracy of various NCAA basketball rating systems both for won-loss and against the Las Vegas line.  The highest rated system as of the end of the 2010-2011 season belonged to Jon Dokter.  Dokter's ratings are intended to be predictive (he sells wagering advice for $10.99/week) but are not well-explained.  This page provides a general overview of his methods, but there is not enough detail to replicate his rating system for testing.

Bethel Rank

Roy Bethel proposed a ranking system based upon "Maximum Likelihood Estimation" specifically to address the problem of sports with unequal strengths of schedule, i.e., where teams do not play a complete round-robin.  (His paper is available in the papers archive.)  While Bethel's rating system appears interesting, it cannot handle winless or undefeated teams.  Since both happen with some regularity in college basketball (and are certain to occur for a significant portion of the season), using this rating is problematic.
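To illustrate the problem -- using a generic Bradley-Terry maximum-likelihood model rather than Bethel's exact formulation -- consider what happens to a winless team:

    import math

    def log_likelihood(strengths, games):
        """Bradley-Terry log-likelihood: P(i beats j) = s_i / (s_i + s_j).
        games is a list of (winner, loser) index pairs."""
        return sum(math.log(strengths[w] / (strengths[w] + strengths[l]))
                   for w, l in games)

    # Three teams; team 2 never wins a game.
    games = [(0, 2), (1, 2), (0, 1), (1, 0)]

    for s2 in (1.0, 0.1, 0.01, 0.001):
        print(s2, log_likelihood([1.0, 1.0, s2], games))

    # The likelihood keeps improving as team 2's strength shrinks toward zero,
    # so the maximum-likelihood "rating" of a winless team is degenerate.

The same thing happens in reverse for an undefeated team, whose strength the likelihood pushes toward infinity.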

NCAA Tournament-Specific Ratings

A number of people have created rating systems specific to predicting the NCAA tournament, e.g., Bradley West and Nate Silver.  Most of these rely on seeding, human polls or other information that exists only for the tournament, which makes them unsuitable for predicting regular-season games.

Other Systems

I'm interested in any pointers to other rating systems that I should investigate, particularly if they have a fundamentally different approach than the ones I've covered already.  Send me an email (srt19170@gmail.com) or leave a comment.

Summary of Results

In total, I tested about 110 algorithm variants.  (Some multiple times as I uncovered errors in my code!)  The following table summarizes the best performances for each algorithm:

  Predictor         % Correct    MOV Error
  Naïve               50.0%        14.50
  1-Bit               62.6%        14.17
  Random Walkers      71.0%        11.72
  RPI                 71.2%        11.62
  ELO                 71.8%        11.59
  KRACH               71.5%        11.50
  Colley              71.8%        11.33
  Wilson              71.9%        11.32
  ISR                 71.9%        11.32
  Improved RPI        72.1%        11.30
  TrueSkill           72.8%        11.09

There are a couple of interesting points to be gathered from this.

First, the best performer (TrueSkill) represents about a 15% improvement over always picking the home team, but only about a 2% improvement over the standard RPI.  On MOV Error, it fares somewhat better, being a 22% improvement over picking the home team, and about 5% over RPI.  But given the complexity of TrueSkill compared to the 1-Bit algorithm (or even standard RPI), that isn't as much improvement as we might have hoped to see.

Second, we note that the performance of our implementation of ELO is very close to the tracked performance of Sagarin ELO at Prediction Tracker.  That gives us some confidence in these results.  (On the MOV Error side, there seems to be about a 1.5 point bias in MOV Error between my measurements and those on Prediction Tracker.  I don't know why that would be.)

Third, if we compare TrueSkill to the rating systems tracked at Prediction Tracker, we see it would have beaten all systems except Jon Dokter's -- and his system makes use of MOV.  So even without making use of MOV, we have a system that is competitive with the best systems available.

Tuesday, May 10, 2011

Wilson Rating

(Note:  My original implementation of the Wilson rating had an error.  See "Whoops!")

The next RPI alternative we'll look at comes from David Wilson.  Wilson's rating system was developed for use with college football, but applies equally well to college basketball.  Wilson describes the system this way:
A team's rating is the average of its opponents' ratings plus 100  for a win or minus 100 for a loss.  Wins that lower the rating and losses that raise the rating count one twentieth as much as the other games.  Post-season games count double.
A comparison of this system to the Iterative Strength Rating shows one major difference.  In the Wilson Rating, wins that lower a team's rating and losses that raise a team's rating are heavily discounted.  Other than this, the only differences between the systems are the initial values and the size of the win/loss bonus.  Both ratings are calculated with an iterative algorithm to home in on the final values.
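Here is a minimal sketch of how I read that description.  The function name, the game-record format, the starting rating, and the fixed iteration count are my own choices, and the post-season doubling is omitted, so this is an illustration of the idea rather than a definitive implementation:

    def wilson_ratings(games, bonus=100.0, discount=0.05, iterations=100):
        """games: list of (winner, loser) team names.  Returns {team: rating}.
        A team's rating is a weighted average over its games of
        (opponent's rating + bonus) for wins and (opponent's rating - bonus)
        for losses; wins that lower the rating and losses that raise it get
        only 'discount' weight.  (Post-season doubling is omitted here.)"""
        teams = {t for game in games for t in game}
        ratings = {t: 0.0 for t in teams}              # arbitrary starting point
        for _ in range(iterations):
            new_ratings = {}
            for team in teams:
                total, weight = 0.0, 0.0
                for winner, loser in games:
                    if team not in (winner, loser):
                        continue
                    won = (team == winner)
                    opponent = loser if won else winner
                    value = ratings[opponent] + (bonus if won else -bonus)
                    # Discount wins that would lower the rating and losses
                    # that would raise it.
                    discounted = (won and value < ratings[team]) or \
                                 (not won and value > ratings[team])
                    w = discount if discounted else 1.0
                    total += w * value
                    weight += w
                new_ratings[team] = total / weight if weight else ratings[team]
            ratings = new_ratings
        return ratings

With bonus=100 and discount=0.05 this matches Wilson's description; setting the discount to 1.0 reduces it to the ISR-style update discussed below.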

Implementing the standard Wilson algorithm and testing it with our usual methodology gives this result:

  Predictor    % Correct    MOV Error
  ISR            77.7%        10.45
  Wilson         77.7%        10.33

This shows a slight improvement in MOV Error over the ISR rating.

There are only a few tweaks we can easily apply to the Wilson Rating.  One is to use our MOV cutoff to exclude close games from the ratings.  This didn't improve ISR, so it likely won't improve Wilson either:

  Predictor          % Correct    MOV Error
  ISR                  77.7%        10.45
  Wilson               77.7%        10.33
  Wilson (mov=1)       77.6%        10.34
  Wilson (mov=4)       76.9%        10.35

And indeed it doesn't.  Another possibility is to tweak the amount by which those games are discounted.  Wilson himself initially set the discount to .01 and later raised it to .05.  If the discount is set to 1.0, Wilson is functionally equivalent to ISR; if it is set to 0.0 it maximally discounts those games, so we can try those two limit cases to see how performance is affected:

  Predictor               % Correct    MOV Error
  ISR                       77.7%        10.45
  Wilson                    77.7%        10.33
  Wilson (weight=1.0)       77.6%        10.35
  Wilson (weight=0.0)       77.6%        10.34


Interestingly, the game discounting doesn't seem to have much of an impact.  There is a performance improvement that maximizes around a discount of 0.15, but the improvement is not substantial.

Except for the game discounting, the primary difference between the Wilson rating and the ISR is the size of the bonus given for a win or a loss.  My implementation of ISR uses a bonus of 4, and the standard Wilson algorithm uses 100, so we can try a range of values to see where performance is maximized:

  Predictor             % Correct    MOV Error
  ISR (bonus=4)           77.7%        10.45
  Wilson (bonus=100)      77.7%        10.33
  Wilson (bonus=25)       77.5%        10.42
  Wilson (bonus=250)      77.7%        10.33

Performance appears to maximize for values of 100 and above.  This is somewhat counter-intuitive -- one would imagine that with an iterative algorithm like ISR/Wilson, the size of the win bonus would only affect the time to reach a steady-state solution, but apparently it can affect (at least marginally) the accuracy of the solution as well.

So the Wilson Rating with the bonus set at 100 and game discounting at 0.15 is our new champion RPI-like rating system.