Wednesday, June 1, 2011

Logistic Regression/Markov Chain (LRMC)

The next MOV-based algorithm we'll look at is the Logistic Regression/Markov Chain (LRMC) model.  LRMC was developed by Joel Sokol and Paul Kvam at Georgia Tech.  It has gotten some press for being the best predictor of NCAA Tournament success over the past few years.  In 2010 it got 51/63 games correct, better than any other predictor.  (ISOV, a version of Iterative Strength Rating that uses MOV, was second.)

Sokol and Kvam have written several papers describing the LRMC model (one is available here).  The basic notion is similar to the Random Walkers model.  Each team has a certain number of votes, and in each iteration we move some of those votes to other teams, based upon that team's past performance.  In the Random Walkers model, we move votes based upon whether a team won or lost a game.  In LRMC, we move votes based upon the margin of victory.

In [Sokol 2006], the authors derive the following function to estimate the probability that Team A will beat Team B on a neutral site given that A beat B by "x" points on A’s court:

RH(x) = exp(0.0292x - 0.6228) / (1 + exp(0.0292x - 0.6228))
(The numeric factors in this equation were derived from analyzing home vs. home matchups using a logistic regression -- the details appear in the paper.)

If we realize that "Team A will beat Team B on a neutral site" means the same thing as "Team A is better than Team B", then RH(x) gives us the probability that A is really better than B.  We then use this probability to move "votes" between the two teams.
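The vote-moving step can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the code from [Sokol 2006]: the function name `lrmc_style_ratings`, the even split of a team's votes across its games, and the toy probabilities are all assumptions made for the example. Each game is given as (home, away, p_home), where p_home is the probability that the home team is really the better team.

```python
def lrmc_style_ratings(teams, games, iterations=200):
    """Move votes between teams until they settle (hypothetical sketch).

    games: list of (home, away, p_home), where p_home is the probability
    that the home team is the better team (for example, a value of RH(x)).
    """
    votes = {t: 1.0 / len(teams) for t in teams}

    # Count games per team so each team can spread its votes evenly
    # across the games it played (an assumption of this sketch).
    n_games = {t: 0 for t in teams}
    for home, away, _ in games:
        n_games[home] += 1
        n_games[away] += 1

    for _ in range(iterations):
        new_votes = {t: 0.0 for t in teams}
        for t in teams:
            if n_games[t] == 0:
                new_votes[t] += votes[t]  # no games played: votes stay put
        for home, away, p_home in games:
            # The votes each team has riding on this game.
            stake = votes[home] / n_games[home] + votes[away] / n_games[away]
            # The home team ends up with a p_home share of those votes,
            # the away team with the rest.
            new_votes[home] += stake * p_home
            new_votes[away] += stake * (1.0 - p_home)
        votes = new_votes
    return votes


# Toy example with made-up probabilities.
teams = ["A", "B", "C"]
games = [("A", "B", 0.7), ("B", "C", 0.6), ("A", "C", 0.8)]
ratings = lrmc_style_ratings(teams, games)
for team in sorted(ratings, key=ratings.get, reverse=True):
    print(team, round(ratings[team], 3))
```

The total number of votes never changes; only the distribution does, and the teams are ranked by how many votes they hold once the distribution stops moving.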

Of course, few NCAA basketball games take place on a neutral court, so we have to adjust our calculation to account for the HCA.  [Sokol 2006] calculates the HCA at 10.5 points (a large value not in line with other analysts; we'll return to this in a moment).  With a slope of 0.0292 per point and an intercept of -0.6228, the function crosses 50% at a home margin of about 21 points, or twice the HCA, so the adjustment for the home team amounts to evaluating the function at the margin plus 10.5.  If A beat B by 15 points at home, then RH(15+10.5) = 0.530, and A gets 53% of the "votes" that ride on this game.
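For readers who want to check the arithmetic, here is a small Python sketch of the calculation. Two things in it are assumptions rather than something taken directly from the paper: the slope of 0.0292 and the choice to shift the home margin by +10.5 before applying the function. Both were chosen because together they reproduce the 0.530 figure for a 15-point home win. The names `rh` and `home_vote_share` are made up for this example.

```python
import math

HCA = 10.5           # home court advantage in points, per [Sokol 2006]
SLOPE = 0.0292       # assumed slope; reproduces the 0.530 example
INTERCEPT = -0.6228  # intercept of the logistic function


def rh(x):
    """Logistic function RH evaluated at x."""
    z = SLOPE * x + INTERCEPT
    return math.exp(z) / (1.0 + math.exp(z))


def home_vote_share(home_margin):
    """Share of the votes given to the home team for a game it won
    (or lost, if negative) by home_margin points.

    Assumption: the margin is shifted by +HCA before applying RH.
    """
    return rh(home_margin + HCA)


for margin in (-10, 0, 15, 40):
    print(margin, round(home_vote_share(margin), 3))
```

Under these assumptions a 15-point home win gives the home team about 53% of the votes, and the share climbs only slowly as the margin grows.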

If we plug RH(x) into our Random Walkers model, we get this performance:

Predictor                    % Correct    MOV Error
TrueSkill + iRPI             72.9%        11.01
LRMC [2006]                  71.3%        11.65

This performance is on par with standard RPI.  One concern with this approach is that even if the home team wins by 40 points, it can only garner about 70% of the "votes" because the logistic function flattens out very slowly.  Most college basketball fans would probably consider a win by 40 points near-certain proof that Team A was better than Team B.  So rather than give the away team a floor of 30%, we can split the remaining 30%, or even assign it all to the home team.  These approaches produce this performance:

Predictor                    % Correct    MOV Error
TrueSkill + iRPI             72.9%        11.01
LRMC [2006]                  71.3%        11.65
LRMC [2006] +15% to home     70.5%        11.62
LRMC [2006] +30% to home     66.8%        12.57

Neither of these proves to be an improvement.

[Sokol 2010] experimented with replacing the "RH" function derived by logistic regression with other models, and found that an empirical Bayes model was better.  (Technically, that makes the name LRMC no longer appropriate.)  Part of the motivation for this change was that the 10.5 point home advantage found in the logistic regression model was considerably different from the estimates of HCA by everyone else.  With the empirical Bayes model, the HCA is determined to be in the range of 2-4 points, in line with other estimates.  The RH function for the new model is:

RH(x) = phi(0.0189x - 0.0756)
(Here phi is the standard normal cumulative distribution function.)

Plugging this into our Random Walkers model gives this performance:

Predictor                    % Correct    MOV Error
TrueSkill + iRPI             72.9%        11.01
LRMC [2006]                  71.3%        11.65
LRMC [2010]                  71.8%        11.40

This is an improvement over the original RH(x) function, but it is still not competitive with our best non-MOV predictor.
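The [Sokol 2010] function is easy to evaluate with nothing but the standard library. The sketch below computes phi from the error function and applies it to the raw argument exactly as the formula is written; it makes no claim about how the HCA adjustment is handled in the paper, and the names `phi` and `rh_2010` are made up for this example.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rh_2010(x):
    """RH(x) = phi(0.0189x - 0.0756), evaluated as written."""
    return phi(0.0189 * x - 0.0756)


for x in (0, 4, 15, 40):
    print(x, round(rh_2010(x), 3))
```

As written, the function crosses 50% at x = 4, since 0.0756 / 0.0189 = 4.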
