Saturday, June 25, 2011

A Lack of Motivation

I am currently on vacation in Turks & Caicos, which turns out to have pretty good Internet (assuming I sit out on the balcony, anyway) but I have to say my motivation to analyze college basketball statistics has taken a serious beating :-).  I actually started to take a look at John Pugh's "ComPughter Rating" system on the flight down to the island, but we hit our approach before I managed to finish.  So I can't promise anything for the next week or so; still you never know.  Figuring out matrix inversions might be just the antidote to a diet of snorkeling and rum :-).

Thursday, June 23, 2011

Another Experiment

I don't recall now what prompted me, but I recently decided to take a look at how the performance of the predictors varies over the course of the season.  Just as I broke the games down by mismatches in this post, I would subdivide the games according to when they occurred in the season: early, middle, or late.

I have about 4000 games in the training data for each season (2009, 2010, 2011) so I actually broke those down into quarters to give me an even 1000 games in each portion of the season.  (I remind the reader that I remove the first 1000 games of the season from the training set because the ratings aren't yet reliable.  So when I talk about the first 1000 games in this posting, those are actually the second thousand games played in that season.)  I then tested the performance of the two best predictors: TrueSkill + mov adjustment, and the Govan ratings.  In this case, I think it's easier to look at a graph of the results rather than a table.  Let's start with the MOV Error by quarters of the season:

MOV Error By Quarter

As you might expect, the error goes down (performance improves) as the season progresses.  On average, the improvement is slightly more than 1 point.  (In 2009, both ratings perform slightly worse in the 4th quarter of the season, but the overall trend seems clear.)  Presumably more games leads to a more accurate rating and hence a more accurate prediction.

Now let's look at a similar chart for the percentage of correct predictions:

% Correct Predictions by Quarter
This chart is a complete curveball.  Performance at correctly picking the winner is by far best in the first quarter, drops dramatically through the third quarter, and then rebounds substantially in the fourth quarter.  This is completely unexpected, particularly in light of the MOV Error, which performs as we'd intuitively expect.

I can hazard some guesses as to what is going on, but none seem particularly explanatory.  More experiments would seem to be in order.

One question is whether the prediction performance is high in the first quarter because (1) the rating is only operating on a few games, or (2) the teams are playing differently (more predictably) in that quarter.  To test that hypothesis, I could restart the rating at the beginning of each quarter.  If (1) is true, then we'd expect to see similar performance in all four quarters.

Another thought is that teams play more non-conference games in the first quarter of the season.  To test this hypothesis, I could do a comparison of performance prediction on conference vs. non-conference games.

In any case, this is a very puzzling (and hopefully eventually enlightening) result!

Wednesday, June 22, 2011

Comment Spam

Dear Comment Spammers:

This blog has six readers who -- like the author -- are pasty, basement-dwelling nerds who care more about the esoterics of Bayesian rating formulas than anything you might be selling.  No matter how minimal an effort is required to post your spam to this blog, I assure you that it has a negative payoff.  You'd be better off finding a "Hello Kitty"/"Justin Bieber" crossover blog -- it will probably have more readers with bigger allowances.


Your Blog Author

An Idle Experiment

I exchanged an email with Dan Baker (aka Spartan Dan) in which we discussed how to measure performance for predictors.  He pointed me towards this posting where Ken Pomeroy looks at performance broken down into "bands" of predicted win percentages.  That's an interesting way of looking at the data, although it's only applicable to rating systems that predict a winning percentage (like Bradley-Terry).  I've done similar sorts of analysis in the past, and this spurred me to repeat some of those experiments.

I generated the usual test data for the TrueSkill rating system, but instead of testing against the whole data set, I filtered down to certain subsets.  (I also changed the parameters of the cross-validation slightly because some of the subsets are relatively small -- that's why the performance on the entire data set doesn't match the previous version.)  In the first experiment, I created subsets where the difference between the TrueSkill ratings for the home team and the away team was either greater than 40, less than -40, or in-between.  (This roughly corresponds to predicting a ten point win by the home team, a ten point win by the away team, or somewhere in-between.)  Here are the results for those subsets (as well as the results for the whole data set for comparison):

  Predictor              % Correct    MOV Error  
  TrueSkill              73.4%        10.96
  Delta TS > 40          95.7%        11.60
  -40 < Delta TS < 40    68.9%        10.85
  Delta TS < -40         79.3%        10.71

There are a number of interesting results here.  First, home teams with a big ratings advantage rarely lose, and we consequently have a very good winning percentage for those games.  However, our MOV Error is at its worst.  This might be because of blowouts, or more volatility in these games.  Second, our prediction rate in "close" games is very poor -- only slightly better than our naive 1-bit predictor (although it would presumably do worse on this subset).  However, our MOV Error is pretty good -- probably because it's an absolute measure, and close games are likely to have a tighter MOV spread.  Finally, we do a lot worse predicting games where the away team has the big advantage.  We (essentially) predict an away-team win in every one of those games, but in reality the home team manages to win more than 20% of them.

Recall that we use linear regression to create an equation that predicts the MOV.  That equation looks something like this:

MOV = Hc * (Home Team's TrueSkill Rating) + Ac * (Away Team's TrueSkill Rating) + C
We can give some intuitive (if not perhaps completely correct) meanings to these coefficients.  C represents the "Home Court Advantage" -- it is the bonus (if positive) the home team gets for playing at home.  The ratio of Hc to Ac represents the relative advantage (or disadvantage) the home team has over the away team.
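As a concrete illustration, here's a minimal sketch of that regression step in Python.  The ratings and MOVs are synthetic (the real training set isn't shown here), with coefficients chosen to roughly resemble the table below:

```python
import numpy as np

# Synthetic data standing in for the real (rating, MOV) training set.
rng = np.random.default_rng(0)
home = rng.normal(25, 8, 500)          # home team TrueSkill ratings
away = rng.normal(25, 8, 500)          # away team TrueSkill ratings
mov = 0.25 * home - 0.24 * away + 3.5 + rng.normal(0, 11, 500)

# Fit MOV = Hc*home + Ac*away + C by ordinary least squares.
X = np.column_stack([home, away, np.ones_like(home)])
Hc, Ac, C = np.linalg.lstsq(X, mov, rcond=None)[0]
print(f"Hc={Hc:.3f}  Ac={Ac:.3f}  C={C:.2f}")
```

The fitted Hc and Ac should come out close to the values used to generate the data, with C recovering the synthetic home court advantage.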

So let's compare the coefficients produced for each of the subsets:

  Predictor              Hc       Ac        C  
  TrueSkill              0.249    -0.238    3.48
  Delta TS > 40          0.249    -0.251    3.77
  -40 < Delta TS < 40    0.250    -0.235    3.29
  Delta TS < -40         0.148    -0.164    -0.373

The first interesting thing that jumps out is the equation for big home underdogs, which is quite different from the other three.  In particular, if we think of C as the Home Court Advantage, it completely disappears when the home team is a big underdog.  Even worse for the home team, this is the subset where their relative advantage (Hc to Ac) is at its worst.  But somehow the home team manages to win 20% of these games! There's probably something interesting to be found in that anomaly...  Another interesting thing to note is that when home teams are big favorites, they actually slightly underplay their strength (relative to the away team) although this is somewhat compensated by a greater HCA.

This is far from a rigorous experiment, but it does illustrate that the overall performance of our model can obscure radically different performance on included subsets.  It also suggests that we might want to build our predictor to use different models (or entirely different approaches) for different subsets of the data.  (But interestingly, the "sum" of the three subset models actually slightly under-performs the single overall model.  So maybe this intuition is wrong!)  In particular, it might be worthwhile to start tracking the performance of our candidate models for "close games", since that's clearly the place where significant performance gains can be had.

Tuesday, June 21, 2011

Glicko

I said earlier that I wouldn't be assessing the Glicko rating system because it was basically the same as TrueSkill, so I didn't expect significantly better performance than TrueSkill.  That hasn't changed, but I decided to take a look at Glicko anyway, mainly because I've become intrigued with the Whole History Rating algorithm from Rémi Coulom and that led (indirectly) back to Glicko. I haven't yet wrapped my head around WHR (and may never do so -- the math is difficult) but I was interested enough to code up the Glicko algorithm.

I won't attempt to explain the Glicko algorithm -- it's described in detail on Mark Glickman's website -- but just jump into the performance.  My implementation recalculates a team's rating after every game -- the same as done with TrueSkill.  There is very little to tweak in Glicko, so here's the basic performance:

  Predictor    % Correct    MOV Error  

As with TrueSkill we can hack an MOV-based factor into Glicko and improve our accuracy somewhat:

  Predictor       % Correct    MOV Error  
  Glicko + MOV    71.8%        11.56

Even so Glicko isn't a competitor for "best of" the rating systems.

Monday, June 20, 2011

MOV-Based Bradley-Terry

Today I'm going to look at adding MOV to Bradley-Terry style ratings.

Recall that in Bradley-Terry style ratings, we're trying to select a rating R for every team such that the odds of Team I winning a matchup with Team J turn out to be equal to:

Oddsi = Ri / (Ri + Rj)
So how do we determine R?  The trick is to look at our set of historical outcomes and find the values for R that maximize the likelihood that what actually happened would have happened.  (This is called maximum likelihood estimation, or MLE.)  Various mathematically gifted folks figured out that you could iteratively determine the proper values for R using this equation:
Ri = Wi * [ Σ 1 / (Ri + Rj) ]^(-1)
where Wi is the number of wins by Team I, and the sum is over all the games played by Team I.

Those of you who have been following along carefully from home will recognize this as the update equation from the KRACH rating, and indeed, KRACH is just Bradley-Terry for (originally) hockey teams.
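For the curious, here's a toy sketch of that iterative update in Python.  The game list and team names are made up for illustration; real KRACH implementations iterate over a full season's schedule:

```python
from collections import defaultdict

# Made-up (winner, loser) game results; every team wins and loses at
# least once, so the MLE is finite.
games = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "B"), ("A", "C")]
teams = {t for g in games for t in g}
R = {t: 100.0 for t in teams}

for _ in range(200):  # iterate the update until it settles
    wins = defaultdict(float)
    denom = defaultdict(float)
    for w, l in games:
        wins[w] += 1.0
        # each game contributes 1/(Ri + Rj) to both teams' denominators
        denom[w] += 1.0 / (R[w] + R[l])
        denom[l] += 1.0 / (R[w] + R[l])
    R = {t: wins[t] / denom[t] for t in teams}

# Predicted odds of A beating B:
odds = R["A"] / (R["A"] + R["B"])
print(sorted(R.items()), round(odds, 3))
```

Team A wins three of its four games here, so it ends up with the highest rating and better-than-even odds against the others.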

Spartan Dan over at "The Only Colors" blog suggests in this posting a rating system that awards a variable amount for each win based upon the MOV:
W(mov) = 1 / (1 + e^(-mov/k))
In this equation, "k" is a scaling constant.  The effect of this equation is to award a 1/2 win for an MOV of zero, about 3/4 of a win for an MOV of "k", and then asymptotically to 1 win for greater MOVs.  The idea here is to limit the impact of "blowouts" on the rating system.
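A quick sketch of this win-credit function in Python, just to show its shape:

```python
import math

def mov_win_credit(mov, k=5.0):
    """Fractional win credit for a game decided by `mov` points."""
    return 1.0 / (1.0 + math.exp(-mov / k))

print(mov_win_credit(0))             # 0.5 -- a tie is half a win
print(round(mov_win_credit(5), 3))   # 0.731 -- an MOV of k is ~3/4 of a win
print(round(mov_win_credit(30), 3))  # 0.998 -- blowouts asymptote toward 1
```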

It's fairly easy to plug this equation for W into our implementation of KRACH.  With k=5 (as suggested by Spartan Dan), it gives this performance:

  Predictor            % Correct    MOV Error  
  Govan (best)         73.5%        10.80
  KRACH (original)     71.5%        11.50
  KRACH + MOV (k=5)    72.0%        11.36
  KRACH + MOV (k=10)   72.2%        11.34

As you can see, it provides a small improvement over KRACH without MOV.  Some experiments show that performance is maximized for k=10, but not by a large amount over k=5.

As I noted in the original posting on KRACH, since the Bradley-Terry method is aimed at producing odds, it isn't particularly suited for predicting MOV.  (Although it's reasonable to think that a team with greater odds of victory is likely to win by more points.)  Even so, it's a little disappointing that adding MOV doesn't provide more improvement in the Bradley-Terry model.

Thursday, June 16, 2011

PageRank Rating

After my post on the Govan rating, I exchanged a few emails with Anjela Govan.  She is (sadly :-) no longer working on sports ratings / predictions, but she reminded me that she'd also worked on adapting the Google PageRank algorithm to sports ratings.  I actually already had a version of her paper, and it was on the queue to be evaluated, but this prompted me to look at it next.

She also gave me a pointer to her thesis, which is online here.  I haven't yet had a chance to read it, but judging by the contents it has a lot of interesting material.  It includes Matlab code as well, for those of you who are Matlabbers.  (My own thesis is much less interesting :-)

As most people are probably aware, the PageRank algorithm was developed by Larry Page and Sergey Brin as a method for making enormous amounts of money.  It also, coincidentally, turned out to be a great way to determine which search results are most relevant.  The basic idea is that every link to a page acts as a "vote" for the relevancy of that page.  The value of that vote depends upon the relevancy of the voting page, and so on ad infinitum.  As you might guess if you have been following this blog, this is essentially an iterative algorithm, and in fact ends up in a formulation very similar to the matrix computation in LRMC.

In Govan's application of PageRank to rating sports teams, we construct an NxN matrix "A", where N is equal to the number of teams.  When Team J beats Team I by MOV points, we set A[I,J]=MOV.  This is a "vote" from Team I to Team J -- and the strength of this vote is proportional to the margin of victory.

Next we adjust this matrix in several ways.  First, we scale each row so that it sums to 1.  Second, we have to fix undefeated teams.  Since they haven't lost a game, they aren't voting for anyone else -- their row in A is all zeros.  This causes a problem when trying to solve for the ratings (the matrix is no longer stochastic).  There are many possible solutions, but the one Govan chooses is to set every value in an undefeated team's row to 1/N -- in other words, an undefeated team splits its vote equally amongst all the teams.  Finally, we mix this matrix A with a "personalization" matrix V.  Again, this personalization matrix can be anything at all, but Govan chooses to use a simple matrix with every value set to 1/N.  The mix is done like this:
G = 0.85 * A + 0.15 * V

The constant 0.85 is called alpha, and is another arbitrary value.  Having done all that, we now solve for the eigenvector of G using the Power Method (as we did with LRMC), iteratively solving the equation:

π(k+1) = π(k) * G
The resulting vector π represents the team ratings.
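Here's a small sketch of the whole construction in Python (numpy), using a made-up three-team season; the alpha = 0.85 and 1/N choices follow the description above:

```python
import numpy as np

N = 3  # teams 0, 1, 2
A = np.zeros((N, N))

def add_game(loser, winner, mov):
    A[loser, winner] += mov   # loser "votes" for the winner, weighted by MOV

add_game(1, 0, 12)   # team 0 beat team 1 by 12
add_game(2, 0, 5)    # team 0 beat team 2 by 5
add_game(2, 1, 3)    # team 1 beat team 2 by 3

# Row-normalize; an all-zero row (an undefeated team) splits its vote 1/N.
for i in range(N):
    s = A[i].sum()
    A[i] = A[i] / s if s > 0 else np.full(N, 1.0 / N)

V = np.full((N, N), 1.0 / N)        # uniform personalization matrix
G = 0.85 * A + 0.15 * V

pi = np.full(N, 1.0 / N)            # initial rating guess
for _ in range(50):                 # power method: pi <- pi * G
    pi = pi @ G
print(pi)
```

Undefeated team 0 collects the strongest votes, so it ends up with the highest rating in the resulting vector.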

So we go off and merrily code this up and test it, and we find this performance:

  Predictor                % Correct    MOV Error  
  TrueSkill + iRPI         72.9%        11.01
  Govan (best)             73.5%        10.80
  PageRank (alpha=0.85)    66.4%        12.86

Sadly, not very impressive performance.  There are a number of ways to tweak the PageRank algorithm. The most interesting turns out to be the factor "alpha" that controls the mix of the matrix A (based on game results) and matrix V (all teams ranked equally).  The best-performing mix turns out to be when alpha is around 0.10:

  Predictor                % Correct    MOV Error  
  TrueSkill + iRPI         72.9%        11.01
  Govan (best)             73.5%        10.80
  PageRank (alpha=0.85)    66.4%        12.86
  PageRank (alpha=0.10)    70.8%        11.79

PageRank still isn't competitive with the best of breed of our algorithms, but it does improve considerably with this tweak.  What's fascinating is that this balance would seem to almost completely discount the game results.  I have no intuition why this sort of lopsided discounting of the game results should perform so well.  It would be interesting to explore whether a similar approach could improve the performance of LRMC (which has a very similar formulation to PageRank).

Other tweaks that I have tried (primarily tweaking the correction for undefeated teams) did not improve performance.  Several variants suggest themselves to me, so I may try a few other experiments as time permits.

Wednesday, June 15, 2011

An Experiment with TrueSkill

At the moment I'm looking at the TrueSkill rating system and thinking about how to incorporate margin of victory (MOV).  (In particular, I'm watching (and thinking about) Tom Minka's lectures from the Machine Learning Summer School 2009.  These lectures are a great resource, by the way, and well worth browsing.)  As the best of the ratings that don't use MOV, it seems reasonable that TrueSkill might be even better if it incorporated MOV.    I haven't yet formulated a mathematically sound way to add MOV to TrueSkill (I'm open to pointers) but that of course has not kept me from experimenting.

Recall that in TrueSkill we update the ratings for the two teams involved in a game by comparing their strengths.  If you win a game over a strong opponent, then that's good evidence that your rating ought to rise and your opponent's fall.  And if you win a game over a weak opponent, then that's not good evidence to change the ratings (because you were expected to win).

So how should we interpret MOV?  One reasonable approach is to say that a win by a large MOV is better evidence that your rating should rise than a win by a small MOV.  (For the moment we ignore "Running Up the Score" and similar problems with MOV.)  Referring back to how TrueSkill works, winning by a large MOV is therefore similar to beating a stronger team.  So perhaps we can incorporate MOV into the TrueSkill algorithm by adjusting our opponent's rating up or down based upon MOV (creating an "effective" rating) and then updating our own rating accordingly.
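To make the "effective rating" idea concrete, here's a sketch using a simple Elo-style update as a stand-in for the full TrueSkill machinery.  The MOV scale `c` and the K-factor here are illustrative guesses, not the values used in my experiments:

```python
def elo_expect(ra, rb):
    # Standard Elo expected score for a team rated ra against one rated rb.
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def update_with_mov(winner, loser, mov, k=32.0, c=4.0):
    # Winning by a large MOV is treated like beating a stronger team:
    # inflate the loser's rating for the winner's update, and deflate
    # the winner's rating for the loser's update.
    eff_loser = loser + c * mov
    eff_winner = winner - c * mov
    new_winner = winner + k * (1.0 - elo_expect(winner, eff_loser))
    new_loser = loser + k * (0.0 - elo_expect(loser, eff_winner))
    return new_winner, new_loser

print(update_with_mov(1500, 1500, mov=1))
print(update_with_mov(1500, 1500, mov=20))  # bigger MOV, bigger rating swing
```

The same trick carries over to TrueSkill by shifting the opponent's mean before running the usual Gaussian update.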

That turns out to be pretty straightforward to add to the algorithm, and gives these results:

  Predictor             % Correct    MOV Error  
  TrueSkill + iRPI      72.9%        11.01
  Govan (best)          73.5%        10.80
  TrueSkill (w/ MOV)    73.3%        10.91

This turns out to work surprisingly well for a completely arbitrary hack.  Some playing around with the bonusing function shows that performance is slightly improved by using MOV*2 as the bonus:

  Predictor               % Correct    MOV Error  
  TrueSkill + iRPI        72.9%        11.01
  Govan (best)            73.5%        10.80
  TrueSkill (w/ MOV)      73.3%        10.91
  TrueSkill (w/ MOV*2)    73.3%        10.88

Other bonusing variants and tweaks don't show any improvement.  This performance is not quite as good as the Govan rating, but certainly shows some promise.

Saturday, June 11, 2011

Govan Ratings

The next system we'll take a look at is based on [Govan 2009].  Govan calls this the "Offense-Defense Rating" but to avoid confusion with my own Offense-Defense Rating, I will refer to this system as the "Govan Rating."

The idea behind the Govan Rating is to generate an Offense and a Defense rating for every team based upon scoring.  Assuming that "Scoreij" represents the points scored by Team j against Team i, then the Offense rating for Team j is given by the formula:
Offensej = (Score1j/Defense1) + ... + (Scorenj/Defensen)
That is, Offense of Team J is the sum over all the teams of the points scored by Team J divided by the Defense rating of the opposing team.  When the opposing team is a good defender (i.e., has a low Defense rating), then Team J gets a relatively higher Offense rating.  When the opposing team is a poor defender (i.e., has a high Defense rating), then Team J gets a relatively lower Offense rating.

The formula for calculating the Defense rating for Team I is the reverse:
Defensei = (Scorei1/Offense1) + ... + (Scorein/Offensen)
Again, you are rewarded for holding down the scoring of teams with potent offenses, and penalized for giving up points to teams with weak offenses.

Since Offense and Defense are interdependent, they must be iteratively approximated.  The method varies slightly from what we've seen before.  In this case, we will estimate the (k)th iteration of Offense using the (k-1)th iteration values for Defense, and then the (k)th iteration of Defense using the (k)th iteration values for Offense that we just calculated (not the (k-1)th values as you might have expected).

Not surprisingly, this can be expressed as matrix calculations.  See the paper for details.
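Here's a minimal sketch of that alternating iteration in Python (numpy), with a made-up score matrix standing in for a real season:

```python
import numpy as np

# S[i, j] is the points team j scored against team i (summed over their
# games; zero on the diagonal).  These numbers are invented.
S = np.array([[ 0., 61., 70.],
              [75.,  0., 64.],
              [68., 72.,  0.]])
n = S.shape[0]

offense = np.ones(n)
defense = np.ones(n)
for _ in range(100):
    # O_j = sum_i S[i, j] / D_i  (kth offense from (k-1)th defense),
    # then D_i = sum_j S[i, j] / O_j  (kth defense from kth offense).
    offense = (S / defense[:, None]).sum(axis=0)
    defense = (S / offense[None, :]).sum(axis=1)

print(np.round(offense, 1), np.round(defense, 1))
```

With a fully connected schedule like this one, the alternating updates settle quickly to a fixed point; one more offense step reproduces the current values.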

One significant problem with this system is that the ratings do not (necessarily) converge if every team doesn't play every other team.  We can solve this by adding a phantom tie game between every pair of teams. 

There are a couple of interesting features of this system.  First, this system doesn't consider Won-Loss records at all; you get rewarded equally regardless of whether you score 80 points in a loss or in a win.  Second, this rating system rewards teams that play more games.  A team that plays 30 games is going to have a higher Offense rating than one that plays only 22.

So how does Govan perform?

  Predictor           % Correct    MOV Error  
  TrueSkill + iRPI    72.9%        11.01
  Govan (baseline)    71.8%        11.44

The baseline performs about as well as the improved LRMC.  One potential problem here is that the linear regression can only combine the Offense and Defense terms additively.  Given how they are defined, we might guess that Offense/Defense would be a better predictive rating than Offense + Defense.  If we define a combined rating Combined = Offense/Defense and add that to the model we get a slight improvement:

  Predictor              % Correct    MOV Error  
  TrueSkill + iRPI       72.9%        11.01
  Govan (baseline)       71.8%        11.44
  Govan (w/ Combined)    71.8%        11.43

For the model above, I had set the phantom tie game to a score of 68-68.  (68 is the average score in NCAA basketball, so I reasoned this would perturb the ratings the least.)  [Govan 2009] uses small numbers for this tie game, so as an experiment, I set the score of the phantom tie game to 1-1, and then to 0.1-0.1:

  Predictor                       % Correct    MOV Error  
  TrueSkill + iRPI                72.9%        11.01
  Govan (baseline)                71.8%        11.44
  Govan (w/ Combined)             71.8%        11.43
  Govan (w/ Combined; 1-1)        72.2%        11.26
  Govan (w/ Combined; 0.1-0.1)    72.6%        11.21

This provides a significant performance improvement, although not yet competitive with the best of breed.

As mentioned above, one of the odd features of this system is that it rewards playing more games.  We can try to address that by dividing the Offense and Defense ratings by the number of games played:
Offensej = (1/Nj) * [(Score1j/Defense1) + ... + (Scorenj/Defensen)]
Defensei = (1/Ni) * [(Scorei1/Offense1) + ... + (Scorein/Offensen)]
This will hopefully remove any reward for playing more games while still permitting the solution to converge:

  Predictor                                   % Correct    MOV Error  
  TrueSkill + iRPI                            72.9%        11.01
  Govan (w/ Combined; 0.1-0.1)                72.6%        11.21
  Govan (w/ Combined; 0.1-0.1, normalized)    73.5%        10.80

And so it does, providing a big boost to performance -- making this version of Govan the best performing rating so far!

Tweaks like eliminating close games or blowouts don't really apply to this rating, because the rating is based solely on points scored, and not the game outcome.  Nonetheless, it's worth a quick experiment to see if ignoring close games has a positive effect:

  Predictor               % Correct    MOV Error  
  TrueSkill + iRPI        72.9%        11.01
  Govan (best)            73.5%        10.80
  Govan (mov-cutoff=2)    73.5%        10.82

Answer: no.

We'll stop for now with tweaking this rating.  One thing we'll want to look at in the future is combining this with other ratings.  Because Govan does not use the Won-Loss record, it may capture information not captured by ratings like TrueSkill that don't consider scores at all, and so properly combining the two might improve the overall performance.  Or not -- that's an experiment for another day!

Thursday, June 9, 2011

LRMC Revisited

As mentioned last time, the poor performance of LRMC in my testing had me concerned that I had misunderstood some essential element of the system, or had a mistake in my implementation.  So I took the time over the last few days to re-read the original papers and create a second implementation.  My original implementation adapted the Random Walkers model, essentially simulating the Markov Chain network iteratively until it settled to steady-state values.  An alternative approach is to solve the Markov Chain problem via linear algebra.  There are a couple of possible approaches, but (simply because it was easiest to implement) I settled on the Power Method.  The basic idea is to calculate the transition values of the Markov Chain and capture them in a large matrix (~350x350, the number of NCAA Div I Basketball teams) which we call T, and then create a 350x1 matrix containing an initial guess at the ratings of all the teams, which we call π.  We then perform the following calculation iteratively:

π(k+1) = π(k) * T
Until π reaches a steady state.  (In this case, 8 iterations gives us accuracy to 4 digits.)  With a 350x350 transition matrix, it is also feasible to solve this analytically as a system of linear equations, but I didn't have a library handy for that computation.
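For reference, the power-method loop itself is only a few lines.  Here's a sketch with a tiny made-up transition matrix standing in for the ~350x350 LRMC matrix T:

```python
import numpy as np

# A 3x3 row-stochastic matrix standing in for the LRMC transition matrix.
T = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

pi = np.full(3, 1.0 / 3.0)        # initial guess at the ratings
for i in range(100):
    nxt = pi @ T
    if np.max(np.abs(nxt - pi)) < 1e-4:   # ~4-digit accuracy
        break
    pi = nxt
print(np.round(nxt, 4), "after", i + 1, "iterations")
```

The real 350-team matrix converges similarly fast; as noted above, about 8 iterations gives 4-digit accuracy.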

Here are the comparative results of the two different implementations:

  Predictor                      % Correct    MOV Error  
  TrueSkill + iRPI               72.9%        11.01
  LRMC (random walkers, 2006)    71.3%        11.65
  LRMC (matrix, 2006)            71.5%        11.65
  LRMC (random walkers, 2010)    71.8%        11.40
  LRMC (matrix, 2010)            72.2%        11.37

(2006 and 2010 here refer to the two different LRMC functions from [Sokol 2006] and [Sokol 2010].)

This shows some small differences between the implementations (probably due to different degrees of accuracy) but no significant errors. So it would appear that despite its good showing in predicting recent NCAA Tournament results, LRMC is not as accurate as TrueSkill, IMOV and some of the other ratings.

Wednesday, June 8, 2011

Iterative Strength of Victory (ISOV)

The next MOV-based rating system we'll look at is "Iterative Strength of Victory" or ISOV.  ISOV is the invention of Paul Kislanko.  ISOV is based upon the Iterative Strength Rating (ISR).  Recall that ISR is defined as an iterative calculation:

Begin with all teams set to an even rating -- 100 in this case. Then, for each game played, give each team the value of their opponent's rating plus or minus a factor for winning or losing the game -- 25 in this case. Total all of a team's results, divide by the number of games played, and that's the end of a cycle. Then use those numbers as the start of the next cycle until you get the same results for each team for two consecutive cycles.
ISOV replaces the 25 point factor for winning (or losing) a game with a factor that is based upon the Margin of Victory (MOV).   The factor is calculated in two steps.  First, we calculate a "strength of victory":

SOVgame = (winning score - losing score) / (winning score + losing score)
This will be a percentage from 0 (in the case of a tie) to 1 (in the case of a shutout), although obviously both of those results are unlikely/impossible in college basketball.  The idea here is to capture the intuitive notion that a 42-34 win is stronger than an 82-74 win, even though in both cases MOV=8.  (I'm not sure I entirely agree with this, by the way.  Is a 42-40 win really twice as good -- or even any better -- than an 82-80 win?)  The next step is scaling the SOV:

SOVscaled = SOVgame / (average SOV for all games)
The notion here is to convert our SOV percentage to a measure that is scaled around the average -- an average win will be worth 1, a less than average win less, and a more than average win more.

Finally, we'll multiply the scaled SOV by the average points per game:

Factor = (average points per game per team) × SOVscaled   [1]

This step seems arbitrary -- there's no particular reason to scale by the average points per game -- particularly since the scaled SOV isn't related to scoring.  Most ELO-type systems (like this one) try to have a Factor of about 1/4 of the base weight (e.g., ISR uses a base weight of 100 and a factor of 25).  However, because this is an iterative system, it isn't particularly sensitive to the size of the factor (small factors just lead to more iterations), and testing ISOV with different values for the left-hand side of this equation shows little impact.  But at any rate this is how Kislanko defines ISOV, so we'll take this as a starting point.
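Putting the pieces together, here's a worked sketch of the bonus factor in Python, using the 2010 season constants reported in the next paragraph (average points per game ≈ 68, average SOV ≈ 0.0866):

```python
AVG_PPG = 68.0    # average points per game per team, 2010 season
AVG_SOV = 0.0866  # average SOV over all games, 2010 season

def isov_factor(winning_score, losing_score):
    sov = (winning_score - losing_score) / (winning_score + losing_score)
    sov_scaled = sov / AVG_SOV          # 1.0 for an average win
    return AVG_PPG * sov_scaled

# A 42-34 win earns a bigger bonus than an 82-74 win, despite MOV=8 in both:
print(round(isov_factor(42, 34), 1))    # 82.7
print(round(isov_factor(82, 74), 1))    # 40.3
```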

Looking at the 2010 basketball season, the "average points per game per team" was about 68 points, and the "average SOV for all games" was 0.0866.  Plugging these numbers into the formula(s) above and testing shows this performance:

  Predictor           % Correct    MOV Error  
  TrueSkill + iRPI    72.9%        11.01

This is pretty good -- the best of the MOV-based ratings so far and competitive with the best of the RPI-like ratings.

There are several variants/experiments we can try.  As mentioned above, changing the left-hand side of [1] has little impact on performance.  We can also try changing the right-hand side to eliminate the second step, where we scale SOV to the average, and instead use just the raw SOV (which is the MOV as a percentage of the total scoring in the game):

Factor = Weight×SOVgame

 And we can choose "Weight" to give us an average Factor close to 25 (for the reasons expressed above):

  Predictor            % Correct    MOV Error  
  TrueSkill + iRPI     72.9%        11.01
  ISOV (no scaling)    72.6%        11.10

This is only slightly worse than using the scaled SOV, so apparently scaling the SOV does not have a huge benefit.  We can go a step further and eliminate SOV entirely, and just use the actual MOV:

Factor = Weight×MOVgame

Again, we choose Weight to make the average Factor about 25.  Over the last three seasons the average MOV was 11.5 points, so a Weight of ~2 will work:

  Predictor           % Correct    MOV Error  
  TrueSkill + iRPI    72.9%        11.01
  ISOV (MOV)          72.7%        11.05

Interestingly, this actually provides a little better performance than SOV.  This implies that (as I speculated above), a two point win may be just as good evidence that one team is better than another regardless of whether it comes in a 42-40 contest or an 82-80 contest.

As long as we're playing with the Factor function, we can try compressing it (to reduce the impact of winning by a greater margin) or expanding it (to increase the impact of winning by a greater margin).  A simple way to do this is to take the square root (or log, or square) of our previous Factor function.  (I used the equation from [1] for this experiment.)

  Predictor           % Correct    MOV Error  
  TrueSkill + iRPI    72.9%        11.01
  ISOV (sqrt)         72.9%        11.19
  ISOV (log)          71.8%        11.27
  ISOV (square)       71.4%        11.48

Compressing/expanding the Factor function doesn't improve our results.  Another interesting experiment is to crossbreed with LRMC.  We can borrow the "RH" functions from LRMC and with a little tweaking use them to generate the bonus Factor for use in ISOV:

  Predictor           % Correct    MOV Error  
  TrueSkill + iRPI    72.9%        11.01
  ISOV (RH [2006])    71.4%        11.37
  ISOV (RH [2010])    71.2%        11.45

Both the "RH" functions perform more poorly than ISOV.  Interestingly, the newer RH function -- which performs better in the LRMC rating -- performs more poorly when plugged into the ISOV rating.  (I'm a little disappointed in the LRMC results, and I will probably revisit them to make sure that my understanding/implementation of LRMC is correct.)

Returning to the version of ISOV that uses the raw MOV as the bonus function (which we'll call IMOV for now), we can speculate on some possible improvements.  The first thought is to treat games below some cutoff (e.g., games won by less than 4 points) as ties, on the notion that such games may not tell us anything about which team is better.  So let's have a look:

  Predictor                % Correct    MOV Error  
  TrueSkill + iRPI         72.9%        11.01
  IMOV (mov-cutoff=2.0)    72.3%        11.07
  IMOV (mov-cutoff=4.0)    72.1%        11.13

Not much to work with there; any cutoff seems to worsen performance.  We can also take a look at the other end of the MOV spectrum, and restrict the benefit of blowout wins.  There are two thoughts here -- first, that winning by 40 is no better indicator than winning by 30, and second, that rating systems should discourage "Running Up The Score" (RUTS).  We don't care about the latter, but perhaps the former makes this a good idea.  We can test this by clamping all MOVs above a threshold to the threshold (i.e., we treat every MOV above +30 as 30).  Let's see:

  Predictor            % Correct    MOV Error  
  TrueSkill + iRPI     72.9%        11.01
  IMOV (clamp = 30)    71.6%        11.59
  IMOV (clamp = 40)    72.7%        11.05

There are very few games with MOV > 40, so there is little change at that level. Any stronger level of clamping reduces performance.
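Both tweaks amount to preprocessing the raw margin before it feeds the rating.  A minimal sketch (the function name and signature are mine):

```python
def adjust_mov(mov, cutoff=0.0, clamp=None):
    """Preprocess a raw margin of victory.

    cutoff: games won by fewer than this many points count as ties.
    clamp:  margins above this value are capped at the clamp value.
    """
    sign = 1 if mov >= 0 else -1
    m = abs(mov)
    if m < cutoff:                      # close game -> treat as a tie
        return 0
    if clamp is not None and m > clamp:
        m = clamp                       # cap blowout margins
    return sign * m
```

With the defaults the margin passes through unchanged, so this slots in front of any MOV-based bonus function without altering the baseline.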

If IMOV ends up being the "best of breed" for MOV-based ratings I may revisit it with some additional tweaks, but for now we'll stop here.  A few interesting results from these experiments:  (1) the iterative framework seems to outperform the "Random Walker" framework regardless of the evaluation function (on second thought, let's wait on that conclusion...), (2) pruning the training data to eliminate close games, reduce the effects of blowouts, etc., almost never improves performance, and (3) scaling MOV in the various ways offered by ISOV does not improve performance.

Tuesday, June 7, 2011

Google's Prediction API

Anyone out there up for trying out Google's Prediction API on NCAA basketball?

From the Maker's Faire

Peter Tu of GE Global Research was at the San Francisco Maker's Faire this weekend and mentions a talk by the guys who did the viral Coke & Mentos video:
Their method follows the 1-10-100 principle. It takes one experiment to spark a concept. By experiment 10 one should have fleshed things out and have defined a direction. By experiment 100 one hopes to have found something that is sublime… The four rules that they espouse are: 1) seek variation – explore the possibilities. 2) be obsessive – keep focused until one finds something special. 3) be stubborn – don’t give up until you work through the problems. 4) set limits and work within them – unconstrained innovation meanders and wanders; only by setting limits does it force one to dive into the depths of a concept. Their thoughts are somewhat reminiscent of “Zen and the Art of Motorcycle Maintenance”, where the key idea is to have an obsession with quality and to always have a good pot of coffee close at hand.
 This reminded me of someone...  Although I doubt that I'm going to end up with a viral YouTube video :-)

Monday, June 6, 2011

A Note on Pace

Just a quick note to explain the slow pace of posting...   Some of the iterative rating systems recalculate after every game, and can take upwards of two hours to prepare the test data set  (which contains ~10K games).  So when I hit an interesting system and am experimenting with different variants, it can take quite a while (what with real life interfering :-) to put together an assessment.  I could post results piecemeal, but I think it makes more sense to wait until I have a complete assessment.

Wednesday, June 1, 2011

Logistic Regression/Markov Chain (LRMC)

The next MOV-based algorithm we'll look at is the Logistic Regression/Markov Chain (LRMC) model.  LRMC was developed by Joel Sokol and Paul Kvam at Georgia Tech.  It has gotten some press for being the best predictor of NCAA Tournament success over the past few years.  In 2010 it got 51/63 games correct, better than any other predictor.  (ISOV, a version of Iterative Strength Rating that uses MOV, was second.)

Sokol and Kvam have written several papers describing the LRMC model (one is available here).  The basic notion is similar to the Random Walkers model.  Each team has a certain number of votes, and in each iteration we move some of those votes to other teams, based upon that team's past performance.  In the Random Walkers model, we move votes based upon whether a team won or lost a game.  In LRMC, we move votes based upon the margin of victory.
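One round of this vote-shuffling might look like the following sketch.  The data structures, the `prob_better` callback, and the even split of a team's votes across its games are my assumptions about the framework, not the exact implementation:

```python
def redistribute_votes(votes, games, prob_better):
    """One iteration of a random-walker style rating.

    votes:       dict mapping team -> current vote count
    games:       list of (team_a, team_b, info) past results
    prob_better: function(info) -> probability that team_a is
                 the better team (win/loss, or RH(mov) for LRMC)
    """
    new_votes = {team: 0.0 for team in votes}
    for team in votes:
        team_games = [g for g in games if team in (g[0], g[1])]
        if not team_games:
            new_votes[team] += votes[team]     # no games: keep all votes
            continue
        share = votes[team] / len(team_games)  # spread votes over games
        for a, b, info in team_games:
            p = prob_better(info)              # P(a is better than b)
            p_self = p if team == a else 1.0 - p
            opponent = b if team == a else a
            # keep p_self of this share, send the rest to the opponent
            new_votes[team] += share * p_self
            new_votes[opponent] += share * (1.0 - p_self)
    return new_votes
```

Iterating this to a fixed point yields the rating; note that the total number of votes is conserved, only their distribution changes.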

In [Sokol 2006], the authors derive the following function to estimate the probability that Team A will beat Team B on a neutral site given that A beat B by "x" points on A’s court:

RH(x) = exp(0.0292x - 0.6228) / (1 + exp(0.0292x - 0.6228))
(The numeric factors in this equation were derived from analyzing home vs. home matchups using a logistic regression -- the details appear in the paper.)
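As a quick sketch, the function can be transcribed directly.  I'm reading the slope as 0.0292, which matches the slow tail-off described below; treat the constants as the paper's fitted values and adjust if your reading differs:

```python
import math

def rh(x, slope=0.0292, intercept=0.6228):
    """P(home team is the better team | it won its home game by x),
    per the logistic-regression fit in [Sokol 2006]."""
    z = slope * x - intercept
    return math.exp(z) / (1.0 + math.exp(z))
```

Note how slowly the curve rises: even a large margin leaves a substantial share of the votes with the loser, which is the source of the "floor" issue discussed below.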

If we realize that "Team A will beat Team B on a neutral site" means the same thing as "Team A is better than Team B", then RH(x) gives us the probability that A is really better than B.  We then use this probability to move "votes" between the two teams.

Of course, few NCAA basketball games take place on a neutral court, so we have to adjust our calculation to account for the HCA.  [Sokol 2006] calculates the HCA at 10.5 points (a large value not in line with other analysts; we'll return to this in a moment), so we have to adjust the margin for the HCA when calculating RH(x) for the home team.  If A beat B by 15 points at home, the HCA-adjusted RH value works out to 0.530, and A gets 53% of the "votes" that ride on this game.

If we plug RH(x) into our Random Walkers model, we get this performance:

  Predictor                   % Correct    MOV Error
  TrueSkill + iRPI            72.9%        11.01
  LRMC [2006]                 71.3%        11.65

This performance is on par with standard RPI.  One concern with this approach is that even if the home team wins by 40 points, it can only garner about 70% of the "votes" because the exponential function tails off very slowly.  Most college basketball fans would probably consider a win by 40 points near-certain proof that Team A was better than Team B.  So rather than give the away team a floor of 30%, we can split the remaining 30%, or even assign it all to the home team.  These approaches produce this performance:

  Predictor                    % Correct    MOV Error
  TrueSkill + iRPI             72.9%        11.01
  LRMC [2006]                  71.3%        11.65
  LRMC [2006] +15% to home     70.5%        11.62
  LRMC [2006] +30% to home     66.8%        12.57

Neither of these proves to be an improvement.

[Sokol 2010] experimented with replacing the "RH" function derived by logistic regression with other models, and found that an empirical Bayes model was better.  (Technically, that makes the name LRMC no longer appropriate.)  Part of the motivation for this change was that the 10.5 point home advantage found in the logistic regression model was considerably different from everyone else's estimates of HCA.  With the empirical Bayes model, the HCA is determined to be in the range of 2-4 points, in line with other estimates.  The RH function for the new model is:

RH(x) = phi(0.0189x - 0.0756), where phi is the standard normal CDF.
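Reading phi as the standard normal CDF, this version can be evaluated with `math.erf`; a minimal sketch:

```python
import math

def rh_2010(x, slope=0.0189, intercept=0.0756):
    """P(home team is the better team | home margin x), per the
    empirical Bayes model in [Sokol 2010]; uses the standard
    normal CDF, computed via erf."""
    z = slope * x - intercept
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

The break-even margin here is 0.0756/0.0189 = 4 points, in line with the 2-4 point HCA estimates mentioned above.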
Plugging this into our Random Walkers model gives this performance:

  Predictor                   % Correct    MOV Error
  TrueSkill + iRPI            72.9%        11.01
  LRMC [2006]                 71.3%        11.65
  LRMC [2010]                 71.8%        11.40

This does prove to be an improvement over the original RH(x) function, but it is still not competitive with our best non-MOV predictor.