Tuesday, April 26, 2011

RPI Recap

I want to take a posting to recap the various approaches we've looked at over the past few weeks.  Although I've been applying them to the RPI algorithm, I believe they'll continue to be generally useful as we look at more complex prediction approaches.

(1) Measure

The initial insight was to have a clear notion of what we're trying to achieve and then select objective metrics to measure progress.  The importance of this was repeatedly evident as we looked at various RPI tweaks.  In many cases, "obvious" improvements to RPI turned out to be no improvement at all.  In another case, we corrected a math error in RPI that turned out not to be an error at all (or at any rate, correcting it didn't improve predictive performance).

(2) Home Court Advantage

There is a proven and significant home court advantage (HCA) in college basketball.  We looked at several ways to account for HCA, but in the end our predictive model captured it better than we could with an a priori solution.  For example, the linear regression for one of our RPI tweaks looked like this:
MOV = 85.414 * Hrpi - 80.508 * Arpi + 1.580
The different coefficients for the home team's RPI (85.414) and the away team's RPI (80.508), as well as the constant bias (1.580) combine to model the home court advantage.
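
To make that concrete, here is a small sketch that simply evaluates the regression above.  The function name and the sample RPI values are made up for illustration; only the coefficients come from the regression.

```python
def predict_mov(home_rpi, away_rpi):
    """Predicted margin of victory (home score minus away score),
    using the regression coefficients quoted above."""
    return 85.414 * home_rpi - 80.508 * away_rpi + 1.580


# Two evenly matched teams (illustrative RPI of 0.500 each).  The unequal
# coefficients plus the constant term still leave the home team favored.
even_matchup = predict_mov(0.500, 0.500)

# The same pair of (illustrative) teams playing at each other's gym.
stronger_at_home = predict_mov(0.600, 0.500)
stronger_on_road = predict_mov(0.500, 0.600)
```

For the evenly matched case the prediction works out to roughly a 4 point edge for the home team, which is the home court advantage as this particular model sees it.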

The lesson here is that the HCA is important to accurate prediction, and we need to ensure that either our model accommodates it naturally (as in the case above) or that we otherwise account for it in our data.  For the latter case we looked at a number of possible tools: weighting the home record differently, applying a point bias to the home team, or splitting a team into a "home team" and "away team" component.
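
As an illustration of the first of those tools, here is a rough sketch of a winning percentage that weights home and road results differently.  The 0.6 and 1.4 weights are placeholder assumptions, not values we tuned, and the function name and game format are hypothetical.

```python
def weighted_win_pct(games, expected_weight=0.6, unexpected_weight=1.4):
    """Winning percentage with home/road results weighted differently.

    games: list of (site, won) pairs, where site is "home", "away" or
    "neutral" and won is True or False.  The weights are assumptions.
    """
    wins = losses = 0.0
    for site, won in games:
        if site == "neutral":
            weight = 1.0
        elif (site == "home") == won:
            # Home win or road loss: the likelier result, so it counts less.
            weight = expected_weight
        else:
            # Road win or home loss: the less likely result, so it counts more.
            weight = unexpected_weight
        if won:
            wins += weight
        else:
            losses += weight
    total = wins + losses
    return wins / total if total else 0.0


# A 3-1 team whose wins include one on the road.
sample = [("home", True), ("home", True), ("away", False), ("away", True)]
adjusted = weighted_win_pct(sample)
```

Setting both weights to 1.0 recovers the ordinary winning percentage, which makes it easy to compare the two versions under the same metric.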

(3) Strength of Opponent

One of the paradoxical challenges of assessing a team's strength is that you need to know the strength of the teams it has played.  It's a classic Catch-22.

One approach to breaking this circle is to base the strength metric on some other measure.  This is what RPI does: RPI tells us how strong a team is, but RPI itself ultimately depends on won-loss records.  By recursively finding the winning percentage for opponents, and opponents' opponents, RPI tries to estimate the true strength of a team.
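
A stripped-down sketch of that recursion might look like the following.  It uses the commonly published 25/50/25 weighting of winning percentage, opponents' winning percentage and opponents' opponents' winning percentage, and it is a simplification: it ignores game location and does not exclude games against the team being rated when computing its opponents' records.  The function names and the (winner, loser) game format are assumptions for illustration.

```python
from collections import defaultdict


def win_pcts(games):
    """Plain winning percentage per team from (winner, loser) pairs."""
    wins, played = defaultdict(int), defaultdict(int)
    for winner, loser in games:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
    return {team: wins[team] / played[team] for team in played}


def opponent_lists(games):
    """Opponents faced by each team, one entry per game played."""
    opps = defaultdict(list)
    for winner, loser in games:
        opps[winner].append(loser)
        opps[loser].append(winner)
    return opps


def mean_over_opponents(values, opps):
    """Average a per-team value over each team's opponents."""
    return {t: sum(values[o] for o in opps[t]) / len(opps[t]) for t in opps}


def simple_rpi(games):
    wp = win_pcts(games)
    opps = opponent_lists(games)
    owp = mean_over_opponents(wp, opps)    # opponents' winning pct
    oowp = mean_over_opponents(owp, opps)  # opponents' opponents' winning pct
    return {t: 0.25 * wp[t] + 0.5 * owp[t] + 0.25 * oowp[t] for t in wp}


ratings = simple_rpi([("A", "B"), ("A", "C"), ("B", "C")])
```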

One useful improvement we found to this technique is to carry out this recursion "infinitely" by using an iterative solution.  We can use another measure as an initial estimate of our strength metric, and then iteratively adjust the metric until we get the values that best match the actual performance so far.
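
One simple way to picture the iterative idea is a fixed-point loop: start from winning percentage, then repeatedly redefine each team's strength as a blend of its own record and the current strength of its opponents, until the numbers stop moving.  This is only a sketch of the general shape, not the exact procedure we used; the 50/50 blend, the tolerance and the function name are all assumptions.

```python
def iterate_strength(games, weight=0.5, tol=1e-9, max_iter=1000):
    """Iterate strength = weight * win pct + (1 - weight) * mean opponent strength.

    games: list of (winner, loser) pairs.  With 0 < weight <= 1 each pass
    shrinks the remaining error, so the loop settles on a fixed point.
    """
    wins, played, opps = {}, {}, {}
    for winner, loser in games:
        for team, opp, won in ((winner, loser, 1), (loser, winner, 0)):
            wins[team] = wins.get(team, 0) + won
            played[team] = played.get(team, 0) + 1
            opps.setdefault(team, []).append(opp)

    wp = {t: wins[t] / played[t] for t in played}
    strength = dict(wp)  # initial estimate: plain winning percentage
    for _ in range(max_iter):
        updated = {
            t: weight * wp[t]
            + (1 - weight) * sum(strength[o] for o in opps[t]) / len(opps[t])
            for t in wp
        }
        change = max(abs(updated[t] - strength[t]) for t in wp)
        strength = updated
        if change < tol:
            break
    return strength


strengths = iterate_strength([("A", "B"), ("A", "C"), ("B", "C")])
```

Because every pass feeds the previous strengths back in, a team's rating ends up reflecting its opponents, their opponents, and so on, without having to decide in advance how many levels deep to go.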

(4) Data Filtering

Our prediction algorithms are based largely on past performance.  One approach we used with some success for RPI was to filter the past performance to eliminate games that reduced our prediction accuracy.  In the case of RPI, we found it was useful to eliminate games where the MOV was 1 point.  In general, we may want to look at a variety of different filtering approaches (e.g., eliminate blow-out games, pre-season tournament games, etc.).
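
The 1-point filter is easy to sketch.  The game tuple layout and function name below are hypothetical; the point is just that filtering happens before the games ever reach the rating calculation.

```python
def drop_margins(games, excluded_margins=(1,)):
    """Remove games whose margin of victory is in excluded_margins.

    games: list of (home_team, away_team, home_score, away_score) tuples
    (a made-up layout for this example).
    """
    return [
        game for game in games
        if abs(game[2] - game[3]) not in excluded_margins
    ]


season = [
    ("A", "B", 70, 69),  # 1-point game: filtered out
    ("C", "A", 81, 66),
    ("B", "C", 58, 59),  # 1-point game: filtered out
    ("A", "C", 74, 70),
]
kept = drop_margins(season)
```

Other filters (blow-outs, pre-season tournaments) would follow the same pattern with a different test inside the list comprehension.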

(5) Modeling Changing Performance

As we build predictive models, we have to consider whether a team's performance changes substantially over the course of a season.  (It clearly does so from season to season.)  If it does, then our predictive accuracy might be better if we discount older games when building our models.  Another intriguing possibility is whether we can identify specific events where a team's performance changed substantially, e.g., when we notice that the minutes played by specific players change significantly due to an injury or other reason.

In the case of RPI, we weren't able to improve our accuracy by weighting recent games, but other methods or different approaches may yet prove valuable.
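
For reference, discounting older games can be as simple as an exponentially decayed winning percentage.  This is one possible weighting scheme rather than the specific one we tested, and the 30-day half-life, the function name and the input format are arbitrary assumptions.

```python
def recency_weighted_win_pct(results, half_life_days=30.0):
    """Winning percentage where a game's weight halves every half_life_days.

    results: list of (days_ago, won) pairs; both the layout and the
    default half-life are illustrative assumptions.
    """
    weighted_wins = total_weight = 0.0
    for days_ago, won in results:
        weight = 0.5 ** (days_ago / half_life_days)
        weighted_wins += weight if won else 0.0
        total_weight += weight
    return weighted_wins / total_weight if total_weight else 0.0


# A win today counts twice as much as a loss from 30 days ago.
recent_form = recency_weighted_win_pct([(0, True), (30, False)])
```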

Next up we'll start taking a look at some alternative methods for rating teams that use only won-loss records.  There are a number of candidates, and we'll be looking to see if any of them provide a significant advantage over our (tweaked) RPI.