This year has seen a number of new predictors join the ranks at The Prediction Tracker. One of those is ERFunction Ratings, the work of Erik Forseth. I was lucky enough to exchange a few emails with Erik last summer when he started work on his predictor, and I thought others might also be interested in his approach, so I asked him to write a guest post for me. (He was also very flattering to me on his website, which might have influenced me. :-) If there's interest, I may reach out to some of the other prominent predictors for similar posts -- let me know in the comments. -- Scott
My name is Erik Forseth, and I've been posting picks for a few weeks now at ERFunction Ratings. I'm a grad student in the physics department at UNC-Chapel Hill, where I work on analytic and computational approximations to Einstein's general theory of relativity. Unfortunately, I haven't thought of any novel ways to adapt techniques from my physics research to the problem of predicting college basketball games; this is just a hobby that I got into as a sports fan and a programmer!
In what follows, I'll attempt to describe my model in its present state. So far it has been admittedly mediocre as a straight-up win/loss predictor, but quite good at predicting the margin of victory (MOV); as of this writing, my model is second overall in MOV error measures on the Prediction Tracker -- or first if you count only predictors that have logged at least 40 games ;). I am currently working on ways to improve the W/L percentage, and I suspect this will be a work-in-progress for some time. I do, however, plan to use the current version to pick games for the rest of this 2014-2015 regular season.
Broadly, the model is just a linear regression formula for MOV, trained on about 16,000 non-neutral court, non-tournament games from the past seven seasons. It takes as input a relatively small set of variables for the home and away teams.
First and foremost is each team's RPI. Scott has written extensively about various RPI implementations here on Net Prophet, and the version I like most is the "infinitely deep" RPI. At the "zeroth" order, each team's RPI is equal to its winning percentage minus 0.5 (so that winning teams start with a positive RPI and vice versa). Then, for as many iterations as are needed for the values to converge, we add to each team's RPI the average of its opponents' RPIs. There are some technicalities, but this is the main idea. Now, as a practical matter, convergence can be a bit tricky. For one thing, my database only includes D1 teams, most of which pad out the early part of their schedule with non-D1 teams for which I have no data. So my "league average" winning percentage is actually a bit higher than 0.5. Fine, so instead of 0.5, you can subtract the "true" league average (which is what I do), but in my experience the values still eventually run away, perhaps due to round-off.
My solution: recall the infinite geometric series
1 + 1/2 + 1/4 + ... = 2.
I simply tack on a factor of 1/2 to each successive correction, and divide the final result by 2, so that every order gets a decreasing coefficient, the sum of all coefficients being equal to 1. The decision to use 1/2 at every order is sort of arbitrary, and we might worry that we're undervaluing the higher corrections. I've played with this a fair bit, and found that using fractions larger than 1/2 doesn't seem to improve the model, and in fact, above a certain point the model actually gets worse (perhaps the numbers get too muddled together). 1/2 may not be ideal, but it seems to work fine, and the formula converges quickly enough.
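The damped iteration above can be sketched in a few lines of Python. This is only my reconstruction of the procedure, not Erik's actual code; the data structures and names are illustrative, and `games` is assumed to be a simple list of results.

```python
# Sketch of the damped "infinitely deep" RPI described above.
# `games` is a list of (team_a, team_b, winner) tuples -- illustrative only.

def damped_rpi(games, damping=0.5, iterations=30):
    # Collect each team's opponents and win/loss record.
    opponents, wins, played = {}, {}, {}
    for a, b, winner in games:
        for t, o in ((a, b), (b, a)):
            opponents.setdefault(t, []).append(o)
            played[t] = played.get(t, 0) + 1
            wins[t] = wins.get(t, 0) + (1 if winner == t else 0)

    # League-average winning percentage (not exactly 0.5 if the
    # schedule data is incomplete, as noted above).
    league_avg = sum(wins.values()) / sum(played.values())

    # Zeroth order: winning percentage minus the league average.
    base = {t: wins[t] / played[t] - league_avg for t in played}
    rpi = dict(base)
    correction = dict(base)

    # Each order adds the average of opponents' previous-order
    # corrections, damped by an extra factor of 1/2 per order.
    for _ in range(iterations):
        correction = {
            t: damping * sum(correction[o] for o in opponents[t]) / len(opponents[t])
            for t in played
        }
        for t in played:
            rpi[t] += correction[t]

    # The coefficients 1 + 1/2 + 1/4 + ... sum to 2, so halve the result.
    return {t: r / 2 for t, r in rpi.items()}
```

With the damping factor strictly less than 1, each successive correction shrinks geometrically, so the sum converges regardless of how deep the iteration goes.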
The next set of independent variables in the regression are the home/away teams' average MOVs. This is pretty self-explanatory, and I don't think I need to elaborate!
The last two independent variables for each team are versions of offensive efficiency (OE) and defensive efficiency (DE), or points scored per possession and points allowed per possession, respectively. These are weighted using an algorithm very much like the one for RPI just described; however, these stats need to be treated a little differently than winning percentage. Coincidentally, Scott recently described the issues with this. It just doesn't make a lot of sense to measure a team's OE, and then weight it by adding the team's opponents' OEs (except maybe in the broad sense that the OE is another measure, like winning percentage, of how "good" a team is).
Instead, we'd like to weight a team's OE using its opponents' DEs, and vice versa. A high OE is good, whereas a high DE is bad. So, let's say a team initially has a high OE, but this number becomes very small (or negative) when we subtract the average of its opponents' DEs. This could be an indication that the team is not so much an offensive juggernaut as it is simply playing teams who give up lots of points. If, on the other hand, the OE is still high after subtracting the opponents' DE average, then this might be an indication that the team scores well in spite of its opponents generally giving up few points. And so, as with RPI, I compute these iteratively, tacking on a coefficient of 1/2 at each step.
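The cross-weighting might look something like the following. Again, this is a sketch under my own assumptions, not Erik's implementation: `raw_oe` and `raw_de` map teams to raw points per possession, `opponents` maps each team to its list of opponents, and all names and numbers are illustrative.

```python
# Sketch of the OE/DE cross-weighting described above: a team's offensive
# efficiency is discounted by its opponents' defensive efficiencies (and
# vice versa), iteratively, with a damping factor of 1/2 per order.

def adjusted_efficiencies(raw_oe, raw_de, opponents, damping=0.5, iterations=30):
    # Work with deviations from the league averages, so corrections
    # are relative to a typical team.
    league_oe = sum(raw_oe.values()) / len(raw_oe)
    league_de = sum(raw_de.values()) / len(raw_de)
    corr_oe = {t: raw_oe[t] - league_oe for t in raw_oe}
    corr_de = {t: raw_de[t] - league_de for t in raw_de}
    adj_oe, adj_de = dict(corr_oe), dict(corr_de)

    for _ in range(iterations):
        avg = lambda d, t: sum(d[o] for o in opponents[t]) / len(opponents[t])
        # A high OE is discounted when opponents' defenses give up a lot of
        # points (positive DE deviation), and symmetrically for DE. The tuple
        # assignment updates both corrections simultaneously.
        corr_oe, corr_de = (
            {t: -damping * avg(corr_de, t) for t in adj_oe},
            {t: -damping * avg(corr_oe, t) for t in adj_de},
        )
        for t in adj_oe:
            adj_oe[t] += corr_oe[t]
            adj_de[t] += corr_de[t]

    # Damped coefficients sum to 2, so halve, then restore the league scale.
    return (
        {t: v / 2 + league_oe for t, v in adj_oe.items()},
        {t: v / 2 + league_de for t, v in adj_de.items()},
    )
```

Note the minus sign in the corrections: an opponent slate of leaky defenses (positive DE deviation) pulls a team's adjusted OE down, and an opponent slate of strong offenses excuses a high DE.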
(Confession: at some point, I actually tried separately weighting OEs by OEs and DEs by DEs, rather than the mixed subtractive procedure described above. In other words, treating them like winning percentage. And wouldn't you know, in the end it actually works really well. The resulting weighted offensive and defensive efficiencies are better predictors of MOV than the ordinary, un-weighted versions. In fact, in cross-validation, that version of the model was very nearly just as good as my current version. I think I sort of understand why this is -- besides whichever method is "formally" correct, several approaches end up working -- but let's not worry about the details for now.)
And that's it! RPI, average MOV, and "RPI-like" weighted offensive/defensive efficiencies are all that go into the model. The intercept, which ostensibly captures the home-court advantage, is around 3.7-3.8. Though I haven't always been systematic about it, I've experimented by including all sorts of basic and derived stats on top of this set, with no or only very marginal improvement (although to be fair, marginal improvements are valuable here). There are many ways to skin a cat, but this seems to be a minimal set of variables which still encode a lot of information about the quality of the teams.
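For concreteness, here is roughly what the regression step looks like with that feature set. The data below are random placeholders purely to exercise the fit; a real run would use the ~16,000 historical games mentioned earlier, and the column layout is my own assumption.

```python
import numpy as np

# Minimal sketch of the regression step: MOV regressed on the home/away
# feature set described above. All data here are fabricated placeholders.

rng = np.random.default_rng(0)
n_games = 1000

# Columns: home RPI, away RPI, home avg MOV, away avg MOV,
#          home adj OE, away adj OE, home adj DE, away adj DE.
X = rng.normal(size=(n_games, 8))

# Fabricated "true" relationship plus noise, just to have something to fit.
true_coefs = np.array([8.0, -8.0, 0.5, -0.5, 10.0, -10.0, -10.0, 10.0])
home_court = 3.75  # intercept chosen near the 3.7-3.8 noted above
mov = X @ true_coefs + home_court + rng.normal(scale=8.0, size=n_games)

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(n_games), X])
coefs, *_ = np.linalg.lstsq(A, mov, rcond=None)
intercept, weights = coefs[0], coefs[1:]
```

Predicting a game is then just a dot product of the fitted weights with that game's feature vector, plus the intercept when the home team is truly at home.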
From now until March, it's time to focus on a tournament predictor....