Thursday, January 29, 2015

Guest Post: Erik Forseth, aka ERFunction Ratings

This year has seen a number of new predictors join the ranks at The Prediction Tracker.  One of those is ERFunction Ratings, the work of Erik Forseth.  I was lucky enough to exchange a few emails with Erik last summer when he started work on his predictor, and I thought others might also be interested in his approach, so I asked him to write a guest post for me.  (He was also very flattering to me on his website, which might have influenced me. :-)  If there's interest, I may reach out to some of the other prominent predictors for similar posts -- let me know in the comments.  -- Scott

My name is Erik Forseth, and I've been posting picks for a few weeks now at ERFunction Ratings. I'm a grad student in the physics department at UNC-Chapel Hill, where I work on analytic and computational approximations to Einstein's general theory of relativity. Unfortunately, I haven't thought of any novel ways to adapt techniques from my physics research to the problem of predicting college basketball games; this is just a hobby that I got into as a sports fan and a programmer!

In what follows, I'll attempt to describe my model in its present state. So far it has been admittedly mediocre as a straight-up win/loss predictor, but quite good at predicting the margin of victory (MOV); as of this writing, my model is second overall in MOV error measures on the Prediction Tracker -- or first if you count only predictors that have logged at least 40 games ;). I am currently working on ways to improve the W/L percentage, and I suspect this will be a work-in-progress for some time. I do, however, plan to use the current version to pick games for the rest of this 2014-2015 regular season.

Broadly, the model is just a linear regression formula for MOV, trained on about 16,000 non-neutral court, non-tournament games from the past seven seasons. It takes as input a relatively small set of variables for the home and away teams. 

First and foremost is each team's RPI. Scott has written extensively about various RPI implementations here on Net Prophet, and the version I like most is the "infinitely deep" RPI. At the "zeroth" order, each team's RPI is equal to its winning percentage minus 0.5 (so that winning teams start with a positive RPI and vice versa). Then, for as many iterations as are needed for the values to converge, we add to each team's RPI the average of its opponents' RPIs. There are some technicalities, but this is the main idea. Now, as a practical matter, convergence can be a bit tricky. For one thing, my database only includes D1 teams, most of whom pad out the early part of their schedule with non-D1 teams for whom I haven't got any data. So my "league average" winning percentage is actually a bit higher than 0.5. Fine, so instead of 0.5, you can subtract the "true" league average (which is what I do), but in my experience the values still eventually run away, perhaps due to round-off. 

My solution: recall the infinite geometric series

1 + 1/2 + 1/4 + ... = 2.

I simply tack on a factor of 1/2 to each successive correction, and divide the final result by 2, so that every order gets a decreasing coefficient, the sum of all coefficients being equal to 1. The decision to use 1/2 at every order is sort of arbitrary, and we might worry that we're undervaluing the higher corrections. I've played with this a fair bit, and found that using fractions larger than 1/2 doesn't seem to improve the model, and in fact, above a certain point the model actually gets worse (perhaps the numbers get too muddled together). 1/2 may not be ideal, but it seems to work fine, and the formula converges quickly enough. 
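
As a rough illustration, here is one way the damped procedure could be sketched in Python. This is one reading of the description above, not the actual ERFunction code: it assumes the k-th order correction is the average of each team's opponents' (k-1)-th order corrections, that the "league average" is the mean of the teams' winning percentages, and that a fixed number of orders stands in for a convergence check. The function name `damped_rpi` and the `(winner, loser)` input format are made up for the example.

```python
from collections import defaultdict

def damped_rpi(games, damping=0.5, n_orders=30):
    """Sketch of a damped 'infinitely deep' RPI.

    games: list of (winner, loser) pairs (hypothetical input format).
    """
    wins = defaultdict(int)
    played = defaultdict(int)
    opponents = defaultdict(list)
    for winner, loser in games:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)

    teams = sorted(played)
    win_pct = {t: wins[t] / played[t] for t in teams}
    # Assumption: "league average" = mean of the teams' winning percentages.
    league_avg = sum(win_pct.values()) / len(teams)

    # Zeroth order: winning percentage relative to the league average.
    correction = {t: win_pct[t] - league_avg for t in teams}
    rating = dict(correction)
    weight = 1.0
    for _ in range(n_orders):
        # Next order: average of the opponents' previous-order corrections.
        correction = {
            t: sum(correction[o] for o in opponents[t]) / len(opponents[t])
            for t in teams
        }
        weight *= damping  # 1/2, 1/4, 1/8, ... when damping is 0.5
        for t in teams:
            rating[t] += weight * correction[t]

    # Normalize so the coefficients sum to 1 (a division by 2 when damping is 0.5).
    return {t: (1 - damping) * rating[t] for t in teams}
```

For a three-team round robin in which A beats B and C, and B beats C, this sketch gives ratings of roughly 0.2, 0 and -0.2 respectively.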

The next set of independent variables in the regression are the home/away teams' average MOVs. This is pretty self-explanatory, and I don't think I need to elaborate!

The last two independent variables for each team are versions of offensive efficiency (OE) and defensive efficiency (DE), or points scored per possession and points allowed per possession, respectively. These are weighted using an algorithm very much like the one for RPI just described; however, these stats need to be treated a little differently than winning percentage. Coincidentally, Scott recently described the issues with this. It just doesn't make a lot of sense to measure a team's OE, and then weight it by adding the team's opponents' OEs (except maybe in the broad sense that the OE is another measure, like winning percentage, of how "good" a team is). 

Instead, we'd like to weight a team's OE using its opponents' DEs, and vice versa. A high OE is good, whereas a high DE is bad. So, let's say a team initially has a high OE, but this number becomes very small (or negative) when we subtract the average of its opponents' DEs. This could be an indication that the team is not so much an offensive juggernaut as it is simply playing teams who give up lots of points. If, on the other hand, the OE is still high after subtracting the opponents' DE average, then this might be an indication that the team scores well in spite of its opponents' generally giving up few points. And so, as with RPI, I compute these iteratively, tacking on a coefficient of 1/2 at each step.
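
The exact update rule isn't spelled out here, so the following Python sketch is only one plausible reading of the mixed subtractive procedure. It assumes that each new order of a team's OE correction is minus the average of its opponents' previous-order DE corrections (and vice versa), damped by 1/2 per order and normalized at the end. The function name `adjust_efficiencies` and the dictionary inputs are hypothetical.

```python
def adjust_efficiencies(raw_oe, raw_de, opponents, damping=0.5, n_orders=30):
    """Sketch of opponent-adjusted offensive/defensive efficiencies.

    raw_oe, raw_de: dicts mapping team -> points scored / allowed per possession.
    opponents: dict mapping team -> list of opponents faced (hypothetical format).
    """
    teams = list(raw_oe)

    def opp_mean(values, team):
        opps = opponents[team]
        return sum(values[o] for o in opps) / len(opps)

    oe_corr, de_corr = dict(raw_oe), dict(raw_de)  # zeroth-order terms
    adj_oe, adj_de = dict(raw_oe), dict(raw_de)
    weight = 1.0
    for _ in range(n_orders):
        # Offense is corrected by opponents' defense and vice versa.
        # Both right-hand sides use the previous order's values.
        oe_corr, de_corr = (
            {t: -opp_mean(de_corr, t) for t in teams},
            {t: -opp_mean(oe_corr, t) for t in teams},
        )
        weight *= damping
        for t in teams:
            adj_oe[t] += weight * oe_corr[t]
            adj_de[t] += weight * de_corr[t]

    scale = 1 - damping  # assumed normalization, mirroring the RPI case
    return (
        {t: scale * adj_oe[t] for t in teams},
        {t: scale * adj_de[t] for t in teams},
    )
```

Under this reading, two teams with the same raw OE end up with different adjusted values: the one whose opponents allow more points per possession is marked down further.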

(Confession: at some point, I actually tried separately weighting OEs by OEs and DEs by DEs, rather than the mixed subtractive procedure described above. In other words, treating them like winning percentage. And wouldn't you know, in the end it actually works really well. The resulting weighted offensive and defensive efficiencies are better predictors of MOV than the ordinary, un-weighted versions. In fact, in cross-validation, that version of the model was very nearly just as good as my current version. I think I sort of understand why this is -- besides whichever method is "formally" correct, several approaches end up working -- but let's not worry about the details for now.)

And that's it! RPI, average MOV, and "RPI-like" weighted offensive/defensive efficiencies are all that go into the model. The intercept, which ostensibly comprises the home-court advantage, is around 3.7 to 3.8 points. Though I haven't always been systematic about it, I've experimented by including all sorts of basic and derived stats on top of this set, with no or only very marginal improvement (although to be fair, marginal improvements are valuable here). There are many ways to skin the cat, but this seems to be a minimal set of variables that still encodes a lot of information about the quality of the teams.
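
To make the regression step concrete, here is a dependency-free Python sketch of an ordinary least-squares fit with an intercept, solved through the normal equations. It is not the actual model code; in practice a library such as NumPy or scikit-learn would be the natural choice. The function names `fit_ols` and `predict_mov` are hypothetical, and the feature ordering suggested in the docstring (home and away RPI, average MOV, adjusted OE, adjusted DE) is an assumption.

```python
def fit_ols(rows, targets):
    """Least-squares fit of targets ~ intercept + coefs . features.

    rows: one feature tuple per game, e.g. (home_rpi, away_rpi, home_mov,
          away_mov, home_oe, away_oe, home_de, away_de) (assumed ordering).
    targets: home margin of victory for each game.
    """
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    p = len(design[0])
    # Augmented normal equations [X^T X | X^T y].
    aug = [
        [sum(d[i] * d[j] for d in design) for j in range(p)]
        + [sum(d[i] * y for d, y in zip(design, targets))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][p] / aug[i][i] for i in range(p)]
    return beta[0], beta[1:]


def predict_mov(intercept, coefs, features):
    """Predicted home margin of victory for one game's feature tuple."""
    return intercept + sum(c * f for c, f in zip(coefs, features))
```

The intercept returned by `fit_ols` plays the role of the home-court term described above. The sketch does not guard against collinear features, which would make the normal equations singular.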

From now until March, it's time to focus on a tournament predictor....


  1. Just a note on home-court advantage (HCA): while I pointed out that the intercept ought to sort of encapsulate it, one can get another idea by looking at the mean home MOV over a large set of games. In my data, this is around 4.69. So, some of the HCA is encoded in the coefficients of the model.

    You might try subtracting the mean from the model in order to get a "neutral court" version, but this still doesn't quite cover it. If you look at, for example, a histogram of MOV data, you'll see that it's weighted more toward the positive side; it's not a symmetric graph that just happens to be shifted by 4.7. When home teams lose, they tend not to lose by as many points.

  2. So your model essentially uses win % (adjusted), MOV (UNadjusted), and off/def efficiencies (adjusted)

    In other words:
    1) points scored >? points allowed (with opponents accounted for)
    2) points scored - points allowed (no opponents)
    3) points scored (with opponents)
    4) points allowed (with opponents)

    #2 is just a combination of 3 and 4, but for some reason without an opponent adjustment (and not per possession). #1 is also just a combination of 3 and 4, and the only new information contained would be teams that have a skill to distribute their points better to win more than expected (which is just randomness in all likelihood).

    So I fail to see what value #1 and 2 provide, which makes the model an opponent-adjusted off/def efficiency model. That's likely good enough to capture most of what you need but I think the other stuff is just superfluous and confusing.

    1. Well, look, I can't argue with you here. I totally agree with your gut assessment of my choices, because I had the same initial impressions. Variables #3 and #4 ought to determine variable #2. On top of that, variable #2 is (inexplicably) not adjusted like the others. But these choices aren't arbitrary or necessarily intuitive. Simply put, I chose statistically significant variables which improved my RMS MOV error in cross-validation -- intuition aside.

      I tried adjusting MOV like the other variables, for example, and it just doesn't work as well as a statistical predictor of future MOV.

      I think one of the things you'll learn, if you actually play with this data, is that intuition isn't always the best guide. Of course I wondered what "new" information was contained every time I added variables, or what information "ought" to be meaningful. But the only real ways to do this are to adjust for collinear features and assess error measures on common training/data sets. I wish this model were "sexier" and more aligned with intuition, but this combination of variables, weighted or unweighted, was not settled on arbitrarily. It is the product of a lot of experimentation.

      At the end of the day, the proof is in the pudding.

  3. Interesting discussion. I totally understand Monte's point, but my own experience agrees with Erik's. I have many stats in my model which seem to be redundant, but which improve performance. It may be that the different transforms expose different aspects of the underlying behavior, in the same way that it might be useful to have both a raw statistic and the log transform of the statistic.

    Monte, I think you should be next up for a guest post! :-)

  4. To be fair, I actually did set out to write a straightforward opponent-adjusted off/def efficiency model, and then found that what we're calling variables #1 and #2 improved performance more than any others that I tried. Nevertheless, I had hang-ups about including them, for the very reasons Monte describes. So the comment is pretty on-the-money.

    Also, I think it's fair to say I misjudged it some at first, and that probably it will be safe for me to assume, from now on, that commenters on Net Prophet have "played with the data." Oops :)