*This year has seen a number of new predictors join the ranks at The Prediction Tracker. One of those is ERFunction Ratings, the work of Erik Forseth. I was lucky enough to exchange a few emails with Erik last summer when he started work on his predictor, and I thought others might also be interested in his approach, so I asked him to write a guest post for me. (He was also very flattering to me on his website, which might have influenced me. :-) If there's interest, I may reach out to some of the other prominent predictors for similar posts -- let me know in the comments. -- Scott*

My name is Erik Forseth, and I've been posting picks for a few weeks now at ERFunction Ratings.
I'm a grad student in the physics department at UNC-Chapel Hill, where I
work on analytic and computational approximations to Einstein's general
theory of relativity. Unfortunately, I haven't thought of any novel
ways to adapt techniques from my physics research to the problem of
predicting college basketball games; this is just a hobby that I got
into as a sports fan and a programmer!

In what
follows, I'll attempt to describe my model in its present state. So far
it has been admittedly mediocre as a straight-up win/loss predictor, but
quite good at predicting the margin of victory (MOV); as of this
writing, my model is second overall in MOV error measures on the Prediction Tracker -- or first
if you count only predictors which have logged at least 40 games ;). I
am currently working on ways to improve the W/L percentage, and I
suspect this will be a work-in-progress for some time. I do, however,
plan to use the current version to pick games for the rest of this
2014-2015 regular season.

Broadly, the model
is just a linear regression formula for MOV, trained on about 16,000
non-neutral court, non-tournament games from the past seven seasons. It
takes as input a relatively small set of variables for the home and away
teams.

First and foremost is each team's RPI.
Scott has written extensively about various RPI implementations here on
Net Prophet, and the version I like most is the "infinitely deep"
RPI. At the "zeroth" order, each team's RPI is equal to its winning
percentage minus 0.5 (so that winning teams start with a positive RPI
and vice versa). Then, for as many iterations as are needed for the
values to converge, we add to each team's RPI the average of its
opponents' RPIs. There are some technicalities, but this is the main idea. Now,
as a practical matter, convergence can be a bit tricky. For one thing,
my database only includes D1 teams, most of whom pad out the early part
of their schedule with non-D1 teams for whom I haven't got any data. So
my "league average" winning percentage is actually a bit higher than
0.5. Fine, so instead of 0.5, you can subtract the "true" league average (which is what I do), but in my experience the values still eventually run away, perhaps due to round-off error.

My solution: recall the infinite geometric series

1 + 1/2 + 1/4 + ... = 2.

I
simply tack on a factor of 1/2 to each successive correction, and
divide the final result by 2, so that every order gets a decreasing
coefficient, the sum of all coefficients being equal to 1. The decision
to use 1/2 at every order is sort of arbitrary, and we might worry that
we're undervaluing the higher corrections. I've played with this a fair
bit, and found that using fractions larger than 1/2 doesn't seem to
improve the model, and in fact, above a certain point the model actually
gets worse (perhaps the numbers get too muddled together). 1/2 may not
be ideal, but it seems to work fine, and the formula converges quickly
enough.
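As a sketch of the damped iteration (my own minimal reconstruction of the description above, glossing over the technicalities; the toy teams and the choice of 20 orders are made up for illustration):

```python
def damped_rpi(base, opponents, n_orders=20):
    """Iterative RPI with geometrically decreasing coefficients.

    base:      dict mapping team -> (winning pct minus league average)
    opponents: dict mapping team -> list of opponents played
    The order-k correction gets coefficient (1/2)^(k+1), so the
    coefficients sum to (very nearly) 1.
    """
    correction = dict(base)   # order-0 term
    total = dict(base)        # running sum, before the final factor of 1/2
    weight = 1.0
    for _ in range(n_orders):
        # Next-order correction: average of opponents' previous correction.
        correction = {t: sum(correction[o] for o in opps) / len(opps)
                      for t, opps in opponents.items()}
        weight *= 0.5
        for t in total:
            total[t] += weight * correction[t]
    # Divide by 2 so the coefficients 1/2, 1/4, 1/8, ... sum to 1.
    return {t: v / 2 for t, v in total.items()}

# Toy round-robin: A beat everyone, C lost to everyone.
base = {"A": 0.5, "B": 0.0, "C": -0.5}
opps = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
rpi = damped_rpi(base, opps)
```

Because every successive correction picks up an extra factor of 1/2, the sum converges even in situations where the undamped iteration would run away.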

The next set of independent variables
in the regression are the home/away teams' average MOVs. This is pretty
self-explanatory, and I don't think I need to elaborate!

The
last two independent variables for each team are versions of offensive
efficiency (OE) and defensive efficiency (DE), or points scored per
possession and points allowed per possession, respectively. These are
weighted using an algorithm very much like the one for RPI just
described; however, these stats need to be treated a little differently
than winning percentage. Coincidentally, Scott recently described the issues with this approach: it just doesn't make a lot of sense to measure a team's OE and then weight it by adding the team's opponents' OEs (except maybe in the broad sense that OE is, like winning percentage, another measure of how "good" a team is).

Instead, we'd like to
weight a team's OE using its opponents' DEs, and vice versa. A high OE
is good, whereas a high DE is bad. So, let's say a team initially has a
high OE, but this number becomes very small (or negative) when we
subtract the average of its opponents DEs. This could be an indication
that the team is not so much an offensive juggernaut as it is simply
playing teams who give up lots of points. If, on the other hand, the OE
is still high after subtracting the opponents' DE average, then this
might be an indication that the team scores well in spite of its
opponents' generally giving up few points. And so, as with RPI, I
compute these iteratively, tacking on a coefficient of 1/2 at each
step.
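In code, the cross-weighting looks roughly like this (a simplified sketch in the same spirit as the RPI iteration, not my production code; it assumes OE and DE have already been centered on their league averages, and the toy numbers are invented):

```python
def adjusted_efficiencies(oe, de, opponents, n_orders=20):
    """Cross-weighted efficiencies: OE corrections come from opponents'
    DE, and DE corrections from opponents' OE, each order damped by 1/2.

    oe, de:    dicts mapping team -> efficiency minus the league average
    opponents: dict mapping team -> list of opponents played
    """
    oe_corr, de_corr = dict(oe), dict(de)
    oe_total, de_total = dict(oe), dict(de)
    weight = 1.0
    for _ in range(n_orders):
        # Subtract the average of opponents' previous-order DE from OE
        # (a high opponent DE inflates raw OE), and vice versa.
        new_oe = {t: -sum(de_corr[o] for o in opps) / len(opps)
                  for t, opps in opponents.items()}
        new_de = {t: -sum(oe_corr[o] for o in opps) / len(opps)
                  for t, opps in opponents.items()}
        oe_corr, de_corr = new_oe, new_de
        weight *= 0.5
        for t in oe_total:
            oe_total[t] += weight * oe_corr[t]
            de_total[t] += weight * de_corr[t]
    return ({t: v / 2 for t, v in oe_total.items()},
            {t: v / 2 for t, v in de_total.items()})

# Toy example: A scores well and defends well; C is the opposite.
oe = {"A": 0.10, "B": 0.0, "C": -0.10}
de = {"A": -0.10, "B": 0.0, "C": 0.10}
opps = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
adj_oe, adj_de = adjusted_efficiencies(oe, de, opps)
```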

(Confession: at some point, I actually
tried separately weighting OEs by OEs and DEs by DEs, rather than the
mixed subtractive procedure described above. In other words, treating
them like winning percentage. And wouldn't you know, in the end it
actually works really well. The resulting weighted offensive and
defensive efficiencies are better predictors of MOV than the ordinary,
un-weighted versions. In fact, in cross-validation, that version of the
model was very nearly just as good as my current version. I think I sort
of understand why this is -- besides whichever method is "formally"
correct, several approaches end up working -- but let's not worry about
the details for now.)

And that's it! RPI,
average MOV, and "RPI-like" weighted offensive/defensive efficiencies
are all that go into the model. The intercept, which ostensibly
comprises the home-court advantage, is around 3.7 to 3.8. Though I
haven't always been systematic about it, I've experimented by including
all sorts of basic and derived stats on top of this set, with no or only
very marginal improvement (although to be fair, marginal improvements
are valuable here). There are many ways to skin a cat, but this seems
to be a minimal set of variables which still encode a lot of information
about the quality of the teams.
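To make the structure of the fit concrete, here's a sketch using ordinary least squares. The feature layout matches the description above (RPI, average MOV, and the weighted efficiencies for each team), but the data is synthetic: the coefficients and the 3.7-point intercept are invented just to exercise the fit, not my actual fitted values.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the ~16,000-game training table. Columns:
# home (RPI, avg MOV, adj OE, adj DE), then the same four for away.
n_games = 16000
X = rng.normal(size=(n_games, 8))

# Made-up "true" coefficients plus a home-court intercept near the
# value quoted above; the target is the home team's margin of victory.
beta_true = np.array([4.0, 0.5, 10.0, -10.0, -4.0, -0.5, -10.0, 10.0])
y = 3.7 + X @ beta_true + rng.normal(scale=8.0, size=n_games)

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(n_games), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, weights = coef[0], coef[1:]
```

With ~16,000 games the fit recovers the intercept to within a tenth of a point or so, and a prediction for a new game is then just `intercept + features @ weights`.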

From now until March, it's time to focus on a tournament predictor....

Just a note on home-court advantage (HCA): while I pointed out that the intercept ought to sort of encapsulate it, one can get another idea by looking at the mean home MOV over a large set of games. In my data, this is around 4.69, which is larger than the 3.7-point intercept. So, some of the HCA is evidently encoded in the other coefficients of the model.

You might try subtracting the mean from the model in order to get a "neutral court" version, but this still doesn't quite cover it. If you look at, for example, a histogram of MOV data, you'll see that it's weighted more toward the positive side; it's not a symmetric graph that just happens to be shifted by 4.7. When home teams lose, they tend not to lose by as many points.

So your model essentially uses win % (adjusted), MOV (UNadjusted), and off/def efficiencies (adjusted).

In other words:

1) points scored >? points allowed (with opponents accounted for)

2) points scored - points allowed (no opponents)

3) points scored (with opponents)

4) points allowed (with opponents)

#2 is just a combination of 3 and 4, but for some reason without an opponent adjustment (and not per possession). #1 is also just a combination of 3 and 4, and the only new information it would add is whether teams have a knack for distributing their points so as to win more than expected (which is in all likelihood just randomness).

So I fail to see what value #1 and #2 provide, which makes the model an opponent-adjusted off/def efficiency model. That's likely good enough to capture most of what you need, but I think the other stuff is just superfluous and confusing.

Well, look, I can't argue with you here. I totally agree with your gut assessment of my choices, because I had the same initial impressions. Variables #3 and #4 ought to determine variable #2. On top of that, variable #2 is (inexplicably) not adjusted like the others. But while these choices may not be intuitive, they aren't arbitrary. Simply put, I chose statistically significant variables which improved my RMS MOV error in cross-validation -- intuition aside.

I tried adjusting MOV like the other variables, for example, and it just doesn't work as well as a statistical predictor of future MOV.

I think one of the things you'll learn, if you actually play with this data, is that intuition isn't always the best guide. Of course I wondered what "new" information was contained every time I added variables, or what information "ought" to be meaningful. But the only real way to decide is to adjust for collinear features and assess error measures on common training and test sets. I wish this model were "sexier" and more aligned with intuition, but this combination of variables, weighted or unweighted, was not settled on arbitrarily. It is the product of a lot of experimentation.

At the end of the day, the proof is in the pudding.

Interesting discussion. I totally understand Monte's point, but my own experience agrees with Erik's. I have many stats in my model which seem redundant but which improve performance. It may be that the different transforms expose different aspects of the underlying behavior, the same way it can be useful to have both a raw statistic and its log transform.

Monte, I think you should be next up for a guest post! :-)

To be fair, I actually did set out to write a straightforward opponent-adjusted off/def efficiency model, and then found that what we're calling variables #1 and #2 improved performance more than any others that I tried. Nevertheless, I had hang-ups about including them, for the very reasons Monte describes. So the comment is pretty on-the-money.

Also, I think it's fair to say I misjudged it some at first, and that it will probably be safe for me to assume, from now on, that commenters on Net Prophet have "played with the data." Oops :)