Wednesday, October 24, 2012

A Closer Look at Early Season Prediction Performance

In the previous post, I looked at predicting early season games using my standard predictive model and found that performance was (understandably) much worse for the early season games where teams had no history of performance than in late season games, where we had the whole season's history to help guide the prediction.  I also looked at using the previous season's games to "prime the pump" and found that improved performance considerably.  In this post, I'll take a closer look at those two cases.

The graph above plots the prediction error for a moving twenty game window throughout the first 1000 games of the season.  (Note #1: The twenty game window is arbitrary -- but the data looks the same for other window sizes.  Note #2: This drops the first game for every team.  The model predicts a visiting team win by 224 points for those games, which greatly distorts the data.)  The green line is a linear regression to the data.  The prediction error starts out high (15+) and drops steadily throughout the 1000 games until at the end, it is close to the performance of the model for the rest of the season.

(There are some interesting aspects to this graph.  For example, much of the error seems to be driven by a few games.  For example, the peak at around 225 games is driven largely by two matchups: Georgetown vs. NC Greensborough and Colorado State vs. SMU.  In both cases, the predictor has an unrealistic estimate of the strength of one or both of the teams.  So it might be that we could greatly improve prediction by identifying those sorts of games and applying some correction.  A possible topic for another day.)

A logarithmic regression suggests that much of the error is eliminated during the first 500 games:

If nothing else, this plot suggests that even with no other measures, our predictions should be pretty good after about the 500th game.  Now let's take a look at a similar plot for predictions where the teams have been primed with the earlier season's games:

Huh!  The use of the previous season's games pins the predictive performance to about 12 RMSE.  It's easy to understand why.  The previous season's performance has decent predictive power -- certainly better than no data at all -- but swamps the current season's performance, preventing the predictor from improving.  Even by the end of the 1000 game period, most teams have only played 5 or 6 games.  The previous season's 30+ games simply out-weigh this season's games too much to let the performance improve.

We can plot the two trendlines to see where it stops paying off to use the primed data predictions:

The cutoff is around 800 games (if we include the first game for every team).  We can combine these two into a predictor that gradually switches over from one predictor to the other over the first 800 games.  That predicts games with about the same error rate as using the previous season's data -- the last 200 games are predicted better, but not enough to substantially move the average.

More to come.

(Incidentally, this is the 100th blog posting!)


  1. Congrats on the 100th.

    I definitely enjoy every single one of them.
    Thank you do much for writing/sharing.


  2. Thanks, Pims! You and my three other dedicated followers make it all worthwhile :-)