Wednesday, November 21, 2012

Continuing on with my efforts to better model early season performance, it occurred to me that it might be good to model a team as the average of several previous years' teams. So we'd predict that Duke 2012-2013 would perform like an average of the 2009-2010, 2010-2011, and 2011-2012 teams.
This is a fairly straightforward experiment in my setup -- I just read in all three previous seasons as if they were one long preseason, and then predict the early season games. Of course, with a twelve-thousand-game "preseason" this takes a while -- particularly when you keep making mistakes at the end of the processing chain and have to start over again :-).
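For the curious, here's a minimal sketch of what that experiment looks like, assuming each season's games live in a simple CSV file. The file names and the load_games helper are hypothetical stand-ins, not my actual processing chain.

```python
# Minimal sketch: read several prior seasons' games as one long "preseason".
import csv

def load_games(path):
    """Read one season's games as a list of dicts (teams, scores, date)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

preseason = []
for path in ("games-2009-2010.csv", "games-2010-2011.csv", "games-2011-2012.csv"):
    preseason.extend(load_games(path))   # roughly 12,000 games of "preseason"

# The combined list is then fed through the normal statistics-building step
# before predicting the first games of 2012-2013.
```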
At any rate, the conclusion is that this approach doesn't work very well. The MOV error over the first thousand games was 12.60 -- worse than just priming with the previous season's data.
Tuesday, November 13, 2012
More on Early Season Performance
Prior to my recent detour, I was looking at predicting early season performance. To recap, experiments showed that predicting early season games using the previous season's data works fairly well for the first 800 or so games of the season. However, "fairly well" in this case means an MOV error of around 12, which is better than predicting with no data, but not close to the error of around 11 we get with our best model for the rest of the season. The issue I want to look at now is whether we can improve that performance.
A reasonable hypothesis is that teams might "regress to the mean" from season to season. That is, the good teams probably won't be as good the next season, and the bad teams probably won't be as bad. This will be wrong for some teams -- there will be above-average teams that get even better, and below-average teams that get even worse -- but overall it might be a reasonable approach.
It isn't immediately clear, though, how to regress the prediction data for teams back to the mean. For something like the RPI, we could calculate the average RPI for the previous season and push team RPIs back towards that number. But for more complicated measures that may not be easy. And even for the RPI, it isn't clear that this simplistic approach would be correct. Because RPI depends upon the strength of your opponents, it might be that a team with an above-average RPI that played a lot of below-average RPI teams would actually increase its RPI, because we would be pushing the RPIs of its opponents up towards the mean.
A more promising (perhaps) approach is to regress the underlying game data rather than trying to regress the derived values like RPI. So we can use the previous season's data, but in each game we'll first reduce the score of the winning team and raise the score of the losing team. This will reduce the value of wins and reduce the cost of losses, which should have the effect of pulling all teams back to the mean.
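As a concrete illustration, the tweak amounts to something like the sketch below; the function name and exactly where it slots into the pipeline are my assumptions.

```python
def shrink_score(winner_pts, loser_pts, pct=0.01):
    """Pull a finished game toward the mean: trim the winner's points and
    pad the loser's points by a small percentage (1% here)."""
    return winner_pts * (1 - pct), loser_pts * (1 + pct)

# A 72-65 win becomes roughly 71.3-65.7, shrinking the margin of victory
# from 7 points to about 5.6 points before the game is fed to the model.
print(shrink_score(72, 65))         # (71.28, 65.65)
print(shrink_score(72, 65, 0.001))  # the 0.1% version tried later
```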
The table below shows the performance when scores were modified by 1%:
| Predictor | % Correct | MOV Error |
|---|---|---|
| Early Season w/ History | 75.5% | 12.18 |
| Early Season w/ Modified History (1%) | 71.7% | 13.49 |
Clearly not an improvement, and also a much bigger effect than I had expected. After all, 1% changes most scores by less than 1 point. (Yes, my predictor is perfectly happy with an 81.7 to 42.3 game score :-) So why does the predicted score change by enough to add 1+ points of error?
Looking at the model produced by the linear regression, this outsized response seems to be caused by a few inputs with large coefficients. For example, the home team's average MOV has a coefficient of about 3000 in the model. So changes like this scoring tweak that affect MOV can have an outsized impact on the model's outputs.
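To see why a large coefficient is a problem, note that a linear model's output moves by coefficient * input change, so even a tiny perturbation of a heavily weighted feature shifts the prediction by whole points. A back-of-the-envelope sketch (the 3000 is just the rough magnitude quoted above; the input is presumably scaled):

```python
coef_home_avg_mov = 3000.0   # rough magnitude of the coefficient noted above
delta_feature = 0.001        # a tiny shift in the (scaled) average-MOV input
print(coef_home_avg_mov * delta_feature)  # => 3.0 points of predicted margin
```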
With that understood, we can try dialing the tweak back by an order of magnitude and modify scores by 0.1%:
| Predictor | % Correct | MOV Error |
|---|---|---|
| Early Season w/ History | 75.5% | 12.18 |
| Early Season w/ Modified History (0.1%) | 74.8% | 12.15 |
This does slightly improve our MOV error. Some experimenting suggests that the 0.1% is about the best we can do with this approach. The gains over just using the straight previous season history are minimal.
Some other possibilities suggest themselves, and I intend to look at them as time permits.
Wednesday, October 24, 2012
A Closer Look at Early Season Prediction Performance
In the previous post, I looked at predicting early season games using my standard predictive model and found that performance was (understandably) much worse for the early season games where teams had no history of performance than in late season games, where we had the whole season's history to help guide the prediction. I also looked at using the previous season's games to "prime the pump" and found that improved performance considerably. In this post, I'll take a closer look at those two cases.
[Graph: MOV prediction error over a moving twenty-game window for the first 1000 games of the season, with a linear trendline]
The graph above plots the prediction error for a moving twenty-game window throughout the first 1000 games of the season. (Note #1: The twenty-game window is arbitrary -- but the data looks the same for other window sizes. Note #2: This drops the first game for every team. The model predicts a visiting team win by 224 points for those games, which greatly distorts the data.) The green line is a linear regression to the data. The prediction error starts out high (15+) and drops steadily throughout the 1000 games until, at the end, it is close to the performance of the model for the rest of the season.
(There are some interesting aspects to this graph. Much of the error seems to be driven by a few games. For example, the peak at around 225 games is driven largely by two matchups: Georgetown vs. NC Greensboro and Colorado State vs. SMU. In both cases, the predictor has an unrealistic estimate of the strength of one or both of the teams. So it might be that we could greatly improve prediction by identifying those sorts of games and applying some correction. A possible topic for another day.)
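For reference, here's roughly how the windowed error and trendline in the graph might be computed, assuming we already have predicted and actual margins of victory (home minus away) for the first 1000 games in chronological order. The function name and the use of numpy are my choices, not necessarily the original code.

```python
import numpy as np

def windowed_rmse(predicted, actual, window=20):
    """RMSE of (predicted - actual) margin over a sliding window of games."""
    errors = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return [float(np.sqrt(np.mean(errors[i:i + window] ** 2)))
            for i in range(len(errors) - window + 1)]

# Linear trendline over the windowed errors (the "green line"):
# rmse = windowed_rmse(predicted_mov, actual_mov)
# slope, intercept = np.polyfit(np.arange(len(rmse)), rmse, 1)
```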
A logarithmic regression suggests that much of the error is eliminated during the first 500 games:
If nothing else, this plot suggests that even with no other measures, our predictions should be pretty good after about the 500th game. Now let's take a look at a similar plot for predictions where the teams have been primed with the earlier season's games:
Huh! The use of the previous season's games pins the predictive performance to about 12 RMSE. It's easy to understand why. The previous season's performance has decent predictive power -- certainly better than no data at all -- but it swamps the current season's performance, preventing the predictor from improving. Even by the end of the 1000-game period, most teams have only played 5 or 6 games. The previous season's 30+ games simply outweigh this season's games too much to let the performance improve.
We can plot the two trendlines to see where it stops paying off to use the primed data predictions:
The cutoff is around 800 games (if we include the first game for every team). We can combine these two into a predictor that gradually switches over from one predictor to the other over the first 800 games. That predicts games with about the same error rate as using the previous season's data -- the last 200 games are predicted better, but not enough to substantially move the average.
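A minimal sketch of that combined predictor, assuming a simple linear hand-off (the 800-game cutoff comes from the discussion above; the ramp shape and everything else here are my assumptions):

```python
def blended_margin(primed_pred, current_pred, games_played, cutoff=800):
    """Hand off from the previous-season-primed prediction to the
    current-season prediction over the first `cutoff` games of the season."""
    w = min(games_played / cutoff, 1.0)   # 0.0 on opening night, 1.0 at the cutoff
    return (1 - w) * primed_pred + w * current_pred

# 300 games into the season, the primed prediction still carries most weight:
print(blended_margin(primed_pred=8.0, current_pred=4.0, games_played=300))  # 6.5
```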
More to come.
(Incidentally, this is the 100th blog posting!)
Thursday, October 18, 2012
Early Season Predictions, Part 2
As mentioned previously, I'm using this time before the college basketball season gets going to think about how to predict early season games. In the early season, we're missing two elements needed for good predictions: (1) a meaningful statistical description of each team, and (2) a model that uses those statistics to predict game outcomes. By the end of the season we have both things -- a good statistical characterization of each team as well as a model that has been trained on the season's outcomes. So how do we replace those two elements in the early season?
Replacing the model turns out to be fairly easy, because the factors that determine whether teams win or lose don't change drastically from season to season. When you try to predict the tournament games at the end of the season, a model trained on the previous season's games does nearly as well as a model trained on the current season's games. Of course, if the current year happens to be the year when the NCAA introduces the three-point shot, all bets are off. Still, in my testing the best performing models are the ones trained on several previous years of data. So in the early season we can expect the model from the previous season to perform well.
(You might argue that early season predictions could be more accurate with a model specifically trained for early season games. There's some merit to this argument and I may look at this in the future.)
Replacing the team data is not so easy. The problem here is that teams have played so few games (none at all for the first game of the season) that we don't have an accurate characterization of their strengths and weaknesses. Even worse, many of the comparative statistics (like RPI) rely on teams having common opponents to determine the relative strength of teams. In the early season, the teams don't "connect up" and, in some cases, play few or no strong opponents. So how bad is it? I tested it on games from the 2011-2012 season:
| Predictor | % Correct | MOV Error |
|---|---|---|
| Late Season Prediction | 72.3% | 11.10 |
| Early Season Prediction | 71.3% | 15.06 |
So, pretty bad. It adds about 4 points of MOV error to our predictions. Since we've been groveling to pick up a tenth of a point here and there, that's a lot!
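For anyone wondering how these two numbers are computed, the sketch below shows one reasonable implementation: margins are home score minus away score, "% Correct" is the fraction of games where the predicted winner actually won, and "MOV Error" is the RMSE of the predicted margin. The exact conventions (sign orientation, handling of a predicted margin of exactly zero) are my assumptions.

```python
import math

def evaluate(predicted_mov, actual_mov):
    """Return (% correct, MOV error) for predicted vs. actual margins,
    where each margin is home score minus away score."""
    n = len(actual_mov)
    correct = sum(1 for p, a in zip(predicted_mov, actual_mov) if (p > 0) == (a > 0))
    rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted_mov, actual_mov)) / n)
    return correct / n, rmse

# evaluate([5.0, -3.0, 10.0], [2.0, 4.0, 12.0]) -> (0.667, about 4.5)
```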
The obvious proxy for the team data is to use the team data from the previous season. Clearly this has problems -- in college basketball team performance is highly variable season to season -- but it's at least worth examining to see whether it does improve performance. In this experiment, I used the entire previous season's data to "prime the pump" for the next season. In effect, I treated the early season games as if they were being played by the previous year's team at the end of the previous season. Here are the results:
| Predictor | % Correct | MOV Error |
|---|---|---|
| Early Season | 71.3% | 15.06 |
| Early Season (w/ previous season) | 75.5% | 12.18 |
A fairly significant improvement. Is there anything we can do to improve the previous season's data as a proxy for this season? We'll investigate some possibilities next time.
Tuesday, September 18, 2012
Awakening From the Long Summer's Sleep
College basketball fans hibernate in the summer.
I'm slowly awakening from my March Madness-induced stupor and starting to prepare for the new season.
One of the first tasks is to look at conference realignments. My predictors don't actually use conferences for anything -- I keep thinking that conference games will have more predictive power than non-conference games, or vice versa, but to date neither has proven to be true. Nonetheless, I keep track of the conference affiliations of teams, so every Fall I have to update that data for the various conference movements.
I took my summary of the changes from "Blogging the Bracket." If there's any interest in the compiled data, please let me know. I've noticed that there was little interest in the data files I provided last year, so I won't bother unless someone expresses some interest.
The next task is to scrape the schedule of games for the season. In past seasons, I've scraped the schedule from Yahoo Sports. Unfortunately, it appears that they have "updated" their interface and broken everything. No scheduled games appear at all, and the majority of the tournament games from last year are missing as well.
Sigh.
Hopefully this is just a temporary situation while Yahoo Sports gets their bugs fixed and the data loaded. Alternate sources of this data are not easy to find. ESPN and CBS are still showing last year's games. The NCAA website started carrying game results (and box scores) last season, but doesn't seem to have the upcoming games.
In the meantime, I've been thinking about how to predict early season games. These games are difficult to predict because we do not have any history of past performance for this year's teams. So we're forced to base our predictions on other data -- or to not predict early season games (which is what I've done in past seasons). Some alternate data is only available for some of the teams (e.g., the AP preseason rankings) or is entirely subjective, which makes it less useful from my viewpoint.
One source of objective data for all the teams is their previous season's performance. One approach to predicting the early season games is to assume that teams will be just as strong this year as they were last year. Another approach might be to assume that teams will migrate towards the mean -- the best teams from last year will get a little weaker and the worst teams will get a little stronger. We could also look at team data such as the number of graduating seniors and use that information to modify the previous year's performance -- e.g., a team that lost most of its starting minutes would get weaker. An intriguing idea is to see if we can predict the change in performance for a team from season to season (based upon what factors?) and then use that to modify the previous year's performance.
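As one hypothetical example of the "modify the previous year's performance" idea, a sketch like the following could shrink last season's rating toward the mean more aggressively for teams that return fewer starting minutes. The weighting scheme and the numbers here are invented purely for illustration.

```python
def adjusted_prior_rating(prior_rating, returning_minutes_pct,
                          mean_rating=0.0, max_weight=0.7):
    """Shrink last season's rating toward the league mean, keeping more of it
    for teams that return more of their starting minutes. Illustrative only."""
    w = max_weight * returning_minutes_pct
    return w * prior_rating + (1 - w) * mean_rating

# A strong team (+9.0 rating) that returns only 30% of its minutes is pulled
# most of the way back toward average:
print(adjusted_prior_rating(9.0, 0.30))   # => 1.89
```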
As time permits, I will set up to test some of these ideas and report my findings.