Most of my (limited) free time is being spent porting Net Prophet to Python, but Eric Forseth sent an email asking about the impending shot clock change that I thought was interesting. For those who might not realize it yet, the NCAA is changing the shot clock this season from 35 seconds to 30 seconds (along with some other rule changes). The big question is whether this will have a significant impact on the accuracy of game predictions.

My guess is that it won't. First of all, there are a limited number of teams (Virginia comes to mind) that intentionally play to use most of the 35-second clock. Second, I think most college coaches teach an offensive strategy that says (roughly) "Play our offense until there are 10 seconds left on the clock, then take the first decent shot opportunity." So while the reduced shot clock might cause that second phase to kick in more quickly, it won't be a fundamental difference in how teams play. Third, the modified rules were used in the 2014-2015 NIT and no real differences were observed. (However, that's a very small sample of 31 games.)

But regardless of my opinion, I think there are a couple of interesting things you can do with your predictor to see how sensitive it would be to this sort of change and/or make it more robust.

First of all, you have to realize that if teams play faster or differently in the coming season, that will show up in the game data. So the real question is whether a model built on previous seasons, which were played under different rules, will work well for games played under the new rules.

One way you can investigate this question is to look at how well your model did predicting past games that are "most like" the games that will be played under the new rules. In this case, we might guess that average time of possession will go down in the new season, so we could look at how our model has worked historically on games where the teams had less-than-average times of possession. (That is, where both teams played at a "faster" pace.) If it does an acceptable job predicting those sorts of games, we have some assurance that it will be okay under the new rules.
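As a rough sketch of this kind of slicing (assuming a pandas DataFrame of historical games with hypothetical `possessions`, `predicted_margin`, and `actual_margin` columns -- the names are made up for illustration):

```python
import pandas as pd

def accuracy_on_slice(games: pd.DataFrame, mask: pd.Series) -> float:
    """Fraction of games in the slice where the predicted winner actually won."""
    sliced = games[mask]
    correct = (sliced["predicted_margin"] > 0) == (sliced["actual_margin"] > 0)
    return float(correct.mean())

# Hypothetical historical games: total possessions plus model and actual margins.
games = pd.DataFrame({
    "possessions": [58, 72, 65, 70, 61, 75],
    "predicted_margin": [3.5, -2.0, 7.0, 1.5, -4.0, 6.0],
    "actual_margin": [5, -6, 3, -2, -1, 10],
})

# Split on pace: games faster than the historical average vs. the rest.
fast = games["possessions"] > games["possessions"].mean()
print("accuracy on fast games:", accuracy_on_slice(games, fast))
print("accuracy on slow games:", accuracy_on_slice(games, ~fast))
```

If the fast-game accuracy is in line with the overall accuracy, that's some evidence the model isn't overly sensitive to pace.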

(As an aside, this sort of analysis is a worthwhile effort even when there isn't an impending rule change. Slicing and dicing the games in different ways and looking at how your model works or doesn't can be very enlightening. You might find out that your model doesn't work on Fridays.)

You can also flip this on its head and try constructing a model using only the fast-paced historical games and see if that model performs better than the more general model. If it does, you might want to consider using the specialized model in the upcoming season (at least provisionally until you see how it actually performs).

If your model is sensitive to game pace (or any other game characteristic), how can you make it more robust? One possibility is to transform the source data (i.e., the statistics used to make predictions) to make it independent of the game pace. One such transform is the standard score.

The idea of a standard score is to transform a statistic from a raw number into the number of standard deviations that number is above (or below) the mean. So instead of "Duke averages 8 offensive rebounds a game" you use something like "Duke's offensive rebounding rate is 0.25 standard deviations above the mean". In theory, this approach permits direct "apples to apples" comparisons between seasons where the distribution has changed.
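The transformation itself is simple (the offensive-rebound numbers below are invented for illustration):

```python
import numpy as np

def standard_scores(values):
    """Convert raw statistics into standard deviations above/below the mean."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical offensive-rebound averages for a group of teams.
oreb = [8.0, 6.0, 10.0, 7.0, 9.0]
print(standard_scores(oreb))
```

A team at the mean maps to 0, and a team one standard deviation above the mean maps to 1.0, regardless of what the raw numbers look like in a given season.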

In practice, there are a couple of potential issues with this approach. First, the standard score transformation assumes that the underlying statistic follows a normal distribution. In my experience this is usually true for NCAA college basketball statistics, but if a statistic is not normally distributed, applying this transform may introduce error. Second, because each season has its own mean and standard deviation, the transformation is not a single fixed mapping of the raw data. If you use a linear machine learning model (e.g., a linear regression), this may impact your results -- possibly for the better, but possibly for the worse. Finally, one must be careful in setting the mean and standard deviation for the transformation not to "leak" future information into historical games. For example, it might seem reasonable to transform all games from the 2014 season using that season's mean and standard deviation. But this would be an error, because at the beginning of the season you don't know the mean and standard deviation for the whole season.
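One leak-free way to set the parameters is to standardize each game using only the games played before it. A sketch with pandas, assuming a chronologically ordered series of some hypothetical per-game statistic:

```python
import pandas as pd

# Hypothetical per-game statistic, in chronological order.
stat = pd.Series([62.0, 70.0, 58.0, 75.0, 66.0, 71.0, 64.0])

# For each game, use the mean/std of strictly earlier games. The shift(1)
# excludes the current game, which would otherwise leak its own value.
prior_mean = stat.expanding().mean().shift(1)
prior_std = stat.expanding().std().shift(1)

z = (stat - prior_mean) / prior_std
print(z)  # the first two games are NaN: not enough history to standardize
```

The early-season NaNs are the honest price of avoiding leakage; you could fall back to the previous season's parameters for those games.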

Another possible transform is to use game ratios rather than raw statistics. In this approach, all the inputs to your model are expressed as ratios between the two competing teams. Instead of "Duke averages 8 offensive rebounds a game" and "Wake Forest averages 6 offensive rebounds a game" you use "Duke rebounds 1.33 times as well as Wake Forest." The ratio is also a non-linear transformation, but it avoids the need to estimate a mean and standard deviation.
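A sketch of the ratio approach, turning two teams' per-game averages into matchup features (the team dictionaries and stat names here are hypothetical):

```python
def ratio_features(team_a: dict, team_b: dict) -> dict:
    """Express each statistic both teams share as the ratio of A's value to B's."""
    return {stat: team_a[stat] / team_b[stat]
            for stat in team_a.keys() & team_b.keys()}

# Hypothetical per-team season averages.
duke = {"oreb": 8.0, "ppg": 80.0, "apg": 15.0}
wake = {"oreb": 6.0, "ppg": 72.0, "apg": 12.0}

print(ratio_features(duke, wake))
```

Because each feature is a ratio between the two teams in that particular game, a league-wide shift in raw pace or scoring largely cancels out.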
