For most statistics, I also produce the average for the team's opponents. So to continue the example above, I produce "Average free throws per game for this team's opponents." I also produce a small number of simple derived statistics, such as "Average Margin of Victory (MOV)", and winning percentages at home and on the road.

When we get to predicting game outcomes, of course we have all of these statistics for both the home and the away team. (And that home/road distinction is important, obviously.) If we use all these base statistics to create a linear regression, we get the following performance:
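As a minimal sketch of what that base predictor looks like, here is an ordinary least squares regression from team statistics to the home team's margin of victory (MOV), scored by the sign of the prediction. The data below is synthetic and the feature count is arbitrary; the real model uses the per-team averages described above for both the home and away teams.

```python
import numpy as np

# Synthetic stand-in for the base statistics (e.g. FT/game,
# opponent FT/game, ...) for 500 games.
rng = np.random.default_rng(0)
n_games, n_feats = 500, 6
X = rng.normal(size=(n_games, n_feats))
true_w = rng.normal(size=n_feats)
mov = X @ true_w + rng.normal(scale=2.0, size=n_games)  # home MOV

# Ordinary least squares with an intercept column.
A = np.hstack([X, np.ones((n_games, 1))])
w, *_ = np.linalg.lstsq(A, mov, rcond=None)

pred = A @ w
pct_correct = np.mean(np.sign(pred) == np.sign(mov))  # winner picked correctly
mov_error = np.mean(np.abs(pred - mov))               # mean absolute MOV error
```

The two numbers computed at the end correspond to the "% Correct" and "MOV Error" columns in the tables below.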

Predictor | % Correct | MOV Error
---|---|---
Base Statistical Predictor | 72.3% | 11.10

This is the same performance I have reported earlier, and tracks fairly well with the best performance from the predictors based upon strength ratings.

Now we want to augment that predictor with derived statistics to see if they offer any performance improvement. As mentioned last time, we have 1200 derived statistics, so we have to do some feature selection to thin that crop for testing.

One possibility (as discussed here) is to build a decision tree, and use the features identified in the tree. If we do that (and force the tree to be small), we identify these derived features as important:
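To illustrate the idea, here is a toy version of tree-based feature selection: rank candidate features by how well a depth-1 tree (a single threshold "stump") separates wins from losses. The feature names and data are made up for illustration; a real run would fit one small tree over all 1200 derived statistics rather than a stump per feature.

```python
import numpy as np

# Two informative features plus two pure-noise features,
# with hypothetical names echoing the derived statistics.
rng = np.random.default_rng(1)
names = ["mov_per_poss_over_win_pct", "opp_fgm_over_avg_score",
         "noise_a", "noise_b"]
X = rng.normal(size=(400, 4))
home_win = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400)) > 0

def stump_accuracy(x, y):
    """Best classification accuracy over a grid of threshold splits."""
    best = 0.0
    for t in np.quantile(x, np.linspace(0.1, 0.9, 9)):
        above = x > t
        best = max(best, np.mean(above == y), np.mean(~above == y))
    return best

scores = [stump_accuracy(X[:, j], home_win) for j in range(X.shape[1])]
ranked = sorted(zip(names, scores), key=lambda p: -p[1])
```

Forcing the tree to stay small plays the same role as the stump here: only the strongest splits survive, so the features they use form a short candidate list.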

* The home team's average margin of victory per possession over the overall winning percentage
* The away team's average number of field goals made by opponents over the average score
* The home team's average assists by opponents over the field goals made
* The home team's average MOV per game over the home winning percentage

To test that, I add those statistics to my base statistics and re-run the linear regression. In this case, what I find is that while some of the derived statistics are identified as having high value by the linear regression, the overall performance does not improve.

There are other methods for feature selection, of course. RapidMiner has an extension focused solely on feature selection. This offers a variety of approaches, including selecting based on Maximum Relevance, Correlation-Based Feature Selection, and Recursive Conditional Correlation Weighting. All of these methods identified "important" derived statistics, but none produced a set of features that out-performed the base set.
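The simplest relevance-based filter can be sketched in a few lines: rank each derived statistic by its absolute correlation with the target and keep the top k. (RapidMiner's Maximum Relevance selection works in this spirit; the data and the choice of k below are purely illustrative.)

```python
import numpy as np

# Ten synthetic features, of which only columns 3 and 7 drive the target.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = X[:, 3] - 2.0 * X[:, 7] + rng.normal(scale=0.5, size=300)

# Absolute Pearson correlation of each feature with the target.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                 for j in range(X.shape[1])])
top_k = np.argsort(corr)[::-1][:2]  # indices of the two strongest features
```

Filters like this are cheap because they score each feature independently, but that is also their weakness: they can miss features that only help in combination, which is one reason none of them beat the base set here.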

A final approach is a brute force approach called forward search. In this approach, we start with the base set of statistics, add each of the derived statistics in turn, and test each combination. If any of those combinations improve on the base set, we pick the best combination and repeat the process. We continue this way until we can find no further improvement.
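The loop above can be sketched directly. This is a toy version of greedy forward search: starting from a fixed base set, repeatedly add whichever derived feature most improves held-out MOV error, and stop when no candidate helps. The data, the train/holdout split, and the error metric are all illustrative assumptions, not the author's actual setup.

```python
import numpy as np

# Synthetic base and derived features; only derived column 2 is informative.
rng = np.random.default_rng(3)
n, n_train = 400, 300
base = rng.normal(size=(n, 3))
derived = rng.normal(size=(n, 5))
mov = base @ np.array([1.0, -0.5, 0.3]) \
      + 0.8 * derived[:, 2] + rng.normal(scale=1.0, size=n)

def holdout_error(cols):
    """Fit OLS on the training split; return MOV error on the holdout."""
    A = np.hstack([base, derived[:, cols], np.ones((n, 1))])
    w, *_ = np.linalg.lstsq(A[:n_train], mov[:n_train], rcond=None)
    return np.mean(np.abs(A[n_train:] @ w - mov[n_train:]))

selected, best = [], holdout_error([])
while True:
    # Try adding each remaining derived feature to the current set.
    trials = {j: holdout_error(selected + [j])
              for j in range(derived.shape[1]) if j not in selected}
    j, err = min(trials.items(), key=lambda kv: kv[1])
    if err >= best:        # no candidate improves: stop
        break
    selected, best = selected + [j], err
```

Note that the candidates must be scored on held-out games (or by cross-validation): on training data alone, adding any feature reduces least-squares error, so the loop would never stop.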

There are a couple of advantages to this approach. First, there's no guessing about what features will be useful -- instead we're actually running a full test every time and determining whether a feature is useful or not. Second, because we test every remaining candidate at each step, the search is systematic, and it will find the best combination -- provided that improvement is monotonic with regard to adding features. If the best feature set is "A, B, C" then we're assuming we can find it by adding A first (because it offers the most improvement at the first step), then B to that, and so on. That isn't always true, but in this case it seems a reasonable assumption.

The big drawback of this approach is that it is very expensive. We have to try lots of combinations of features, and we have to run a full test for each combination. In this case, the forward search took about 54 hours to complete -- and since I had to run it several times because of errors or tweaks to the process, it ended up taking about a solid week of computer time.

In the end, the forward search identified ten derived features, with this performance:

Predictor | % Correct | MOV Error
---|---|---
Base Statistical Predictor | 72.3% | 11.10
w/ Forward Search Features | 74.0% | 10.73

This is a fairly significant improvement. The most important derived features in the resulting model were:

* The away team's opponent scoring average over the away team's winning percentage
* The away team's offensive rebounding average over the away team's number of field goals attempted
* The away team's scoring average over the away team's winning percentage
* The away team's opponent treys attempted over the away team's rebounds

I'll leave it to the reader to contemplate the meaning of these statistics, but there are some interesting suggestions here. The first and third statistics seem to be saying something about whether the away team is winning games through defense or offense. The second and fourth statistics seem to be saying something about rebounding efficiency, and perhaps about whether the team is good at getting "long" rebounds. (The statistics for the home team are completely different, by the way.)

Next time I'll begin looking at a different set of derived statistics.
