In my previous post I talked about boosting. Boosting tries to improve the accuracy of a model by increasing the importance ("boosting") of the training examples where the model has the most error. Most machine learning toolkits include some form of boosting (often AdaBoost). If you're not using a toolkit, it's pretty easy to manually implement a simple form of boosting.
To do this, train your classifier and then run it on all the examples in your training set. You now have a training set with both predicted and actual outcomes. You can use this information to find the training examples with the most error. If you're predicting the margin of victory (MOV) of a college basketball game, those are the training examples where the prediction differs most from the actual outcome.
(As an aside, this is most straightforwardly measured as the absolute value (or square) of the difference between the prediction and the actual outcome. But you might also consider measuring this as a percentage error, if you think that being off by 1 point when you predict a 1 point victory is worse than being off by 1 point when you predict a 20 point victory.)
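For concreteness, here's a minimal Python sketch of that step. The names model, X_train, and y_train are hypothetical stand-ins for a trained MOV predictor and its training data (as NumPy arrays), not anything specific to my setup:

```python
import numpy as np

# Run the trained predictor back over its own training set.
preds = model.predict(X_train)

# Absolute error in points, plus a percentage-style alternative that guards
# against dividing by near-zero predicted margins.
abs_error = np.abs(preds - y_train)
pct_error = abs_error / np.maximum(np.abs(preds), 1)

# Training examples ordered from largest error to smallest.
worst_idx = np.argsort(abs_error)[::-1]
```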
Now you can take the worst predictions and duplicate them in the training set (so that they effectively count double) or even use them as the training set for a new predictor. (Some machine learning models allow you to explicitly assign a weight to each training example. In that case you can make them worth (say) 150%.)
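Here's a rough sketch of boosting by duplication, continuing from the error calculation above (again assuming NumPy arrays; the 20% cutoff and 150% weight are just illustrative):

```python
# Duplicate the worst 20% of examples so they effectively count double.
n_worst = int(0.20 * len(y_train))
worst = worst_idx[:n_worst]

X_boosted = np.concatenate([X_train, X_train[worst]])
y_boosted = np.concatenate([y_train, y_train[worst]])

# If your model supports per-example weights (e.g. most scikit-learn
# estimators accept a sample_weight argument to fit), you can up-weight
# instead of duplicating:
weights = np.ones(len(y_train))
weights[worst] = 1.5   # make the worst examples worth 150%
# model.fit(X_train, y_train, sample_weight=weights)
```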
For my predictor, I have about 28,000 games in the training set. I did an experiment where I took the worst 5600 (~20%) predictions and duplicated them in my training data. Then I trained a new predictor on this "boosted" data and checked its accuracy. I found a very small improvement in accuracy.
Why doesn't this approach boost accuracy for this problem? I speculate that most of the games with large errors are games where "luck" skewed strongly towards one team. Maybe one team shot 70% on their three-pointers, and the other team missed all their free throws. Or one team just had a slightly bad night all-around and the other team had a slightly good night all-around. By definition these sorts of events are unpredictable, so it doesn't benefit our model to try harder to predict them. There's nothing useful in that data for the model to learn.
This insight suggests another approach. If the worst predictions are in fact unpredictable, it might benefit us not to skew our training by trying to predict them. We can do this by removing them from our training set (or de-weighting them if our model permits that). I've never seen this approach suggested in the literature, so I don't have a name for it, but I'll call this "Deboosting."
We can create a deboosted training set in much the same way as a boosted set, except by deleting rather than duplicating the worst predictions. I did this and checked the accuracy of the resulting predictor. Again, it showed only a slight improvement.
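The deboosted version is just the flip side of the earlier sketch -- drop the worst examples instead of duplicating them (worst_idx and n_worst as before):

```python
# Keep everything except the worst-predicted examples.
keep = np.setdiff1d(np.arange(len(y_train)), worst_idx[:n_worst])

X_deboosted = X_train[keep]
y_deboosted = y_train[keep]
# ...then retrain on the deboosted set and compare against the baseline.
```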
So why doesn't this approach do much to improve accuracy either? I suspect the answer is that the errors in the worst predictions are essentially random -- some in one direction, some in the other. The games are unpredictable, but there's little net effect on the model when we train with these examples, because the errors tend to cancel each other out. So removing them doesn't have as much impact as we might naively expect.
You can also consider doing similar things with the best predictions. Removing the best predictions is a form of boosting (because it effectively increases the weight of the worst predictions) and duplicating the best predictions is likewise a form of deboosting. You can also experiment with combinations, such as removing the worst examples and duplicating the best examples.
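One possible combination, sketched with the same hypothetical arrays (the 10% cutoffs are arbitrary):

```python
# Drop the worst 10% of examples and duplicate the best 10%.
n = len(y_train)
order = np.argsort(abs_error)        # smallest error first
best, worst = order[: n // 10], order[-(n // 10):]

keep = np.setdiff1d(np.arange(n), worst)
X_combo = np.concatenate([X_train[keep], X_train[best]])
y_combo = np.concatenate([y_train[keep], y_train[best]])
```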
Finally, you can experiment with boosting approaches that make use of your knowledge of the problem domain and/or of attribute values to select the examples to boost. As a simple example, maybe accuracy is improved by boosting the weight of all non-conference road games. Or by boosting examples where one team is much stronger than the other (or conversely, where the teams are very evenly matched). There's a lot of room for creative speculation!
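For example, a domain-driven boost might look like this, assuming a hypothetical boolean array nonconf_road (aligned with the training rows) that flags non-conference road games:

```python
# Duplicate the non-conference road games so they count double.
X_domain = np.concatenate([X_train, X_train[nonconf_road]])
y_domain = np.concatenate([y_train, y_train[nonconf_road]])
```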
Great post. If we're trying to predict the results of NCAA Tournament games, I wonder if boosting neutral court games would be helpful. Or perhaps even more helpful would be boosting/deboosting on a round-by-round predictive basis. For example, for the 1st Round of the tournament, boosting examples where one team is much stronger than the other team (perhaps even varying the boost corresponding to seed-difference between the two teams), and then boosting examples where teams are very evenly matched for later rounds, as teams should be more evenly matched.
Tyler, that's an interesting thought about predicting the tournament. There are relatively few neutral court games (and even fewer tournament games), so boosting their importance might be particularly useful.