Monday, January 5, 2015

The Effect of Additional Data on Performance

I've been wondering whether having more training data (i.e., additional seasons of games) would further improve my predictor.  That's hard to test directly, because I already have data back to when the 3 point shot was introduced in the 2009-2010 season, so I can't actually get any more usable data.  But the question persisted, so I did a quick-and-dirty experiment to try to characterize how much improvement I'd see with additional data.

I trained a model on differing amounts of training data and tested each one on the entire training set.  Ideally, I'd do this as some sort of k-fold cross-validation, picking different slices of the data for training, but I didn't want to spend the time that would require, so I just did each trial once.  So there's necessarily a lot of fuzziness in these results, but I still think the result is instructive.  The plot of error versus amount of training data looks like this:

That's error along the Y axis and number of training examples along the X.  You can see that error falls fairly steeply for the first 10K or so training examples and then begins to level off, although it continues to decrease slowly.  Eyeballing this chart suggests that additional data isn't likely to provide any big improvement.
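If you want to run this kind of experiment yourself, here's a minimal sketch of the idea.  It is not my actual predictor: the data is synthetic, the model is a plain least-squares fit, and the name `learning_curve`, the feature count, and the noise level are all made up for illustration.  It trains on increasingly large slices and scores each model against the whole data set, one trial per size, just like the quick-and-dirty version described above.

```python
import numpy as np

def learning_curve(X, y, sizes, seed=0):
    """Fit least squares on the first n shuffled rows; score on the full set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    X, y = X[order], y[order]
    Xb = np.column_stack([X, np.ones(len(X))])  # add an intercept column
    errors = []
    for n in sizes:
        # Train on only the first n examples...
        coef, *_ = np.linalg.lstsq(Xb[:n], y[:n], rcond=None)
        # ...but measure RMSE against every example.
        pred = Xb @ coef
        errors.append(float(np.sqrt(np.mean((pred - y) ** 2))))
    return errors

# Synthetic stand-in for real game data (an assumption, not real results).
rng = np.random.default_rng(42)
n_games, n_features = 20000, 30
X = rng.normal(size=(n_games, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + rng.normal(scale=10.0, size=n_games)

sizes = [100, 500, 1000, 5000, 10000, 15000, 20000]
for n, err in zip(sizes, learning_curve(X, y, sizes)):
    print(f"{n:>6} training games -> RMSE {err:.3f}")
```

Plotting `sizes` against the returned errors gives the same kind of chart.  Since each size is only tried once, expect some noise; averaging over several shuffles (different `seed` values) would smooth it out.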

If you're building your own predictor, this suggests that you should try to get at least 15K games for training data.  Depending upon how many games you throw out from the early season, that's around 3 full seasons of games.

This also shows the folly of trying to build a Tournament predictor based upon past Tournament games.  At 63 games a year, you'd need about 238 years of Tournament results to get a decent error rate :-).
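The arithmetic behind that number is easy to check.  Here's a quick sketch using only the figures from this post (15K games as the target, 63 Tournament games a year); the variable names are just for illustration.

```python
target_games = 15_000        # suggested minimum amount of training data
tournament_games = 63        # Tournament games per year

years_needed = target_games / tournament_games
print(f"{years_needed:.1f} years of Tournaments")  # 238.1 years of Tournaments

# The "around 3 full seasons" estimate implies this many usable games a season.
implied_games_per_season = target_games / 3
print(f"{implied_games_per_season:.0f} games per season")  # 5000 games per season
```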
