Friday, February 27, 2015

Five Mistakes Kaggle Competitors Should Avoid

#1 -- Don't think you're going to win the competition.

One of the results that came out of the analysis of last year's contest is that the winner was essentially random:  at least the top 200 entries could have credibly won the contest.  Why?  Evidence from the best predictors suggests that there is about 8 points or so of unpredictability in college basketball games.  That's a lot of randomness.  Last year, 32 of the 64 games in the Tournament were decided by 8 points or less.  So even if you have the most accurate predictor in the contest, you're almost certain to be beaten by someone who made a worse prediction and got lucky when it came true.  It's the same reason why the ESPN pool is usually won by an octopus or someone who picked based on mascot fashions. On the other hand, maybe this year you'll be the guy who gets lucky.  It could happen.

#2 -- Don't use the data from before the 2008-2009 season.

Isn't it nice of the Kaggle administrators to provide data back to 1985?

If you're not familiar with college basketball, you might not realize that the college game underwent a radical change at the beginning of the 2008-2009 season when the NCAA instituted the three-point shot at a consistent distance of 20' 9".  The three-point shot created whole new game strategies, and data from before that season is probably not easily applicable to today's game.

#3 -- The Tournament data is not enough for training or testing.

More like March Sadness

At 64 games a year, the Tournament just doesn't provide enough data for training or even testing a predictor with any reliability.  You may think you're being smart to build your model specifically for the Tournament -- imagine the advantage you'll have over all the other competitors that don't understand how different the Tournament is from the regular season.  Ha!

But actually you're just overfitting your model.   My own predictor needs about 15,000 training examples for best performance.  Your mileage may vary -- maybe you only need 14,000 training examples -- but there just isn't enough information in the Tournament games alone to do accurate prediction.  Particularly since you shouldn't use the games from before 2008 (see #2).  Of course, you can do all that and you might still win the contest (see #1).

#4 -- Beware of leakage!

Guess what?  It turns out that you can do a really good job of predicting the Tournament if you know the results ahead of time.  Who knew?

Now that's not a big problem in the real contest because (short of psychic powers) no one knows the results ahead of time.  But if the forums from last year and this year are any indication, it's a big problem for many Kagglers as they build and test their predictors.  Knowledge from the games they're testing creeps into the model and results in unrealistically good performance.

A First-Time Kaggle Competitor
There are three major ways this happens.

The first and most obvious way this happens is that a model is trained and tested on the same data.  In some cases you can get away with doing this -- particularly if you have a lot of data and a model without many degrees of freedom.  But that isn't the case for most of the Kaggle models.  If you train your model on the Tournament data and then test it on the same data (or a subset of the data), it's probably going to perform unreasonably well.  You address this by setting aside the test data so that it's not part of the training data.  For example, you could train on the Tournament data from 2008 to 2013 and then test on the 2014 Tournament.  (Although see #3 above about using just the Tournament data.)  Cross-validation is another, more robust approach to avoiding this problem.

The second way this often happens is that you unwittingly use input data that contains information about the test games.  A lot of Kagglers use data like Sagarin's ratings without understanding how these statistics are created.  (I'm looking at you, Team Harvard.)  Unless you are careful this can result in information about the test games leaking back into your model.  The most common error is using ratings or statistics from the end of the season to train a model for games earlier in the season.  For example, Sagarin's final ratings are based upon all the games played that season -- including the Tournament games -- so if you use those ratings, they already include information about the Tournament games.  But there are more subtle leaks as well, particularly if you're calculating your own statistics.

The third and least obvious way this happens is when you tune your model.  Imagine that you are building your model, taking care to separate out your test data and avoid using tainted ratings.  You test your model on the 2014 Tournament and get mediocre results.  So you tweak one of your model parameters and test your model again, and your results have improved.  That's great!  Or is it?  In fact, what you've done is leak information about the 2014 Tournament back into your model.  (This can also be seen as a type of overfitting to your test data.)  This problem is more difficult to avoid, because tuning is an important part of the model building process.  One hedge is to use robust cross-validation rather than a single test set.  This helps keep your tuning more general.

How can you tell when you're suffering from leakage?  Your performance can provide an indicator.  Last year's winner had a log-loss score of 0.52, and the median score was around 0.58.  If your predictor is getting performance significantly better than those numbers, then you're either (1) a genius, or (2) have a problem.  It's up to you to decide which.

#5 -- A Miscellany of Important Notes
  • College basketball has a significant home court advantage (HCA).  (And yes, there may be a home court advantage in Tournament games!) Your model needs to account for the HCA and how it differs for neutral court and Tournament games.  If your model doesn't distinguish home and away, you've got a problem.
  • College teams change significantly from season to season.  You can't use a team's performance in one season to predict its performance in another season.  (This seems obvious, but last year's Harvard team seems to have made this mistake.  On the other hand, they got a journal publication out of it, so if you're an academic this might work for you too.)
  • Entering your best predictions might not be the best way to win the contest.  Since the contest has a large random element (see #1 above) your best strategy might be to skew your predictions in some way to distinguish yourself from similar entries, i.e., you should think about meta-strategy.


  1. For point #2, the NCAA didn't adopt the 3-point line in 2008-09, they just moved the line back a foot. It did impact 3P% by about 0.8%. Depending on the model, it may or may not be worth it to try and include data prior to 08-09 based on that, but I don't think it's a radical shift.

  2. Fair point, Monte. I think of it as a major change. You should go beat me up on the Kaggle forum though, not just here! :-)

  3. It is interesting to see if it is a radical enough change to offset the smaller dataset, and really I guess that's always the question--which is smaller, the noise from the game changing over time, or the noise from having less data to train/test on. I'll have to check out the Kaggle forum, didn't know it was active already!