## Monday, March 2, 2015

### Reasonable Kaggle Performance

The first stage of the Kaggle competition involves Kagglers testing out their models against data from the past few basketball seasons, and these scores appear on the first stage leaderboard.  Invariably new Kagglers make some fundamental mistakes and end up submitting entries with unreasonably good performance.  The administrators of the contest have taken to removing these entries to avoid discouraging other competitors.  The line for removing entries is somewhat fuzzy, and it begs the question1 "What is a reasonable long-term performance for a Tournament predictor?" There are probably many ways to answer this question,2 but here's one approach that I think is reasonable:  Calculate the performance of the best possible predictor over an infinite number of Tournaments.

I am reminded at this point of an old joke.
A man is sitting in a bar complaining to his friend -- who happens to be a physicist -- about his awful luck at the racing track, and wishing he had some better way to know what horse was going to win each race.

"Well, that strikes me as a rather simple physics problem," his friend says.  "I'm sure I could build a model to predict the outcome."

"Really?" says the man, visibly excited.  "That's fantastic.  We'll both get rich!"

So the physicist goes off to build his model.  After a week, the man has still heard nothing, so he calls his friend.  "How are you doing on the model?" he asks.

"Well," says the physicist.  "I admit that it is turning out to be a bit more complicated than I imagined.  But I'm very close."

"Great," says the man.  "I can't wait!"

But another week goes by and the man hears nothing, so he calls again.

"Don't bother me," snaps the physicist.  "I've been working on this day and night.  I'm very close to a breakthrough!"

So the man leaves his friend alone.  Weeks pass, when suddenly the man is awakened in the middle of the night by a furious pounding on his front door.  He opens the door and sees his friend the physicist.  He looks terrible -- gaunt and strained, his hair a mess -- and he is clutching a sheaf of crumpled papers.  "I have it!" he shouts as the door opens.  "With this model we can predict the winner of any horse race!"

The man's face lights up.  "I can't believe you did it," he says.  "Tell me how it works."

"First of all," says the physicist, "we assume that the horses are perfect spheres racing in a vacuum..."
Like the physicist, we face a couple difficulties.  For one thing, we don't have the best possible predictor.  For another, we don't have an infinite set of Tournaments.  No matter, we shall push on.

We don't have the best possible predictor (or even know what its performance would be) but we do have some data from the best known predictors and we can use that as a substitute.  The Vegas opening line is generally acknowledged to be the best known predictor (although a few predictors do manage to consistently beat the closing line, albeit by small margins).  The Vegas opening line predicts around 74% of the games correctly "straight up" (which is what the Kaggle contest requires). I'm personally dubious that anyone can improve upon this figure significantly3 but for the sake of this analysis let's assume that the best possible predictor can predict an average game4 correctly 80% of the time.

We also don't have an infinite number of Tournaments to predict, but we can assume that the average score on an infinite number of Tournament games will tend towards the score on an average Tournament game.  For the log-loss scoring function, the best score in the long run comes from predicting our actual confidence (the 80% from above).  If we predict an infinite number of games at 80% and get 80% of those games correct, our score is:

0.80*log(0.80) + (1-0.80)*log(1-0.80)

which turns out (fortuitously) to be just about 0.50.  (If we use a performance of 74%, the score is about 0.57.)

This analysis suggests that the theoretical best score we can expect predicting a large number of Tournament games is around 0.50 (and probably closer to 0.57).  This agrees well with last year's results -- the winner had a score of about 0.52 and the median score was about 0.58.

As far as "administrative removal" goes, there are 252 scored games in the Kaggle stage one test set.  That's not an infinite set of games, but it is enough to exert a strong regression towards the mean.  The Kaggle administrators are probably justified in removing any entry with a score below 0.45.

On a practical level, if your predictor is performing significantly better than about 0.55 for the test data, it strongly suggests that you have a problem.  The most likely problems are that you are leaking information into your solution or that you are overfitting your model to the test data.

Or, you know, you could be a genius.  That's always a possibility.

1 Yes, I know I'm misusing  "beg the question".
2 I suspect that a better approach treats the games within the Tournament as a normal distribution and sums over the distribution to find the average performance, but that's rather too much work for me to attempt.
3 If for no other reason than Vegas has a huge financial incentive to improve this number if they could.
4 The performance of the Vegas line is an average over many games.  Some games (like huge mismatches) the Vegas line predicts better than 74%; some (like very close matchups) it predicts closer to 50%.  I'm making the simplifying assumption here that the average over all the games corresponds to the performance on an average game.  Later on I make the implicit assumption that the distribution of Tournament games is the same as the distribution of games for which we have a Vegas line.  You can quarrel with either of these assumptions if you'd like.  A quick analysis of the Tournament games since 2006 shows that the Vegas line is only right 68% of the time, suggesting that Tournament games may be harder to predict than the average game.