The Kaggle contest uses a log-loss scoring system. In this system, a correct prediction is worth the log of the confidence of the prediction, and an incorrect prediction is worth the log of one minus the confidence of the prediction. (For the Kaggle contest the sign is then flipped, so that smaller numbers are better.)
Let's return to our example of Duke versus Yale. Our perfect knowledge predictor predicts Duke over Yale with 0.75 confidence. What would this predictor score in the long run? (I.e., if Duke and Yale played thousands of times.) Since the prediction is also the true probability that Duke will win, that number is given by the equation:
`0.75 * ln(0.75) + (1-0.75) * ln(1-0.75)`
that is, 75% of the time Duke will win and in those cases the predictor will score ln(0.75), and 25% of the time Yale will win and the predictor will score ln(0.25). This happens to come out to about -0.56 (or 0.56 in Kaggle terms).
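If it helps, here is a minimal Python sketch of that calculation (the helper name `expected_log_loss` is just for illustration):

```python
import math

def expected_log_loss(p):
    # Expected score when the predicted confidence equals the true
    # probability p that the favorite wins.
    return p * math.log(p) + (1 - p) * math.log(1 - p)

print(expected_log_loss(0.75))  # ≈ -0.56, i.e. 0.56 in Kaggle terms
```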
So we see how to calculate the expected score of our perfect knowledge predictor given the true advantage. If the favorite in all the Tournament games was 75% likely to win, then our perfect predictor would be expected to score 0.56. But we don't know the true advantage in Tournament games, and they're all different advantages. Is there some way we can estimate this?
One approach is to use the historical results. We know how many games were upsets in past Tournaments, so we can use this to estimate the true advantage. For example, we can look at all the historical 7 vs. 12 matchups and use the results to estimate the true advantage in those games. (One problem with this approach is that in every Tournament, some teams are "mis-seeded". If we judge upsets by seed numbers, this adds some error.)
Matchup | Win Pct |
---|---|
1 vs. 16 | 100% |
2 vs. 15 | 94% |
3 vs. 14 | 84% |
4 vs. 13 | 80% |
5 vs. 12 | 64% |
6 vs. 11 | 64% |
7 vs. 10 | 61% |
8 vs. 9 | 51% |
Using the win percentage as the true advantage, we can then calculate what our perfect knowledge predictor would score in each type of match-up:
Matchup | Win Pct | Score |
---|---|---|
1 vs. 16 | 100% | 0.00 |
2 vs. 15 | 94% | -0.22 |
3 vs. 14 | 84% | -0.45 |
4 vs. 13 | 80% | -0.50 |
5 vs. 12 | 64% | -0.65 |
6 vs. 11 | 64% | -0.65 |
7 vs. 10 | 61% | -0.67 |
8 vs. 9 | 51% | -0.69 |
Since there are equal numbers of each of these games, the average performance of the predictor is just the average of these scores: -0.48.
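Here's a rough Python sketch that reproduces the table and the average; the win percentages are copied from the table above, and the `0 * log(0)` term for the 1 vs. 16 case is treated as 0:

```python
import math

def expected_log_loss(p):
    # Expected score for the perfect knowledge predictor; a q * log(q)
    # term with q == 0 is treated as 0 (the 1 vs. 16 case).
    return sum(q * math.log(q) for q in (p, 1 - p) if q > 0)

# Favorite win percentages by first-round matchup, from the table above.
win_pct = {"1 vs. 16": 1.00, "2 vs. 15": 0.94, "3 vs. 14": 0.84, "4 vs. 13": 0.80,
           "5 vs. 12": 0.64, "6 vs. 11": 0.64, "7 vs. 10": 0.61, "8 vs. 9": 0.51}

scores = {matchup: expected_log_loss(p) for matchup, p in win_pct.items()}
for matchup, score in scores.items():
    print(f"{matchup}: {score:.2f}")

# Equal numbers of each matchup, so the overall expectation is a simple average.
print(f"Average: {sum(scores.values()) / len(scores):.2f}")  # ≈ -0.48
```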
This analysis can be extended in a straightforward way to the later rounds of the Tournament, but since there are fewer examples in each category it's hard to have much faith in some of those numbers. I would expect the later-round games to make the perfect knowledge predictor's score worse, though, because more of those games are going to be close match-ups like the 8 vs. 9 case.
So 0.48 probably represents an optimistic lower bound for performance in the Kaggle competition.
UPDATE #1:
Here's a rough attempt to estimate the performance of the perfect knowledge predictor in the other rounds of the Tournament.
According to the Wikipedia page, there have been 52 upsets in the remaining rounds of the Tournament (a rate of about 2%). If we treat all these games as having an average seed difference of 4 (which is a conservative estimate), then our log-loss score on these games would be about -0.66. (Intuitively, this is what we would expect: with most of the weaker seeds eliminated, games in the later rounds are between teams that are more nearly equal in strength, so our log-loss score is correspondingly worse.) Since there are about as many first round games as all the other rounds combined, the overall performance is just the average of -0.48 and -0.66: -0.57, or 0.57 in Kaggle terms.
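As a rough check on that arithmetic, here's a small Python sketch; the 0.62 later-round win probability is just an assumed value chosen so the expected score comes out near the -0.66 figure above, not something estimated from the data:

```python
import math

def expected_log_loss(p):
    return p * math.log(p) + (1 - p) * math.log(1 - p)

first_round = -0.48                     # average from the first-round table above
later_rounds = expected_log_loss(0.62)  # ≈ -0.66 with an assumed 62% favorite win rate

# Roughly as many first round games as all the later rounds combined.
overall = (first_round + later_rounds) / 2
print(f"{overall:.2f}")                 # ≈ -0.57, i.e. 0.57 in Kaggle terms
```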
UPDATE #2:
Over in the Kaggle thread on this topic, Good Spellr pointed out that if you treat the first round games as independent events with a normal approximation, you can estimate the variance of this score as well as its mean.
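One rough way to get at that variance (not Good Spellr's derivation, just an illustrative simulation using the first-round win percentages above) is to simulate many first rounds and look at the spread of the perfect knowledge predictor's average score:

```python
import math
import random

# Favorite win percentages by first-round matchup (from the table above);
# four of each matchup gives the 32 first round games.
win_pct = [1.00, 0.94, 0.84, 0.80, 0.64, 0.64, 0.61, 0.51]
games = [p for p in win_pct for _ in range(4)]

def game_score(p, favorite_won):
    # Log-loss for a single game when the predicted confidence is p;
    # clamp to avoid log(0) in the 100% case.
    q = p if favorite_won else 1 - p
    return math.log(max(q, 1e-12))

def simulate_first_round():
    # Average score over one simulated first round, with each favorite
    # winning independently with its historical probability.
    return sum(game_score(p, random.random() < p) for p in games) / len(games)

samples = [simulate_first_round() for _ in range(100_000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"mean ≈ {mean:.2f}, standard deviation ≈ {std:.2f}")
```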
Scoring these kinds of things is hard because overconfidence can be rewarded (and it's more likely to be rewarded if you're on the right side of 50% in the long run), at least if you think in terms of an infinite number of Kaggle competitions. It'd be interesting to see someone generate a series of NCAA-like seasons with known true probabilities, run the various Kaggle submissions against them, and see whose picks come closest to the perfect knowledge predictor.
Something that I think is missing from Kaggle is model skill. This article covers it well for the uninitiated: http://fivethirtyeight.com/features/when-picking-a-bracket-its-easier-to-be-accurate-than-skillful/. Taking the log of probabilities gets at skill a little, because being more certain of the correct outcome is generally more skillful. However, if you said you were 100% certain of a 2 vs. 15 and a 3 vs. 14, you'd get the same score for both, even though it's less impressive to predict the 2 vs. 15 (where you were only 6% more certain than the naive historical baseline). But of course there is the problem of picking a historically reasonable 'naive' baseline for skill estimation, as you have pointed out.