The Kaggle contest uses a log-loss scoring system. In this system, a prediction scores the log of the probability assigned to the outcome that actually occurred: log of the confidence when the prediction is correct, and log of one minus the confidence when it is incorrect. (For the Kaggle contest the sign is then swapped so that smaller numbers are better.)
Let's return to our example of Duke versus Yale. Our perfect knowledge predictor predicts Duke over Yale with 0.75 confidence. What would this predictor score in the long run? (I.e., if Duke and Yale played thousands of times.) Since the prediction is also the true probability that Duke will win, that number is given by the equation:
`0.75 * ln(0.75) + (1-0.75) * ln(1-0.75)`
that is, 75% of the time Duke will win and in those cases the predictor will score ln(0.75), and 25% of the time Yale will win and the predictor will score ln(0.25). This happens to come out to about -0.56 (or 0.56 in Kaggle terms).
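This arithmetic is easy to sanity-check. Here's a minimal sketch in Python (the function name is my own, not from the contest code):

```python
import math

def expected_log_loss(p):
    """Expected log-loss when the stated confidence p equals
    the true win probability of the favorite."""
    return p * math.log(p) + (1 - p) * math.log(1 - p)

print(round(expected_log_loss(0.75), 2))  # -0.56
```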
So we see how to calculate the expected score of our perfect knowledge predictor given the true advantage. If the favorite in all the Tournament games was 75% likely to win, then our perfect predictor would be expected to score 0.56. But we don't know the true advantage in Tournament games, and the advantage differs from game to game. Is there some way we can estimate it?
One approach is to use the historical results. We know how many games were upsets in past Tournaments, so we can use this to estimate the true advantage. For example, we can look at all the historical 7 vs. 12 matchups and use the results to estimate the true advantage in those games. (One problem with this approach is that in every Tournament, some teams are "mis-seeded". If we judge upsets by seed numbers, this adds some error.)
| Matchup | Favorite win % |
| --- | --- |
| 1 vs. 16 | 100% |
| 2 vs. 15 | 94% |
| 3 vs. 14 | 84% |
| 4 vs. 13 | 80% |
| 5 vs. 12 | 64% |
| 6 vs. 11 | 64% |
| 7 vs. 10 | 61% |
| 8 vs. 9 | 51% |
Using the win percentage as the true advantage, we can then calculate what our perfect knowledge predictor would score in each type of match-up:
| Matchup | Favorite win % | Expected score |
| --- | --- | --- |
| 1 vs. 16 | 100% | 0.00 |
| 2 vs. 15 | 94% | -0.22 |
| 3 vs. 14 | 84% | -0.45 |
| 4 vs. 13 | 80% | -0.50 |
| 5 vs. 12 | 64% | -0.65 |
| 6 vs. 11 | 64% | -0.65 |
| 7 vs. 10 | 61% | -0.67 |
| 8 vs. 9 | 51% | -0.69 |
Since there are equal numbers of each of these games, the average performance of the predictor is just the average of these scores: -0.48.
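The per-matchup scores and their average can be reproduced with a few lines of Python (a sketch; individual rows may differ by a hundredth or so depending on rounding, but the average comes out to -0.48):

```python
import math

def expected_log_loss(p):
    """Expected log-loss when the stated confidence p equals
    the true win probability of the favorite."""
    return p * math.log(p) + (1 - p) * math.log(1 - p)

# Historical favorite win rates by first-round seed matchup (from the table above).
win_rates = {
    "1 vs. 16": 1.00, "2 vs. 15": 0.94, "3 vs. 14": 0.84, "4 vs. 13": 0.80,
    "5 vs. 12": 0.64, "6 vs. 11": 0.64, "7 vs. 10": 0.61, "8 vs. 9": 0.51,
}

# A sure thing (p == 1.0) scores exactly 0; log(0) would blow up otherwise.
scores = {m: (0.0 if p == 1.0 else expected_log_loss(p))
          for m, p in win_rates.items()}

average = sum(scores.values()) / len(scores)
print(round(average, 2))  # -0.48
```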
This analysis can be extended in a straightforward way to the later rounds of the tournament, but since there are fewer examples in each category it's hard to have much faith in some of those numbers. But I would expect the later round games to make the perfect knowledge predictor's score worse, because more of those games are going to be close match-ups like the 8 vs. 9 case.
So 0.48 probably represents an optimistic lower bound for performance in the Kaggle competition -- even a predictor with perfect knowledge couldn't expect to beat it.
Here's a rough attempt to estimate the performance of the perfect predictor in the other rounds of the Tournament.
According to the Wikipedia page, there have been 52 upsets in the remaining rounds of the Tournament (a rate of about 2%). If we treat all these games as having an average seed difference of 4 (which is a conservative estimate), then our log-loss score on these games would be about -0.66. (Intuitively, this is as we would expect -- with most of the low seeds eliminated, games in the later rounds are going to be between teams that are more nearly equal in strength, so our log-loss score will be correspondingly worse.) Since there are about as many first-round games as games in all the other rounds combined, the overall performance is just the average of -0.48 and -0.66: about -0.57, or 0.57 in Kaggle terms.
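The blend is simple bookkeeping; as a sketch, weighting the two estimates by game count (32 first-round games vs. 31 later-round games in a 64-team bracket -- my assumption) gives essentially the same answer as a plain average:

```python
first_round = -0.48    # average first-round score computed earlier
later_rounds = -0.66   # rough estimate for all the later rounds

# Weight by game counts in a 64-team single-elimination bracket:
# 32 first-round games, 31 games in all remaining rounds.
overall = (32 * first_round + 31 * later_rounds) / 63
print(round(overall, 2))  # about -0.57, i.e. 0.57 in Kaggle terms
```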
Over in the Kaggle thread on this topic, Good Spellr pointed out that if you treat the first round games as independent events with a normal distribution, you can estimate the variance as well: