## Tuesday, March 22, 2016

### What Would a Perfect (Knowledge) Predictor Score in the Kaggle Competition?

It isn't possible to have a perfect predictor for NCAA Tournament games, because the outcome is probabilistic.  We can't know for sure who is going to win a game.  But we could conceivably have a predictor with perfect knowledge.  This predictor would know the true probability for every game.  That is, if Duke is 75% likely to beat Yale, the perfect knowledge predictor would provide that number.  (Because predicting the true probability results in the best score in the long run.) What would such a predictor score in the Kaggle Contest?

The Kaggle contest uses a log-loss scoring system.  In this system, a correct prediction is worth the log of the confidence of the prediction, and an incorrect prediction is worth one minus the log of the confidence of the prediction.  (And for the Kaggle contest the sign is then swapped so that smaller numbers are better.

Let's return to our example of Duke versus Yale.  Our perfect knowledge predictor predicts Duke over Yale with 0.75 confidence.  What would this predictor score in the long run?  (I.e., if Duke and Yale played thousands of times.)  Since the prediction is also the true probability that Duke will win, that number is given by the equation:

0.75 * ln(0.75) + (1-0.75) * ln(1-0.75)

that is, 75%  of the time Duke will win and in those cases the predictor will score ln(0.75), and 25% of the time Yale will win and the predictor will score ln(0.25).   This happens to come out to about -0.56 (or 0.56 in Kaggle terms).

So we see how to calculate the expected score of our perfect knowledge predictor given the true advantage.  If the favorite in all the Tournament games was 75% likely to win, then our perfect predictor would be expected to score 0.56.  But we don't know the true advantage in Tournament games, and they're all different advantages.  Is there some way we can estimate this?

One approach is to use the historical results.  We know how many games were upsets in past Tournaments, so we can use this to estimate the true advantage.  For example, we can look at all the historical 7 vs. 12 matchups and use the results to estimate the true advantage in those games.  (One problem with this approach is that in every Tournament, some teams are "mis-seeded".  If we judge upsets by seed numbers, this adds some error.)

Between this Wikipedia page and this ESPN page we can determine the win percentages for every possible first-round matchup.  There have been a reasonable number of these matchups (128 for each type of first-round matchup) so we can have at least a modicum of confidence that the historical win percentage is indicative of the true advantage:

SeedWin Pct
1 vs. 16100%
2 vs. 1594%
3 vs. 1484%
4 vs. 1380%
5 vs. 1264%
6 vs. 1164%
7 vs. 1061%
8 vs. 951%

Using the win percentage as the true advantage, we can then calculate what our perfect knowledge predictor would score in each type of match-up:

SeedWin PctScore
1 vs. 16100%0.00
2 vs. 1594%-0.22
3 vs. 1484%-0.45
4 vs. 1380%-0.50
5 vs. 1264%-0.65
6 vs. 1164%-0.65
7 vs. 1061%-0.67
8 vs. 951%-0.69

Since there are equal numbers of each of these games, the average performance of the predictor is just the average of these scores:  -0.48.

This analysis can be extended in a straightforward way to the later rounds of the tournament, but since there are fewer examples in each category it's hard to have much faith in some of those numbers.  But I would expect the later round games to make the perfect knowledge predictor's score worse, because more of those games are going to be close match-ups like the 8 vs. 9 case.

So 0.48 probably represents an optimistic lower bound for performance in the Kaggle competition.

UPDATE #1:

Here's an rough attempt to estimate the performance of the perfect predictor in the other rounds of the Tournament.

According to the Wikipedia page, there have been 52 upsets in the remaining rounds of the Tournament (a rate of about 2%).  If we treat all these games as having an average seed difference of 4 (which is a conservative estimate), then our log-loss score on these games would be about -0.66.  (Intuitively, this is as we would expect -- with most of the low seeds eliminated, games in the later rounds are going to be between teams that are more nearly equal in strength, so our log-loss score will be correspondingly worse.)  Since there are as many first round games as all the other rounds, the overall performance is just the average of -0.48 and -0.66:  0.57.

UPDATE #2:

Over in the Kaggle thread on this topic, Good Spellr pointed out that if you treat the first round games as independent events with a normal distribution, you can estimate the variance as well:

variance = (1/n^2) sum_(i=1)^n p_i*(1 - p_i)*(Log[p_i/(1 - p_i)])^2

which works out to a standard deviation of about 0.07. That means that after the first round of the tournament, the perfect prediction would fall in the range [0.34, 0.62] about 95% of the time.
.