Friday, February 20, 2015

Kaggle Competition: From Point Spreads to Win Percentage

The Kaggle Competition asks competitors to estimate the win probabilities for all the possible Tournament games.  But for reasons that I'm sure have nothing to do with gambling, many systems -- including mine -- predict Margin of Victory (MOV) rather than probability of winning.  So how does one convert a predicted MOV to a win probability?

I started by creating a histogram of predicted MOV versus win probability for 31K games in my training set.  (I used predictions from my own system, but you could also do this with the opening or closing Vegas lines.)  I binned at 1 point intervals to get the following graph:

This shows that no team predicted to lose by 18 or more points ever won a game, that teams predicted to win by 0 points won half the time, and that teams predicted to win by 25 or more points won every time.  That seems pretty reasonable, and it's reassuring that the graph crosses zero at about 50 percent.

I could use this data directly to translate from predicted MOV to win probabilities.  If I predict a team is going to win by 10 points I could use this chart to see that its win probability is 83.1% and use that in my Kaggle entry.
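A minimal sketch of that binning and lookup in Python follows. The function names and the `(predicted_mov, won)` input format are illustrative assumptions, not the code behind the graph. It rounds each prediction to the nearest bin and averages the outcomes in that bin.

```python
from collections import defaultdict


def empirical_win_rates(games, bin_width=1.0):
    """Bin games by predicted MOV and compute the observed win rate per bin.

    games: iterable of (predicted_mov, won) pairs, where won is True/False
           for the team the prediction refers to (an assumed input format).
    Returns {bin_center: (win_rate, n_games)}.
    """
    totals = defaultdict(lambda: [0, 0])  # bin_center -> [wins, games]
    for predicted_mov, won in games:
        bin_center = round(predicted_mov / bin_width) * bin_width
        totals[bin_center][0] += int(won)
        totals[bin_center][1] += 1
    return {b: (wins / n, n) for b, (wins, n) in sorted(totals.items())}


def lookup_win_prob(table, predicted_mov, bin_width=1.0):
    """Read a win probability straight off the binned table.

    Predictions outside the observed range fall back to the nearest bin.
    """
    bin_center = round(predicted_mov / bin_width) * bin_width
    if bin_center not in table:
        bin_center = min(table, key=lambda b: abs(b - bin_center))
    return table[bin_center][0]
```

Each table entry also carries the number of games in the bin, which makes it easy to see which bins (typically the tails) are too sparse to trust.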

There are a couple of minor problems with this.  First, even with 31K games there's some obvious noise in the data, particularly at the tail ends of the ranges.  Second, I'm allowed two Kaggle entries, so I might want to create my second entry by tweaking this curve.  That will be hard to do working with the raw data like this.  For these reasons, I'd like a formula for mapping from predicted MOV to win probability.

A simple solution is to do a linear regression on the middle part of the graph.

The result is a pretty good fit.  This also reveals there's a little bias in my predictions.  If the predictions were perfectly unbiased the constant term in this equation would be 0.50.
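For anyone who wants to reproduce this step, here is a rough sketch of an ordinary least-squares fit restricted to the middle of the curve. The `cutoff` of 12 points is a hypothetical choice of what counts as "the middle", and the function names are made up for illustration.

```python
def fit_line(points):
    """Ordinary least squares for win_rate = intercept + slope * predicted_mov.

    points: list of (predicted_mov, win_rate) pairs.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_middle(points, cutoff=12.0):
    """Fit only the bins with |predicted MOV| <= cutoff (cutoff is an assumption)."""
    middle = [(x, y) for x, y in points if abs(x) <= cutoff]
    return fit_line(middle)


def linear_win_prob(predicted_mov, intercept, slope):
    """Apply the fitted line, clamped so the result stays a valid probability."""
    return min(1.0, max(0.0, intercept + slope * predicted_mov))
```

The clamp matters because a straight line leaves the 0 to 1 range for large predicted margins. An intercept different from 0.50 is the bias mentioned above.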

However, there's a pretty obvious S shape to this curve that the linear equation is not capturing.  I could fit with a higher-order polynomial (a fourth-order equation fits almost perfectly), but there's good reason to believe that what we're really seeing here is a cumulative normal distribution.  (The normal distribution is the familiar bell-shaped curve -- if I were to plot the distribution of the differences between the predicted MOV and the actual MOV, that's what we'd see -- and its cumulative form is this S-shaped curve.)

So let's try fitting a cumulative normal distribution to the data.

That's a pretty good-looking match.  I did this just by eyeballing the data and picking 0 as the mean and 10 as the standard deviation, but if you want more precision you can do some more complex analysis to get a better fit.  (Not surprisingly, this corresponds very closely with my mean bias and RMSE for this season.)
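As a sketch of both approaches, the code below converts a predicted MOV to a win probability with the eyeballed parameters (mean 0, standard deviation 10). It also shows one simple way to get a tighter fit: a grid search that minimizes the game-weighted squared error. The grid search is a stand-in for "more complex analysis", not the method used here, and the `{mov_bin: (win_rate, n_games)}` table format and function names are assumptions.

```python
from statistics import NormalDist


def mov_to_win_prob(predicted_mov, mean=0.0, std_dev=10.0):
    """Win probability from the cumulative normal distribution."""
    return NormalDist(mu=mean, sigma=std_dev).cdf(predicted_mov)


def fit_std_dev(table, mean=0.0, candidates=None):
    """Pick the standard deviation whose CDF best matches the binned data.

    table: {mov_bin: (win_rate, n_games)} (assumed format).
    candidates: standard deviations to try; defaults to 3.0 .. 20.0 in
                steps of 0.1 (an arbitrary illustrative range).
    """
    if candidates is None:
        candidates = [s / 10 for s in range(30, 201)]

    def weighted_sse(std_dev):
        dist = NormalDist(mu=mean, sigma=std_dev)
        # Weight each bin by its game count so sparse tail bins count less.
        return sum(n * (rate - dist.cdf(mov)) ** 2
                   for mov, (rate, n) in table.items())

    return min(candidates, key=weighted_sse)
```

With the default parameters, a 10-point predicted margin sits one standard deviation above the mean, so `mov_to_win_prob(10)` is about 0.84, close to the 83.1% read off the raw chart.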

Whether you use a simple linear equation or something more complex, you can now tweak your equation to create a new strategy.  For example, if I tweak my normal distribution to have a standard deviation of only 6, I get a new curve:

The effect of using this curve is to increase the confidence in my picks -- essentially to gamble that I'm going to be right more than I have been in the past. 
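To make the effect concrete, this small sketch prints the two curves side by side for a few predicted margins. The sample margins are arbitrary; the standard deviations of 10 and 6 are the ones discussed above.

```python
from statistics import NormalDist

baseline = NormalDist(mu=0.0, sigma=10.0)   # the eyeballed fit
confident = NormalDist(mu=0.0, sigma=6.0)   # the tweaked, more aggressive curve


def compare(predicted_mov):
    """Return (baseline probability, tweaked probability) for one prediction."""
    return baseline.cdf(predicted_mov), confident.cdf(predicted_mov)


for mov in (-10, -3, 0, 3, 10):
    base_p, conf_p = compare(mov)
    print(f"MOV {mov:+3d}: sd=10 -> {base_p:.3f}, sd=6 -> {conf_p:.3f}")
```

A 10-point favorite moves from roughly 0.84 to roughly 0.95. Every probability is pushed away from 0.5, while a predicted margin of 0 stays at exactly 0.5.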


  1. Nevermind.. I get it.. I was trying to generate the probability off of my own data, and was getting confused.

    I think something like this makes it much easier..

  2. I didn't know that was available on Team Rankings, but yeah you could grab that data and use it directly or do what I've done above. However, you might be introducing some error if the performance of your predictor differs significantly from the closing line used in the Team Rankings data.

  3. Thanks, I think I have it figured out.. this is the data from TeamRankings, plotted with the CDF given mu=0, s=10.

    So for my own model, I'd basically have to predict every game that I have the data for, then bin the predicted margins of victory, and for each bin, take the associated W/L outcomes and average them..

    so if we're looking at bin -15 and I have 40 games.. the outcome that prediction was correct = 1, incorrect = 0.. add up all the 40 1's, and 0's and divide by 40 to get the probability that model correctly predicts a winner given MOV = -15..

    Then when I've got all the data plot it, and attempt to best fit the data with CDF (specifying mu / s until it looks good), then I can just use the distribution function (with above mu / s) given any predicted MOV to get my p-value..

    Did I get all of that right?

  4. That's exactly right!

    I think last year I put together a spreadsheet to fit the CDF curve using Excel's Solver function. I'll try to dig that up if you're interested. (Although I suspect eyeballing it is precise enough given the random nature of the Kaggle competition.)

  5. I'm actually not doing the Kaggle competition, so no worries, I was more interested in it as a p-value to associate with my predictions. I'll definitely do that when I have the time to think through the logistics of predicting every game since 2008, but it'll be a bit.. Still working on optimizing feature selection for the new score differential model.

    I probably won't get to it until after bracket time, but I'll definitely have it in place for next year! Shouldn't be too hard to code once I generate all the predictions.

  6. Any chance that either of you could share a data table that maps from predicted MOV to probability of victory based on the analysis you described (i.e. the data points behind the graphs you posted)? I have a not-for-profit March Madness site that runs Monte Carlo simulations to provide continuous updates on the likelihood that each entrant will win their pool, and I've been using's data to predict the outcomes of games. In past years he posted a Pythagorean win percentage for each team that I could incorporate into a simple log5 formula to estimate the likelihood that one team would defeat another, but I realized last night that he's switched to adjusted efficiency margin -- so I figured I would just update my site to compute the difference between the two teams' AdjEM and then map that to a probability. I could use the data directly from TeamRankings (with a minor adjustment for the 100% cases), but I'd much prefer to use the output from one of your models (which are a bit beyond what I'm capable of producing at this moment given time and knowledge constraints). Thanks!

  7. I'm not any of them, but happened to come across your comment as I am making my own Excel file for brackets this year.

    If you follow the method from the website below, you can still calculate win probabilities from just the AdjO and AdjD numbers. Kenpom's methodology has only changed slightly when it comes to determining the margin of the game (home court advantage, which doesn't matter during March madness) also lets you compare any 2 teams and the percent probability each has to win the game to confirm your calculations.

    1. Thanks very much! The numbers from don't quite seem to match what you get from that method, but they're reasonably close, which is plenty good for my purposes.