The Kaggle Competition asks competitors to estimate the win probabilities for all the possible Tournament games. But for reasons that I'm sure have nothing to do with gambling, many systems -- including mine -- predict Margin of Victory (MOV) rather than probability of winning. So how does one convert a predicted MOV to a win probability?
I started by plotting the observed win rate against predicted MOV for the 31K games in my training set. (I used predictions from my own system, but you could also do this with the opening or closing Vegas lines.) I binned at 1-point intervals to get the following graph:
This shows that no team predicted to lose by 18 or more points ever won a game, that teams predicted to win by 0 points won half the time, and that teams predicted to win by 25 or more points won every time. That seems pretty reasonable, and it's reassuring that the graph crosses zero at about 50 percent.
I could use this data directly to translate from predicted MOV to win probabilities. If I predict a team is going to win by 10 points I could use this chart to see that its win probability is 83.1% and use that in my Kaggle entry.
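Here is a minimal Python sketch of that binning and direct lookup. The function names and the toy data are made up for illustration; they are not my actual system or its predictions.

```python
from collections import defaultdict


def empirical_win_rates(predicted_movs, won_flags):
    """Map each 1-point predicted-MOV bin to the fraction of games won."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for mov, won in zip(predicted_movs, won_flags):
        mov_bin = round(mov)  # 1-point bins; exact .5 values round to the even integer
        games[mov_bin] += 1
        if won:
            wins[mov_bin] += 1
    return {b: wins[b] / games[b] for b in sorted(games)}


def lookup_win_probability(rates, predicted_mov):
    """Read a win probability off the binned table, using the nearest bin with data."""
    nearest = min(rates, key=lambda b: abs(b - predicted_mov))
    return rates[nearest]


# Made-up toy data, only to show the shape of the inputs.
toy_movs = [-3.2, -2.8, 0.4, 0.1, 9.6, 10.3, 10.1]
toy_won = [False, True, True, False, True, True, False]

rates = empirical_win_rates(toy_movs, toy_won)
print(rates)
print(lookup_win_probability(rates, 9.8))
```

With a real training set the table would have a bin for nearly every point spread, and the lookup would return values like the one read off the chart.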
There are a couple of minor problems with this. First, even with 31K games there's some obvious noise in the data, particularly at the tail ends of the ranges. Second, I'm allowed two Kaggle entries, so I might want to create my second entry by tweaking this curve. That will be hard to do working with the raw data like this. For these reasons, I'd like a formula for mapping from predicted MOV to win probability.
A simple solution is to do a linear regression on the middle part of the graph.
The result is a pretty good fit. This also reveals there's a little bias in my predictions. If the predictions were perfectly unbiased the constant term in this equation would be 0.50.
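A rough sketch of the linear fit, using ordinary least squares on the bins near zero. The `cutoff` that defines the "middle part" is an assumption (I haven't stated the range used), and the toy table is synthetic, built to lie exactly on a line.

```python
def fit_middle_line(rates, cutoff):
    """Least-squares line through the (MOV bin, win rate) points with |MOV| <= cutoff."""
    points = [(m, p) for m, p in rates.items() if abs(m) <= cutoff]
    n = len(points)
    mean_x = sum(m for m, _ in points) / n
    mean_y = sum(p for _, p in points) / n
    sxx = sum((m - mean_x) ** 2 for m, _ in points)
    sxy = sum((m - mean_x) * (p - mean_y) for m, p in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def linear_win_probability(predicted_mov, slope, intercept):
    """Apply the fitted line, clipped so the result stays a valid probability."""
    return min(1.0, max(0.0, intercept + slope * predicted_mov))


# Synthetic table: a straight line in the middle, flat tails outside it.
toy_rates = {m: 0.5 + 0.03 * m for m in range(-10, 11)}
toy_rates[-25] = 0.0
toy_rates[25] = 1.0

slope, intercept = fit_middle_line(toy_rates, cutoff=10)
print(slope, intercept)
print(linear_win_probability(40, slope, intercept))
```

The intercept is the constant term discussed above: for unbiased predictions it should come out at 0.50. The clipping is needed because a straight line runs past 0 and 1 for large spreads.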
However, there's a pretty obvious S shape to this curve that the linear equation is not capturing. I could fit a higher-order polynomial (a fourth-order equation fits almost perfectly), but there's good reason to believe that what we're really seeing here is a cumulative normal distribution. (That's the running total of the familiar bell-shaped curve -- if I were to plot the difference between the predicted MOV and the actual MOV, the bell curve is exactly what we'd see.)
So let's try fitting a cumulative normal distribution to the data.
That's a pretty good-looking match. I did this just by eyeballing the data and picking 0 as the mean and 10 as the standard deviation, but if you want more precision you can do some more complex analysis to get a better fit. (Not surprisingly, this corresponds very closely with my mean bias and RMSE for this season.)
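In code, the cumulative normal can be written with the error function from Python's standard library. This is a small sketch; the function name is mine, and the defaults are the eyeballed mean of 0 and standard deviation of 10.

```python
import math


def normal_win_probability(predicted_mov, mean=0.0, std_dev=10.0):
    """Cumulative normal distribution evaluated at the predicted MOV."""
    z = (predicted_mov - mean) / (std_dev * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


for mov in (-10, 0, 10):
    print(mov, round(normal_win_probability(mov), 3))
```

For a 10-point favorite this gives about 0.841, close to the 83.1% read off the binned chart.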
Whether you use a simple linear equation or something more complex, you can now tweak your equation to create a new strategy. For example, if I tweak my normal distribution to have a standard deviation of only 6, I get a new curve:
The effect of using this curve is to increase the confidence in my picks -- essentially to gamble that I'm going to be right more than I have been in the past.
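To see the size of that gamble, here's a quick comparison of the two curves using `statistics.NormalDist` from the Python standard library (3.8+). The variable names are just for illustration.

```python
from statistics import NormalDist

baseline_curve = NormalDist(mu=0.0, sigma=10.0)   # the eyeballed fit
confident_curve = NormalDist(mu=0.0, sigma=6.0)   # the tweaked, more confident entry

for mov in (-10, -3, 0, 3, 10):
    baseline = baseline_curve.cdf(mov)
    confident = confident_curve.cdf(mov)
    print(f"MOV {mov:+3d}: sd=10 -> {baseline:.3f}, sd=6 -> {confident:.3f}")
```

The narrower curve leaves a pick'em game at 50 percent but pushes every other prediction further from 50 percent, so favorites get higher probabilities and underdogs lower ones.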