Net Prophet: April 2011

Friday, April 29, 2011

Trueskill

We'll now turn our attention to the first of the "RPI Alternatives" we will evaluate. Like RPI, all of these approaches make use of only the won-loss record in rating teams.

The first approach is Microsoft's "Trueskill" rating system. The Trueskill system was developed at Microsoft Research to rank players on XBox Live. Ranking players there poses a number of unique challenges. First, there are many players, so most players have not played against each other. Second, many of the games on XBox Live are multiplayer or team games, and players may join or drop out of the game at different points in the game play. Finally, Trueskill is in some sense intended to be a predictive system. Its main use is to predict how competitive a game will be between two players.

As a rating system, Trueskill also has a couple of unique features. For each player it calculates not only a rating, but also an explicit uncertainty in the rating. After one game, Trueskill will provide a rating for the teams involved, but the uncertainty in the ratings will be high. As more games are played, the uncertainties shrink.

What happens when a player performs better or worse than expected? Trueskill can either move the player's rating, or change the uncertainty, or some of both. There is explicit provision in the Trueskill system to tune this tradeoff.

Finally, Trueskill also accommodates games that can end in a draw.

The Trueskill system is based upon Bayesian inferencing. The fundamental ideas are not hard to grasp, but the details can be daunting. Fortunately, Jeff Moser has provided a very clear tutorial on Trueskill, which you can find here. Jeff also provides an implementation of Trueskill in C#, and was instrumental in helping me create the Trueskill implementation in Lisp, which you can download here.

College basketball is simpler than XBox Live in that we don't have to worry about multiplayer games or players dropping out before a game is finished. So to test Trueskill for predicting college basketball games, I was able to implement the simplest Trueskill algorithm: one that deals with just two player games.

As mentioned above, Trueskill accommodates draws. This is nice, since we showed earlier with RPI that it improved prediction accuracy to consider some games as draws. It's worth noting that Trueskill treats draws somewhat differently than we did with RPI. In the RPI tweak, we set an MOV cutoff and ignored games that fell below that cutoff. In Trueskill, there is a similar cutoff that identifies drawn games, and these games are ignored when updating a team's rating. However, Trueskill also uses the likelihood of a draw to control the impact a non-drawn game has on a team's ratings. For example, if draws are very likely, then a non-drawn game has a big impact upon ratings. This makes intuitive sense -- if it's very hard to get a decisive win, then that should be strong evidence that the winning team is better than the losing team.
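To make the role of the draw likelihood concrete, here is a minimal Python sketch of the standard two-player Trueskill update for a decisive (non-drawn) game. It is not the Lisp implementation mentioned above. The starting values (rating 25, uncertainty 25/3, beta 25/6) are the commonly published defaults and are an assumption here, and the sketch leaves out the small "dynamics" term that Trueskill normally adds to the uncertainty before each game.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

# Assumed defaults (commonly published Trueskill starting values,
# not necessarily the settings used for the experiments in this post).
MU0 = 25.0
SIGMA0 = MU0 / 3.0
BETA = MU0 / 6.0


def draw_margin(draw_probability, beta=BETA):
    """Convert a draw likelihood into a margin on the performance scale (two players)."""
    return STD_NORMAL.inv_cdf((draw_probability + 1.0) / 2.0) * sqrt(2.0) * beta


def update_decisive(winner, loser, draw_probability, beta=BETA):
    """Return updated (mu, sigma) pairs for the winner and loser of a non-drawn game."""
    (mu_w, sigma_w), (mu_l, sigma_l) = winner, loser
    c = sqrt(2.0 * beta**2 + sigma_w**2 + sigma_l**2)
    x = (mu_w - mu_l - draw_margin(draw_probability, beta)) / c
    v = STD_NORMAL.pdf(x) / STD_NORMAL.cdf(x)  # controls how far the means move
    w = v * (v + x)                            # controls how much the uncertainties shrink

    def shrink(sigma):
        return sigma * sqrt(1.0 - (sigma**2 / c**2) * w)

    new_winner = (mu_w + (sigma_w**2 / c) * v, shrink(sigma_w))
    new_loser = (mu_l - (sigma_l**2 / c) * v, shrink(sigma_l))
    return new_winner, new_loser
```

Calling `update_decisive` for two fresh teams with `draw_probability=0.44` moves the winner's rating further than the same call with `draw_probability=0.0`, which is the effect described above: the more likely a draw, the more a decisive result counts.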

To make use of draws, we need to tell Trueskill what point differential counts as a draw, as well as how likely draws are to occur. I used the last three seasons of games to determine the likelihood that a game will be decided by "N" or fewer points:

Point Spread	Likelihood
1 Point	4.5%
2 Points	10.5%
3 Points	17.0%
4 Points	22.3%
5 Points	28.7%
6 Points	34.0%
7 Points	39.1%
8 Points	44.0%
9 Points	49.3%
10 Points	54.2%
11 Points	58.6%
12 Points	62.7%
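A table like this can be built from nothing more than a list of final margins. The sketch below shows one way to do it in Python; the sample margins are made up for illustration and are not the three seasons of data used for the table.

```python
from collections import Counter


def close_game_likelihood(margins, max_points=12):
    """Map N -> fraction of games decided by N or fewer points, for N = 1..max_points."""
    counts = Counter(abs(m) for m in margins)
    total = len(margins)
    table = {}
    running = 0
    for n in range(1, max_points + 1):
        running += counts[n]
        table[n] = running / total
    return table


# Hypothetical margins for illustration only (not the real game data).
sample_margins = [1, 3, -2, 10, 15, -7, 4, 22, -1, 8]
table = close_game_likelihood(sample_margins, max_points=3)
```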

We can then test with draws set at various levels to find the best performance. We test here just as we did with RPI -- our main measure of performance is the error in predicted Margin of Victory (MOV):

Draw	Performance
2 Points	11.29
3 Points	11.24
4 Points	11.19
5 Points	11.16
6 Points	11.13
7 Points	11.13
8 Points	11.09
9 Points	11.10
10 Points	11.13

As this table shows, best performance is achieved with an amazingly high level of draws -- 8 points, which drops 44% of the played games from consideration. The performance of the Trueskill algorithm at this setting is also significantly better than our best RPI algorithm:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (infinite depth, mov-cutoff=1)	72.1%	11.30
Trueskill (draw=8 points)	72.8%	11.09

To this point, we've made no use of the uncertainty measure that the Trueskill algorithm provides. And it isn't clear how to make use of it. We're using the Trueskill ratings for each team as inputs to a linear regression, resulting in an equation that looks something like this:

MOV = 1.514*HTrueskill - 1.462*ATrueskill + 2.449

Adding the uncertainty measures to this regression doesn't seem like it will add any useful information for determining the MOV, and indeed, when they are added they get optimized out of the regression. Another approach is exemplified by Microsoft's Leaderboard, which uses a "conservative" strength estimate calculated by subtracting three times the uncertainty measure from the rating. Using this as the input to our predictor also fails to improve our predictive accuracy. Similar experiments with adding the uncertainty to the home team and subtracting it from the away team, etc., all fail to provide any improvements. So while the explicit uncertainty might prove to be useful for a more sophisticated predictor, it doesn't appear to provide any value for a simple linear regression.
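For reference, the regression equation and the "conservative" estimate described above can be written as two small Python functions. The coefficients are copied from the equation in this post; the function names and the example rating of 25 are only illustrative.

```python
# Coefficients copied from the regression equation quoted in the post.
HOME_COEF = 1.514
AWAY_COEF = 1.462
INTERCEPT = 2.449


def predict_mov(home_rating, away_rating):
    """Predicted margin of victory from the home team's point of view."""
    return HOME_COEF * home_rating - AWAY_COEF * away_rating + INTERCEPT


def conservative_rating(mu, sigma, k=3.0):
    """Leaderboard-style estimate: the rating minus k times its uncertainty."""
    return mu - k * sigma
```

With both teams at a hypothetical rating of 25, `predict_mov(25.0, 25.0)` comes out to about 3.7 points in favor of the home team.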

Another area of tweaking we can look at for Trueskill is home court advantage (HCA). As with RPI, the linear regression effectively adjusts for the home court advantage, but in this case manual adjusting might provide some additional benefits. If we adjust for HCA by (say) subtracting points from the home team's score, it will affect which games are considered "draws" by the Trueskill algorithm, and this may lead to improved accuracy. Here is the performance with the HCA set to various values, including the Dick Vitale Approach:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
Trueskill (draw=8 points)	72.8%	11.09
Trueskill (draw=8 points, HCA=2.5)	72.5%	11.13
Trueskill (draw=8 points, HCA=3.5)	72.4%	11.13
Trueskill (draw=1 point, Vitale)	70.0%	12.09
Trueskill (draw=8 points, Vitale)	70.8%	11.80

I could find no adjustment for HCA which improved the predictive performance.
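The interaction between the HCA adjustment and the draw cutoff is easy to see in code. This Python sketch subtracts the HCA from the home score and then decides whether the game counts as a draw. Treating a margin exactly equal to the cutoff as a draw is an assumption, since the exact comparison isn't spelled out above.

```python
def classify_game(home_score, away_score, draw_points, hca=0.0):
    """Label a game 'home', 'away' or 'draw' after removing the home court advantage.

    Assumption: a game counts as a draw when the adjusted margin is at most
    draw_points in either direction.
    """
    adjusted_margin = (home_score - hca) - away_score
    if abs(adjusted_margin) <= draw_points:
        return "draw"
    return "home" if adjusted_margin > 0 else "away"
```

For example, a 70-60 home win is decisive with an 8 point draw cutoff, but becomes a draw once 2.5 points of HCA are removed from the home score.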

The other tweaks we tried for RPI (such as weighting recent games more heavily) do not easily apply to the Trueskill algorithm.

Thursday, April 28, 2011

Testing Methodology Redux

For the RPI experiments, I used a testing methodology that tried each RPI variant against the same set of approximately 10K training games and 500 test games. This had the advantage of being fast and repeatable. However, it has the disadvantage that performance on the 500 test games might not accurately estimate the general performance. That is, we might have a tweak that (for whatever reason) happens to perform very well (or very poorly) on that particular set of test games. As we move forward into other ratings and more complex models, we'd like to avoid that problem.

To do that, we can test our algorithms on several different test sets. The general approach is called cross-validation. The basic idea is to split the input set into a large training set and a smaller test set, train the algorithm on the training set, test it on the test set, and then repeat for a new training & test set. The more test sets we use, the closer we can come to accurately estimating the true performance of the algorithm. The drawback is that testing becomes slower because of the repeated training-testing loop.
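The train/test loop can be sketched in a few lines of plain Python. This is a generic k-fold version written for illustration, not what RapidMiner does internally; `train` and `evaluate` are placeholder callbacks that stand in for whatever model and performance measure are being tested.

```python
import random


def k_fold_splits(examples, k, seed=0):
    """Yield (training, test) lists; each example lands in exactly one test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield training, test


def cross_validate(examples, k, train, evaluate):
    """Average evaluate(model, test) over k train/test splits."""
    scores = [evaluate(train(training), test)
              for training, test in k_fold_splits(examples, k)]
    return sum(scores) / len(scores)
```

The cost mentioned above is visible here: `train` is called once per split, so k splits mean k rounds of training.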

RapidMiner provides cross-validation as a standard operator. This picture shows the top-level process flow I am using in testing:

The flow begins at the top left, where the entire set of game data is read in from a CSV (comma-separated values) file. Each game has the date, home team, away team, scores, etc., as well as the computed ratings to be tested -- for example, we might have the basic RPI rating for each team. (The computed ratings are produced by a Lisp program that must be run before the cross-validation.)

The game data is then subject to some preprocessing to get it ready for use in the predictive model. The first step is to generate a unique ID for each game. (This is useful if we split the data from a game into two parts and want to later re-join the parts.) Next is a "Generate Attributes" operator, which takes the home team score and the away team score and creates a derived attribute called MOV (the Margin of Victory). The "Set Role" operator then marks this new attribute as the "label" in our data -- the label is what we are trying to predict. Finally, we use a "Select Attribute" operator to select only those attributes in the game data that we want to use as inputs to our predictive model. For example, if we are testing RPI, we'd select only the home team's RPI and the away team's RPI (along with the label) as inputs to the model.

The yellow Validation operator encapsulates the cross-validation process. It takes as inputs the preprocessed training data and outputs a model, the same training data, and some computed performance estimates. In RapidMiner, we can drill down inside this operator to look at the cross-validation process:

This process is divided into two halves: Training and Testing. RapidMiner takes care of splitting the data into a training set and a testing set.

The training set is fed into the Training side of the cross-validation as the input. The Training side takes this input and produces a model. In this case, we're using a Linear Regression operator that takes the training data and produces a linear regression that best fits the input attributes (Home RPI, Away RPI) to the label data (the Margin of Victory). The model is then output from the Training side and passed over to the Testing side.

The Testing side of the cross-validation takes the model and the test data and outputs performance values. In this case, we use the "Apply Model" operator to apply the model from the Training side to the test data. This produces "labelled data" -- the test data with an added attribute called "Prediction(MOV)".

In this case, I want to keep track of two measures of performance: the number of correctly predicted games, and the root mean squared error in the MOV prediction. To do this, a copy is made of the labelled data using the Multiply operator. One copy is sent down to the yellow Performance operator at the bottom of the figure. This is a built-in RapidMiner operator that calculates root mean squared error from the labelled data. This result is then pushed out as a performance measure for this model.

The other copy of the labelled data is sent into the pink boxes at the top of the figure. These boxes rename the "Prediction(MOV)" attribute, generate a new attribute which is 1 if the prediction was correct and 0 otherwise, and then aggregate (sum) the new attribute. This sum is then converted to a performance value and pushed out as a second performance measure for this model.
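Outside of RapidMiner, the two performance measures amount to the following Python sketch. It assumes a positive MOV means the home team won, and that a prediction is "correct" when its sign matches the sign of the actual MOV.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted margins."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def count_correct(actual, predicted):
    """Number of games where the predicted winner matches the actual winner.

    Assumption: a prediction is correct when its sign matches the actual
    MOV (positive means the home team won).
    """
    return sum(1 for a, p in zip(actual, predicted) if a * p > 0)
```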

When this process is run, RapidMiner splits the data into training and test sets according to the parameters of the Cross-Validation operator (in this case, I'm doing 100 cross-validations), runs the Training/Testing process on each data set, and averages the performance measures across all the sets:

RapidMiner can also produce a variety of plots of attributes and results, such as this:

This can be handy for debugging or to visualize the data.

Tuesday, April 26, 2011

RPI Recap

I want to take a posting to recap the various approaches we've looked at over the past few weeks. Although I've been applying them to the RPI algorithm, I believe they'll continue to be generally useful as we look at more complex prediction approaches.

(1) Measure

The initial insight was to have a clear notion of what we're trying to achieve and then select objective metrics to measure progress. The importance of this was repeatedly evident as we looked at various RPI tweaks. In many cases, "obvious" improvements to RPI turned out to be no improvement at all. In another case, we corrected a math error in RPI that turned out not to be an error at all (or at any rate didn't improve predictive performance).

(2) Home Court Advantage

There is a proven and significant home court advantage in college basketball. We looked at several ways to account for HCA, but in the end our predictive model captured it better than we could with an a priori solution. For example, the linear regression for one of our RPI tweaks looked like this:

MOV = 85.414 * Hrpi - 80.508 * Arpi + 1.580

The different coefficients for the home team's RPI (85.414) and the away team's RPI (80.508), as well as the constant bias (1.580) combine to model the home court advantage.
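One way to read the home court advantage out of this equation is to ask what it predicts when the two teams have the same RPI. The short Python sketch below does that; the coefficients come from the equation above, while the RPI value of 0.5 is just a convenient example.

```python
HOME_COEF = 85.414
AWAY_COEF = 80.508
BIAS = 1.580


def implied_hca(rpi):
    """Predicted home margin when both teams share the same RPI."""
    return (HOME_COEF - AWAY_COEF) * rpi + BIAS


edge_for_average_teams = implied_hca(0.5)
```

For two teams with an RPI of 0.5 this works out to about 4.0 points for the home team, and the advantage grows slightly as the shared RPI rises.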

The lesson here is that the HCA is important to accurate prediction, and we need to ensure that either our model accommodates it naturally (as in the case above) or that we otherwise account for it in our data. For the latter case we looked at a number of possible tools: weighting the home record differently, applying a point bias to the home team, or splitting a team into a "home team" and "away team" component.

(3) Strength of Opponent

One of the paradoxical challenges of assessing a team's strength is that you need to know the strength of the teams it has played. It's a classic Catch-22.

One way to break this death spiral is to base the strength metric on some other measure. This is what RPI does -- RPI tells us how strong a team is, but RPI itself ultimately depends on won-loss records. By recursively finding the winning percentage for opponents, and opponents' opponents, RPI tries to estimate the true strength of a team.

One useful improvement we found on this technique is to carry out this recursion "infinitely" by using an iterative solution. We can use another measure as an initial estimate of our strength metric, and then iteratively adjust the metric until we get the values that best match the actual performance so far.

(4) Data Filtering

Our prediction algorithms are based largely on past performance. One approach we used with some success for RPI was to filter the past performance to eliminate games that reduced our prediction accuracy. In the case of RPI, we found it was useful to eliminate games where the MOV was 1 point. In general, we may want to look at a variety of different filtering approaches (e.g., eliminate blow-out games, pre-season tournament games, etc.).

(5) Modeling Changing Performance

As we build predictive models, we have to consider whether a team's performance changes substantially over the course of a season. (It clearly does so from season to season.) If it does, then our predictive accuracy might be better if we discount older games when building our models. Another intriguing possibility is whether we can identify specific events where a team's performance changed substantially, e.g., when we notice the minutes played by specific players change significantly due to an injury or other reason.

In the case of RPI, we weren't able to improve our accuracy by weighting recent games, but other methods or different approaches may yet prove valuable.

Next up we'll start taking a look at some alternative methods for rating teams that use only won-loss records. There are a number of candidates, and we'll be looking to see if any of them provide a significant advantage over our (tweaked) RPI.

Monday, April 25, 2011

The Recency (Non-)Effect

Our final RPI tweak (unless I'm lying again) will look at the impact of recent results on predictability. It's reasonable to suppose that a team's level of performance might change during the season -- that is, it might get better or worse as the season goes along. In this case, recent games might be a better predictor of future performance than older games.

To test this notion, we can modify our RPI calculations so that they take into account only the last "N" games. For some representative values of "N", that yields these results:

Predictor	N	% Correct	MOV Error
1-Bit		62.6%	14.17
RPI (nw, 15+15+70)		76.8%	11.46
RPI (nw, 15+15+70)	4	67.6%	12.52
RPI (nw, 15+15+70)	8	72.8%	11.89
RPI (nw, 15+15+70)	16	74.2%	11.47

Restricting the RPI calculations to the most recent games has a strong negative impact on predictive power. Of course, this method is fairly drastic: it gives 100% value to the most recent games and 0% to anything older. A more nuanced approach would count the most recent games more, but not discount entirely the older games. Something like a weighted moving average would be ideal, but it isn't entirely obvious how to apply that to RPI. Instead, we'll take the approach of counting the most recent games more than once. That is, we'll treat each team as if it played its most recent games multiple times (with the same results). This will have the effect of weighting those games correspondingly more.
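Both ideas, the hard "last N games" window and the repeat-the-recent-games weighting, are simple list operations. The Python sketch below assumes each team's games are stored oldest first and that N is at least 1. It only handles whole-number repeat counts.

```python
def last_n_games(games, n):
    """Keep only a team's n most recent games (games are ordered oldest first)."""
    return games[-n:]


def emphasize_recent(games, n, repeats):
    """Count every game once, then count the n most recent games `repeats` more times."""
    return list(games) + list(games[-n:]) * repeats
```

For a team whose results are W, L, W, W, repeating the last two games once turns a 3-1 record into an effective 5-1 record when the winning percentage is computed.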

This table shows the impact of repeating some number of recent games some number of times (in addition to counting all games once):

Predictor	N	Repeats	% Correct	MOV Error
RPI (nw, 15+15+70)	4	1	72.8%	11.63
RPI (nw, 15+15+70)	8	1	74.2%	11.56
RPI (nw, 15+15+70)	16	1	74.6%	11.49
RPI (nw, 15+15+70)	4	1/3	73.6%	11.64
RPI (nw, 15+15+70)	8	1/3	75.4%	11.58
RPI (nw, 15+15+70)	16	1/3	75.2%	11.51
RPI (nw, 15+15+70)	2	2	70.6%	12.49
RPI (nw, 15+15+70)	1	2	66.2%	12.82

In no case that I could find did weighting recent games improve performance over the baseline. Putting emphasis on a small number of recent games is particularly bad; this suggests that if teams do change performance over the course of the season it is only slowly.

Next time (unless something shiny distracts me again), we'll sum up the various tweaks we've tried with RPI.

Friday, April 22, 2011

MOV Cutoff Filter

The next tweak we'll look at for RPI goes back to our previous discussion about the limits of prediction. We noted there that the last possession of the game could swing the final game score by 6 points. Even if it's hard to quantify exactly, there's a certain random component in final game scores. That's particularly a concern for rating systems that rely only on won-loss records, because a swing of a few points in a close game could change a game from a win to a loss. Intuitively at least, we might want to discount close games when calculating RPI, under the theory that they're not really good evidence that one team was better than the other.

The easiest way to do this is to filter out all games where the final MOV was less than some threshold when computing a team's RPI. Making that change and applying it with various thresholds to our current best "% Correct" RPI variation gives these results:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (unw,15+15+70)	75.4%	11.49
RPI (nw, 15+15+70, mov-cutoff=1)	76.8%	11.46
RPI (nw, 15+15+70, mov-cutoff=3)	75.4%	11.56
RPI (nw, 15+15+70, mov-cutoff=8)	73.4%	11.62
RPI (nw, 15+15+70, mov-cutoff=12)	70.0%	11.98

Filtering out all games that were decided by 1 point provides a big improvement in "% Correct" and a small improvement in "MOV Error". It's also interesting to note how resilient RPI is to removing games. MOV cutoff = 12 removes more than 40% of the games and only introduces a few percent more error.
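The filter itself is a one-liner. In this Python sketch the cutoff is treated as inclusive, so that a cutoff of 1 removes one-point games as described above. That reading is an assumption, as is the (home score, away score) representation of a game.

```python
def filter_close_games(games, mov_cutoff):
    """Drop games decided by mov_cutoff points or fewer.

    Each game is a (home_score, away_score) pair. Assumption: the cutoff is
    inclusive, so mov_cutoff=1 removes one-point games.
    """
    return [(h, a) for h, a in games if abs(h - a) > mov_cutoff]
```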

We can try the same technique with our current best "MOV Error" variation (the infinitely deep RPI):

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (improved)	74.6%	11.33
RPI (improved, mov-cutoff=1)	74.2%	11.31
RPI (improved, mov-cutoff=3)	74.6%	11.43
RPI (improved, mov-cutoff=8)	73.4%	11.36
RPI (improved, mov-cutoff=12)	72.0%	11.79

Again, an MOV cutoff of 1 provides a (small) improvement in the MOV Error.

It turns out I lied in the last post when I said the MOV cutoff would be the last tweak we examined for RPI. In the next posting, I'll take a quick look at calculating RPI using a rolling window that only counts the last "N" games, to see if there's a usable "recency" effect in teams' performances.

Thursday, April 21, 2011

A Tale of Two Demons

We now turn our attention to one of the most vexing aspects of RPI, illustrated this season by the Tale of Two Demons: the first being the DePaul Blue Demons and the second being the Northwestern State Demons.

DePaul University finished the season 7-24 and a miserable 1-18 in conference. DePaul's wins included 4-25 Chicago St., 8-21 Northern Illinois and 9-21 Central Michigan.

In contrast, Northwestern St. finished the season 16-13 with a respectable 10-7 record in conference. They split home-and-home with Southland East Division champion McNeese St. and came within 1 point of winning the conference tournament and advancing to the NCAAs.

Yet curiously, DePaul has an RPI of 0.4590 and Northwestern State an RPI of 0.4566! How does this happen?

Recall the oft-cited formula for RPI:

RPI = (WP * 0.25) + (OWP * 0.50) + (OOWP * 0.25)

The biggest factor in this equation is the Opponents' Winning Percentage. But critically, the opponents' winning percentage is calculated from all of a team's opponents. So DePaul University benefits more from their 18 losses to strong Big East opponents than Northwestern State does from its winning record in the Southland Conference.
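To see how the three terms fit together, here is a simplified RPI calculation in Python. It is a sketch, not the NCAA's exact procedure: it ignores home/away weighting, and an opponent's winning percentage includes its games against the team being rated. The input is just a list of (winner, loser) pairs.

```python
from collections import defaultdict


def basic_rpi(games, weights=(0.25, 0.50, 0.25)):
    """Compute a simplified RPI for every team from (winner, loser) pairs.

    Simplifications (assumptions of this sketch): no home/away weighting, and
    an opponent's winning percentage includes its games against the team
    being rated.
    """
    wins = defaultdict(int)
    played = defaultdict(int)
    opponents = defaultdict(list)
    for winner, loser in games:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)

    def mean(values):
        return sum(values) / len(values)

    wp = {team: wins[team] / played[team] for team in played}
    owp = {team: mean([wp[o] for o in opponents[team]]) for team in played}
    oowp = {team: mean([owp[o] for o in opponents[team]]) for team in played}
    w_wp, w_owp, w_oowp = weights
    return {team: w_wp * wp[team] + w_owp * owp[team] + w_oowp * oowp[team]
            for team in played}
```

Because OWP carries half the weight, every opponent's record is averaged in whether the team won or lost that game, which is exactly how a team with many losses to strong opponents can end up with a respectable number.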

No doubt when the NCAA concocted this aspect of the RPI formula, they were thinking of the case where a team has run up a good record against a bunch of patsies. In that case, the team's RPI gets docked because the OWP is low; and that makes sense because those are also (mostly) opponents the team has beaten. (The NCAA might also have been intentionally motivating teams to play strong out-of-conference schedules.) But it certainly seems counter-intuitive to give a team more credit for being beaten by good teams than for beating mediocre teams. And surely it makes for worse predictability.

Doesn't it?

Well, time to roll out the code and test. A reasonable first approach is to calculate OWP as the average of all the opponents a team actually beat, rather than all the opponents. (A similar reasoning applies to OOWP.) Let us try that approach using our current best RPI formula as well as the "Infinite Depth" RPI:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (unw,15+15+70)	75.4%	11.49
RPI (unw,15+15+70,winners)	72.0%	12.20
RPI (infinite)	74.6%	11.33
RPI (infinite,winners)	74.0%	11.97

In both cases, this change makes for a worse predictor! That's hard to comprehend. Essentially, this says that losing to good teams makes a team more likely to win future games. While I can certainly come up with a rationalization for that (e.g., playing good teams makes you a better team, even if you lose), it's hard to put much faith in it. Still, the numbers don't lie.

As a second approach, we can break out the OWP for both the opponents we've beaten as well as the opponents that beat us. So we'll have a "Winners OWP" and a "Losers OWP". Instead of completely dropping the OWPs of the teams that beat us, we can include them as a less-weighted factor. Weighting the beaten opponents by about 3x the "lost to" opponents does improve performance over just the beaten opponents (to 369 correct out of the 500 test games, or 73.8%, with an MOV Error of 11.47, for the unweighted, 15+15+70 version of RPI) but not enough to make it better than the standard versions. So (at least for this approach) we have to conclude that RPI may be right about the relative strengths of the DePaul Blue Demons and the Northwestern St. Demons. Apparently playing good opponents, even if you lose to them, makes you a stronger team.
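The OWP variants discussed in this post can be expressed as one weighted average. The Python sketch below is one possible reading of the idea, not necessarily how the numbers above were produced: setting `lost_weight` to 0 gives the "beaten opponents only" version, and a `beaten_weight` of 3 with a `lost_weight` of 1 gives the 3x weighting.

```python
def weighted_owp(team, games, wp, beaten_weight=1.0, lost_weight=1.0):
    """Opponents' winning percentage with separate weights for wins and losses.

    games is a list of (winner, loser) pairs and wp maps team -> winning
    percentage. Equal weights reproduce the ordinary average over all opponents.
    """
    beaten = [loser for winner, loser in games if winner == team]
    lost_to = [winner for winner, loser in games if loser == team]
    total = (beaten_weight * sum(wp[o] for o in beaten)
             + lost_weight * sum(wp[o] for o in lost_to))
    weight = beaten_weight * len(beaten) + lost_weight * len(lost_to)
    return total / weight if weight else 0.0
```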

We've about beaten RPI to death at this point, but we'll take a look at one more tweak before moving on to look at some other ratings that also make use of only won-loss records.

Wednesday, April 20, 2011

Infinitely Deep RPI

In the previous posting, we looked at extending the "depth" of RPI to an additional level -- that is, including OOOWP in our RPI calculation. In this posting, we'll look at extending RPI to "infinite" depth.

We can view the RPI for a team as having two components. The first is a measure of the team's strength, and the second is a measure of its opponents' strength:

RPI = (Team's Strength) + (Team's Opponents' Strengths)

The standard RPI formula estimates the first term with WP and the second term with OWP and OOWP. But a better estimate of a team's strength is the RPI itself, so we could use that instead, at least for the opponents:

RPI = (Team's Strength) + (Average of Team's Opponents' RPIs)

But now we have a weirdly self-referential equation: Calculating Duke's RPI will involve calculating UNC's RPI which will involve calculating Duke's RPI, ad infinitum.

We can solve this sort of self-referential formula iteratively -- that is, we can calculate an initial RPI estimate for all the teams and then repeat that process using the latest RPI estimates at each step until we reach an answer. There are two requirements for this to be successful. First, we need "bootstrap" values to get ourselves started, and secondly, we need to structure our formula so that the answers converge.

Conveniently for us, Andrew Dolphin has already done the hard work of determining a formula that meets those criteria:

RPI = (WP-0.5) + (Average of Team's Opponents' RPIs)

so all we have to do is calculate. Plugging this formula into our framework and testing gives us:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (unw,15+15+70)	75.4%	11.49
RPI (infinite)	74.6%	11.33

This gives us only slightly better performance (in MOV Error) than the RPI that stops at OOOWP, suggesting that the standard RPI formula depth (to OOWP) is probably sufficient for our purposes.
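The iterative calculation described in this post can be sketched in Python as follows. It applies the formula RPI = (WP-0.5) + (average of opponents' RPIs) repeatedly until the values stop changing. Starting every team at 0.0 is an assumption (the bootstrap values aren't given above), and the loop is capped in case a particular schedule never settles.

```python
from collections import defaultdict


def infinite_depth_rpi(games, tolerance=1e-9, max_iterations=1000):
    """Iterate RPI = (WP - 0.5) + average of opponents' RPIs until it settles.

    games is a list of (winner, loser) pairs. Starting every team at 0.0 is
    an assumption of this sketch.
    """
    wins = defaultdict(int)
    opponents = defaultdict(list)
    for winner, loser in games:
        wins[winner] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)

    base = {t: wins[t] / len(opps) - 0.5 for t, opps in opponents.items()}
    rpi = {t: 0.0 for t in opponents}
    for _ in range(max_iterations):
        updated = {t: base[t] + sum(rpi[o] for o in opps) / len(opps)
                   for t, opps in opponents.items()}
        change = max(abs(updated[t] - rpi[t]) for t in rpi)
        rpi = updated
        if change < tolerance:
            break
    return rpi
```

For a toy three-team schedule where A beats B and C, and B beats C, the values settle at roughly 1/3 for A, 0 for B and -1/3 for C.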

Next we'll look at one of the most perplexing oddities of RPI.

Tuesday, April 19, 2011

New Data on Limits & Other Interesting Links

In a previous posting, I looked at the limits of prediction and concluded that the best performance we could hope for from a predictor would be in the 70-80% range for correct predictions. Today I happened to run across "The Prediction Tracker" which (amongst other things) tracks the performance of various college basketball rating systems as predictors of game performance. For the season that just ended, the best computer predictor belonged to Jon Doktor, and managed a "% Correct" measure of 73%. All of the predictors tracked by that site cluster in the lower end of the 70-80% range. Given that they include early season games, that's fairly solid performance. It's also interesting to note that (1) none of the predictors managed even a 1% advantage betting against the spread, and (2) all of them had MOV errors in the 9-10 point range. (Most of these predictors use margin of victory, so we would expect them to perform better on MOV than systems like RPI which use only win-loss.)

I got to the Prediction Tracker via the TeamRankings.com blog, which has a 4 part series discussing their rating systems starting here. The discussion lacks any concrete details on the algorithms but covers some interesting ground and is worth a look.

OOOWP

So far, we have explored various shortcomings of the RPI: home court advantage, averaging, and the distribution of the elements. We've found a few tweaks that have improved the performance of RPI as a predictor. We turn now to yet another potential area of improvement: the depth of evaluation.

Recall the (revised) formula for RPI:

RPI = 0.23*WP + 0.23*OWP + 0.54*OOWP

The last two terms of this formula can be thought of as a measure of a team's "Strength of Schedule" expressed as the winning percentage of a team's opponents and their opponents. RPI arbitrarily stops evaluating this "Strength of Schedule" term at two levels. Does extending this to more levels (e.g., OOOWP) add any predictive value?

The answer turns out to be yes and no. With a formula of approximately:

RPI = 7*WP + 7*OWP + 7*OOWP + OOOWP

we get a performance of:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (unw,15+15+70)	75.4%	11.49
RPI (+ooowp)	74.6%	11.36

This reduces the MOV Error but doesn't improve % Correct.

So extending the depth of RPI another step provides at least some value. This raises the natural question: is there value in extending it yet another step? ...and another step?

While we could certainly manually explore those possibilities by calculating OOOOWP, etc., it's perhaps better to cut to the chase and ask whether we can extend the depth of RPI infinitely, and see what predictive value that has. It may seem counter-intuitive, but it's possible to extend RPI to an "infinite" depth, although it requires a different computational approach.

Monday, April 18, 2011

RPI Distribution

In our previous ill-fated excursion triggered by Dick Vitale, we mentioned briefly the formula for combining the terms of RPI:

RPI = (WP * 0.25) + (OWP * 0.50) + (OOWP * 0.25)

This formula is pleasingly symmetrical but it isn't obvious at a glance why the terms are weighted as they are. Andrew Dolphin provides a (possible) explanation on his web page: Essentially, the numbers are chosen to make the formula the best approximation for an "ideal" RPI that went to an infinite depth, i.e., included terms for OOOWP, OOOOWP, etc. The "proper" weightings are determined by the ratio of conference games to non-conference games, and for basketball Dolphin gives the following ideal formulas:

RPI = 0.27*WP + 0.46*OWP + 0.27*OOWP
or
RPI = 0.23*WP + 0.23*OWP + 0.54*OOWP

The first is very close to the actual RPI formula, so it's possible that the NCAA chose that weighting intentionally. As an experiment we can try the second alternative to see if that provides better performance:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (unweighted)	74.6%	11.53
RPI (unweighted, 23+23+54)	75.0%	11.37

And indeed it does, improving in both metrics over the unweighted RPI. A quick hill-climbing experiment with other values for the distribution hits upon this alternative:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (unweighted)	74.6%	11.53
RPI (unweighted, 15+15+70)	75.4%	11.49

which improves the number of correct predictions even further, at the cost of a slight degradation in MOV error. At this point, I'm favoring improvements that increase the % Correct over the MOV Error, so we'll take this as our new highwater mark.

Of course, since we're using the RPIs of both teams as inputs to a predictor, there's no reason that we have to combine the elements of the RPI at all. We can feed the elements of the RPI (e.g., the home team's WP, OWP and OOWP as well as the away team's WP, OWP and OOWP) directly into our predictor and let it choose the best weightings. This also has the advantage of not requiring the same weightings for both home and away. As it turns out in this case that doesn't result in significantly better performance -- it drives the MOV error down slightly at the cost of % Correct -- but it's an option we can keep in our back pocket.

Sunday, April 17, 2011

The Prophet Dick Vitale

We'll return for a moment to the issue of home court advantage.

RPI normally adjusts for home court advantage by weighting home games differently than away games, and we also looked at adjusting by points. In both cases we found that adjusting for HCA actually worsened the performance of our predictor.

A radically different approach to dealing with HCA was suggested by a comment from the fount of all basketball knowledge, Dick Vitale. During the 2011 season, commenting on St. John's -- a team that was hard to beat at home but weak on the road -- he said:

"They're a different team at home, baybee!"

This keen observation suggests that we could account for the HCA by treating each team as two different teams: a "St. John's at home" team and a "St. John's on the road" team. To do this, we calculate a "Home RPI" based upon the team's winning percentage at home, the team's opponents' winning percentage on the road, and the team's opponents' opponents' winning percentage at home. The "Away RPI" is calculated in the opposite manner.

Given that home teams win 2/3 of the games, you might expect teams to have better Home RPIs than Away RPIs. In fact, just the opposite happens. Recall that the formula for combining the parts of the RPI is:

RPI = (WP * 0.25) + (OWP * 0.50) + (OOWP * 0.25)

The OWP is the largest part of this calculation, and that's largest when we're looking at the opponents' home records, i.e., when we are calculating the "Away RPI". So in general, a team's "Away RPI" tends to be higher than its "Home RPI".
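To make the weighting concrete, here's a minimal sketch in Python (not the code used for these experiments). The `combine_rpi` helper and the component values are made up purely for illustration; the only thing taken from the post is the 0.25/0.50/0.25 formula.

```python
def combine_rpi(wp, owp, oowp, weights=(0.25, 0.50, 0.25)):
    """Combine the three RPI components using the standard weights."""
    w_wp, w_owp, w_oowp = weights
    return wp * w_wp + owp * w_owp + oowp * w_oowp


# Hypothetical component values for a single team (illustrative only).
# "Home RPI": home WP, opponents' road WP, opponents' opponents' home WP.
home_rpi = combine_rpi(wp=0.75, owp=0.35, oowp=0.65)

# "Away RPI": road WP, opponents' home WP, opponents' opponents' road WP.
away_rpi = combine_rpi(wp=0.55, owp=0.65, oowp=0.35)

print(f"Home RPI: {home_rpi:.3f}")
print(f"Away RPI: {away_rpi:.3f}")
```

With these invented numbers the team wins more often at home, but the double-weighted OWP term (opponents' home records) is enough to push the Away RPI above the Home RPI.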

For the 2011 season, the top five away teams were:

Team	ARPI
San Diego St.	0.715
Kansas	0.710
Ohio St.	0.703
BYU	0.695
UNLV	0.668

Reviewing these teams' road records, this looks fairly accurate.

Unfortunately, breaking RPI down into home and away doesn't seem to have a lot of predictive value. Using the home team's "Home RPI" and the away team's "Away RPI", we get this performance:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (unweighted)	74.6%	11.53
RPI (home-away)	69.0%	12.5

...considerably worse than our previous best performance. So Dickie V's advice turns out to be worthless. (But we shall return to consider Mr. Vitale's wisdom again in the near future, where it may prove more valuable.)

Saturday, April 16, 2011

Introduce Yourself

I see from the blog statistics that there's at least some folks reading this... Introduce yourself in the comments!

Averaging RPI

Previously, we looked at improving the Ratings Percentage Index (RPI) by fixing its treatment of the Home Court Advantage (HCA). We found that the best results were had by eliminating the HCA adjustments. There are some other approaches we can explore to improve RPI's treatment of home court advantage, but we'll turn now to another area of possible improvement.

The RPI consists of three terms: a team's winning percentage (WP), the winning percentage of the team's opponents (OWP), and the winning percentage of the team's opponents' opponents (OOWP). These latter two terms are defined in a curious way. They are not average values, but rather an "average of averages," e.g., OWP is computed by averaging the winning percentages of all the opponents. Suppose, for example, that UCLA plays three opponents: USC (4-1), Arizona (6-0) and Oregon (0-1). (USC and Arizona played in the preseason NIT.) OWP is calculated by averaging the WPs of these teams: (0.80+1.0+0)/3 = 0.60. In contrast, the true average comes from the opponents' combined 10-2 record: 10/12 = 0.83.
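The UCLA example is easy to check with a few lines of Python. This is just a sketch of the two calculations; the function names are my own.

```python
# Opponent records from the UCLA example, as (wins, losses).
opponents = {"USC": (4, 1), "Arizona": (6, 0), "Oregon": (0, 1)}


def average_of_averages(records):
    """RPI-style OWP: average each opponent's winning percentage."""
    pcts = [w / (w + l) for w, l in records]
    return sum(pcts) / len(pcts)


def pooled_average(records):
    """Plain average: total wins divided by total games."""
    wins = sum(w for w, _ in records)
    games = sum(w + l for w, l in records)
    return wins / games


records = list(opponents.values())
print(f"average of averages: {average_of_averages(records):.2f}")
print(f"pooled average:      {pooled_average(records):.2f}")
```

The gap between the two comes entirely from Oregon: its single loss counts as a full third of the average of averages, but as only one game out of twelve in the pooled record.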

About this, Paul Kislanko says:

This would be equivalent to defining a batting average in baseball by the average of the BA for each game played. A 0 for 5 day followed by a 3 for 4 day would give (.000 + .750) = .375 instead of 3 for 9 = .333. In basketball, a player in a 3-game tournament who hits 2 of 10 shots, then 3 of 6, then 4 of 10 would have a shooting percentage of (.200 + .500 + .400)/3 = .433, when in fact for the tournament she was 9 for 25 = .360.

There's no other formula in all of sports statistics that makes this mistake.

As far as I know, there's no reason that the NCAA chose to use an average of averages in calculating the RPI. And I'm not aware of any reason why one method should be preferred over another, although Kislanko's argument is certainly compelling on its face. It's certainly worth investigating which method provides the best predictions.

If we substitute averaging into our RPI algorithm, we get this performance:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (unweighted)	74.6%	11.53
RPI (unweighted+ave)	74.2%	11.63

This is worse performance, so the "average of averages" is the better choice (Kislanko's outrage over the poor math notwithstanding :-).

Another feature of how RPI calculates the OWP and OOWP is that if a team plays an opponent twice, that opponent's winning percentage is counted twice. This makes some sense -- certainly if we played two different teams with identical WPs we'd want to count them both when figuring the average strength of our opponents. But perhaps it could be argued that playing a team a second (or third) time shouldn't affect the overall strength of your opponents. Again, it is easier to test than to worry too much about a rationale. If we eliminate duplicate opponents when calculating RPI (still using averages), we get this performance:
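Here's a small sketch of what "eliminating duplicates" means for OWP. The schedule, team names and winning percentages are hypothetical, and the `owp` helper is only an illustration, not the code behind the table.

```python
def owp(schedule, win_pct, drop_duplicates=False):
    """Average the opponents' winning percentages over a schedule.

    schedule: list of opponent names, one entry per game played.
    win_pct:  dict mapping opponent name to winning percentage.
    """
    if drop_duplicates:
        # dict.fromkeys keeps the first occurrence of each opponent, in order.
        opponents = list(dict.fromkeys(schedule))
    else:
        opponents = list(schedule)
    return sum(win_pct[o] for o in opponents) / len(opponents)


# Hypothetical team that plays opponent "A" twice and "B" once.
schedule = ["A", "B", "A"]
win_pct = {"A": 0.75, "B": 0.25}

with_dupes = owp(schedule, win_pct)
without_dupes = owp(schedule, win_pct, drop_duplicates=True)
print(f"counting repeats: {with_dupes:.3f}")
print(f"ignoring repeats: {without_dupes:.3f}")
```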

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (unweighted)	74.6%	11.53
RPI (ave)	74.2%	11.63
RPI (nodupes)	74.2%	11.59
RPI (ave+nodupes)	74.2%	11.63

No improvement over straight unweighted RPI.

Before we leave this topic, let's perform another experiment. Instead of averaging, we could try using the median value for the OWP and OOWP. Imagine a team whose opponents have records of 3-0, 0-1 and 0-1. The average of averages of these is 0.33; the average of these is 0.60; and the median of the averages is 0.00. We could certainly construct a rationale for why using the median might be a good idea, but again there's really no a priori reason to prefer one over the other. But it seems worth a quick experiment:
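The three numbers in that example can be reproduced with Python's standard `statistics` module (a quick sketch, nothing more):

```python
from statistics import mean, median

# Opponent records from the example above, as (wins, losses).
records = [(3, 0), (0, 1), (0, 1)]

pcts = [w / (w + l) for w, l in records]

avg_of_avgs = mean(pcts)
pooled = sum(w for w, _ in records) / sum(w + l for w, l in records)
med = median(pcts)

print(f"average of averages: {avg_of_avgs:.2f}")
print(f"average:             {pooled:.2f}")
print(f"median of averages:  {med:.2f}")
```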

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (unweighted)	74.6%	11.53
RPI (medians)	72.8%	12.14
RPI (medians+nodupes)	71.4%	12.57

Both medians and medians without duplicates per above underperform the unweighted RPI. So it appears that averages are better than medians, and averages of averages the best of all.

The literature of sports ranking systems is full of long pages of carefully derived formulas ensuring the best and most accurate math. But the most sophisticated math in the world is of little use if it does not contribute to improving performance. Taking the "average of averages" might not make any mathematical sense, but since it performs better than the (mathematically) superior alternatives, we're happy to use it!

Friday, April 15, 2011

The Home Court Advantage

In the previous post, we started looking at RPI and found that it was a considerably better predictor than the 1-Bit Predictor. However, RPI has several obvious shortcomings. Will fixing these improve its performance as a predictor? Let's see!

The first area we can look at improving is accounting for Home Court Advantage. Recall that previously we showed HCA to give the home team about a 4.5 point advantage, or overall a +30% chance of winning. Beginning with the 2004-05 season, the NCAA added a correction to the RPI formula to account for this advantage. The correction weights a team's wins and losses differently depending upon where they were played. A home win is only worth 0.6 "wins", while a road win is worth 1.4 "wins". Conversely, a home loss costs 1.4 "losses", while a road loss is only 0.6 "losses".
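As a rough sketch of that correction in Python: the `weighted_wp` function and the five-game record below are hypothetical, and I'm assuming every game is either "home" or "road" (neutral-site games aren't handled here).

```python
def weighted_wp(games, minor=0.6, major=1.4):
    """Winning percentage with location-weighted wins and losses.

    games: iterable of (site, won) pairs, where site is "home" or "road".
    A home win or road loss counts `minor`; a road win or home loss
    counts `major`.
    """
    wins = losses = 0.0
    for site, won in games:
        if won:
            wins += minor if site == "home" else major
        else:
            losses += major if site == "home" else minor
    return wins / (wins + losses)


# Hypothetical record: 2-1 at home, 1-1 on the road.
games = [
    ("home", True), ("home", True), ("home", False),
    ("road", True), ("road", False),
]

print(f"weighted (0.6/1.4):   {weighted_wp(games):.3f}")
print(f"unweighted (1.0/1.0): {weighted_wp(games, 1.0, 1.0):.3f}")
```

Passing different values for `minor` and `major` is all it takes to try other weightings.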

There are a couple of potential problems with this approach. First, the RPI formula applies this weighting only to the winning percentage calculation of the team being rated. It is not used in calculating the opponents' winning percentage (OWP) or the opponents' opponents' winning percentage (OOWP). So the OWP and OOWP are potentially biased by the HCA. Second, the weighting chosen (0.6/1.4) doesn't appear to reflect the actual home court advantage, which is closer to 30% than 40%.

Let's see if changing the weightings in the calculation of WP to 0.7/1.3 (closer to the HCA I measured) results in any improvement. Making this change and testing gives this result:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI	73.2%	11.62
RPI (0.7/1.3)	73.4%	11.58

A very slight improvement. This is not too surprising -- this change only has a small effect on Winning Percentage, which is only 25% of a team's RPI. Perhaps the NCAA's approach to HCA doesn't have much impact at all?

I should have learned my lesson last time, so let's pause a moment to run a test to make sure that the HCA really is a problem. To do this, we'll run a quick experiment using no weighting (i.e., 1.0/1.0) to see how much improvement this approach to HCA is actually providing. Performance with no weighting gives these results:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI	73.2%	11.62
RPI (0.7/1.3)	73.4%	11.58
RPI (1.0/1.0)	74.6%	11.53

Surprise! The NCAA's correction for the home court advantage seems to have actually made the RPI's performance worse. Hence the value of testing everything -- sometimes intuitively correct notions turn out to be incorrect. Further experimenting with a variety of weightings confirms that the unweighted RPI actually performs better than any weighted variety.

A different approach to accounting for HCA is to adjust game outcomes using the HCA in points. That is, we'll subtract the HCA (say, 4.5 points) from the home team before determining who "won" the game. So when Duke wins by 3 at home against North Carolina, that game will count as a win for North Carolina when calculating RPI. (And we'll carry this through all levels of the RPI calculation to avoid that possible shortcoming.) This is the equivalent of moving every game to a neutral site (but cheaper).
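In code, the adjustment is a one-liner. The function name and the 73-70 score below are made up for illustration; only the 4.5 point HCA and the "Duke by 3 at home" scenario come from the text.

```python
def neutral_site_winner(home, away, home_score, away_score, hca=4.5):
    """Decide who "won" after subtracting the HCA from the home team.

    With a fractional HCA such as 4.5 the adjusted margin is never zero,
    so there is no tie case to worry about.
    """
    adjusted_margin = (home_score - away_score) - hca
    return home if adjusted_margin > 0 else away


# Duke wins by 3 at home (hypothetical score), but the adjusted margin
# is -1.5, so the game is credited to North Carolina.
winner = neutral_site_winner("Duke", "North Carolina", 73, 70)
print(winner)
```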

That change provides this performance:

Predictor	% Correct	MOV Error
1-Bit	62.6%	14.17
RPI (unweighted)	74.6%	11.53
RPI (HCA=4.5)	73.6%	11.61

This is not an improvement over RPI with no weighting. Experiments with other values for the HCA also do not improve performance (HCA=3.5 does the best, though). So it appears that the RPI does not benefit from adjustments to eliminate the HCA -- a somewhat surprising result!

My vague intuition about this result is that the HCA is essentially "washed out" of the RPI because the majority of teams play home-and-home series within their conferences. So any home advantage is gained equally by every team, and any attempt to compensate within the RPI formula just adds error.

We'll return in a bit to an alternative approach for HCA suggested by Dick Vitale. But since HCA doesn't appear to be a significant problem, first we'll detour into a couple of other possible improvements to RPI.

Thursday, April 14, 2011

The Ratings Percentage Index (RPI)

As mentioned previously, one of the simplest and most accessible pieces of information that we can use for prediction is a team's won-loss record. Naively, we might suppose that when Team A with a 6-2 record plays Team B with a 2-6 record, Team A will likely beat Team B.

But there's (at least) one significant problem with this supposition: we don't have any idea how Team A compiled its winning record, or Team B its losing record. It could well be that Team A is a Big East team that played 8 patsies at home, while Team B is a mid-major team that has played 8 road games against the best teams in the country, and managed wins against both Duke and Kansas. In that case we wouldn't be so certain that Team A could beat Team B.

The most widely known rating based on won-loss records is the Ratings Percentage Index (RPI). RPI tries to address the shortcoming of using won-loss records by rating each team not only by its winning percentage, but also by the winning percentages of its opponents. The assumption here is that opponents with good won-loss records are tougher opposition than those with poor records, so we should value wins (and losses!) against those opponents more highly.

Of course, you can extend this reasoning another level. A team's opponents have winning records -- so what? Again, we don't know if they compiled those records by playing good teams or bad teams.

And, in fact, the RPI addresses this concern by extending the rating another level, so that the RPI for a team is based upon:

The team's winning percentage (WP)
The team's opponents' winning percentage (OWP), and
The team's opponents' opponents' winning percentage (OOWP)

The RPI stops at this level, possibly because the NCAA had run out of the letter 'O'.

Previously we noted the significant impact of the home court advantage (HCA) on college basketball games. The RPI accounts for this, too, by weighting a team's home wins less than its road wins, and its road losses less than its home losses. The exact calculation of RPI is complicated, and I refer the interested reader to the Wikipedia article for a more detailed explanation. Studying that explanation for several days should lead to total enlightenment -- regarding RPI, anyway.
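For a feel of how the three levels fit together, here is a deliberately simplified sketch in Python. It is not the official calculation: it ignores the home/road weighting, and it does not remove games against the team being rated when computing its opponents' winning percentages. The `simple_rpi` function and the three-team schedule are hypothetical.

```python
from collections import defaultdict


def simple_rpi(games):
    """Simplified, unweighted RPI. games: list of (winner, loser) pairs."""
    wins = defaultdict(int)
    played = defaultdict(int)
    opponents = defaultdict(list)
    for winner, loser in games:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)

    # WP: the team's own winning percentage.
    wp = {t: wins[t] / played[t] for t in played}
    # OWP: average of the opponents' WPs (an "average of averages").
    owp = {t: sum(wp[o] for o in opponents[t]) / len(opponents[t])
           for t in played}
    # OOWP: average of the opponents' OWPs.
    oowp = {t: sum(owp[o] for o in opponents[t]) / len(opponents[t])
            for t in played}

    return {t: 0.25 * wp[t] + 0.50 * owp[t] + 0.25 * oowp[t]
            for t in played}


# Hypothetical round robin: A beats B and C, B beats C.
ratings = simple_rpi([("A", "B"), ("A", "C"), ("B", "C")])
for team, rpi in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {rpi:.3f}")
```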

So how effective is RPI as a predictor? Using my standard methodology, I get this performance:

Predictor	% Correct	MOV Error
Naive	50%	14.5
1-Bit	62.6%	14.17
RPI	73.2%	11.62

which shows a significant improvement over the 1-Bit Predictor (+11% correct, -2.5 points error).

The following plot shows the RPI characteristics of the test data:

[Plot: RPI Characteristics]

In this plot, each point represents a game. The Y axis is the RPI of the home team, and the X axis is the RPI of the away team. The color of each point indicates the winner of the game -- red for a home win, blue for an away win. The diagonal line splits the field into games where the home team had the higher RPI (above the line) and games where the away team had the higher RPI (below the line). While there are more blue points below the line and more red points above the line, the correlation is not overwhelming.

But wait. I tricked you a bit back in the second paragraph of this posting, when I claimed that the won-loss record of a team isn't a good predictor because it doesn't take into account the quality of opponents. Is that true? I've always heard that claim, and it seems reasonable. But perhaps we should take a minute to check it. If we use winning percentages as inputs for a predictor, we get the following performance:

Predictor	% Correct	MOV Error
Naive	50%	14.5
1-Bit	62.6%	14.17
WP	72.4%	11.65
RPI	73.2%	11.62

Interesting! RPI is a better predictor than just the winning percentage, but not by a huge margin.

So RPI provides a significant improvement in prediction over the 1-Bit Predictor. But there are several obvious shortcomings in RPI. Can it be improved? In the next few postings I'll examine the various shortcomings in RPI and perform various experiments to see if addressing these shortcomings improves performance. Eventually, we'll also consider other schemes that make use of only won-loss records and see if those provide any significant advantage over RPI.

Testing Methodology

We'll take a slight detour for this posting to discuss my testing methodology for these RPI experiments.

To begin with, I produce RPI values for both the home team and the away team for every game in the 2009, 2010 and 2011 regular seasons. (I eliminate the last 150 games of each data set to eliminate the Tournament games, NIT games, and the conference tournaments.) I also eliminate the first 1000 games in each season so that the RPI values are based on at least 5 games for each team. I also eliminate any non-Div I games.

For each game, the RPI value for both teams is calculated based upon all the previous games for that season.

The resulting data is fed into a RapidMiner process which calculates a Margin of Victory (MOV) for each game. A test set of 500 games is then split off. (The test set is chosen randomly, but is the same for every predictor tested.) The remaining games (approx. 10,000) are then used to train a linear regression using the MOV as the label.

The resulting linear regression is then applied to the test set of 500 games, and scored for RMSE and correctness of prediction.
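The scoring step can be sketched in a few lines of Python. This is not the RapidMiner process itself, just an illustration of the two metrics; the `score_predictions` name and the four sample margins are hypothetical, and I'm assuming margins are expressed from the home team's perspective.

```python
import math


def score_predictions(predicted_mov, actual_mov):
    """Return (fraction of winners picked correctly, RMSE of the MOV).

    Both lists hold margins of victory from the home team's perspective,
    so a positive value means a home win.
    """
    pairs = list(zip(predicted_mov, actual_mov))
    correct = sum(1 for p, a in pairs if (p > 0) == (a > 0))
    rmse = math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))
    return correct / len(pairs), rmse


# Hypothetical test set: always predict the home team by 4.5.
predicted = [4.5, 4.5, 4.5, 4.5]
actual = [10, -3, 2, 7]

pct_correct, mov_error = score_predictions(predicted, actual)
print(f"% Correct: {pct_correct:.1%}, MOV Error: {mov_error:.2f}")
```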

More sophisticated regression models are available (e.g., neural networks, polynomial regression, etc.) but experimenting with the various possibilities showed that none of the more sophisticated models produced better results than a linear regression. (This is not too surprising, given that the input data is just the two RPI values.)

Wednesday, April 13, 2011

Sources of Information

So we want to predict the outcome of a college basketball game. Presumably we are going to base this prediction on (primarily) the previous performance of the two teams. (Methods based on astrological signs and comparison of mascots have typically performed poorly.) So what information on past performance can we use to drive our prediction?

The simplest and most fundamental information is the team's won-loss record. This has been the basis of a number of different rating systems that purport to determine whether one team is better than another (and in some cases, how much better). Best known of these rating systems is the Ratings Percentage Index (RPI), which plays a key role in the selection of teams to the annual NCAA tournament.

At a slightly more complex level, we can look at margin of victory. Instead of using only the fact that Team A beat Team B, we can add in the magnitude of the victory. We presume that margin of victory is a reasonable proxy for the relative strengths of the two teams. If Team A beats Team B by 24 points, while Team C beats Team B by only 2 points, we believe that Team A is stronger than Team C.

Finally, we can delve even deeper into the statistics of past games, looking at things such as offensive efficiency, rebounding, steals, etc. We can look at the statistics for individual players. We can also look at situational factors -- where was the game played? How long has it been since each team's last game? And so on.

Of course, there are problems and challenges with this information. Looking only at won-loss obscures a host of obviously important factors (e.g., who was the home team). Much of the information available may have no predictive value (e.g., knowing which team is the better offensive rebounding team may not help us predict the outcome of the game). So part of the challenge of building an effective predictor will be winnowing through the available information and gleaning the valuable bits.

We'll get started on that in the next posting.

As Good As We Can Do

The 1-Bit Predictor gives us a useful lower bound for prediction performance. Let's turn now to the other end: What's the theoretical "best performance" we can hope to achieve? Comparing RPI, Massey, Sagarin and LMRC predictions over six seasons of tournament games, [Sokol 2006] found performances in the 70-75% range. How much can we hope to improve that number?

We can think of a college basketball game as having both a deterministic and a random component. If the random component is zero, then there would be no variability in outcome -- every time two teams matched up (all other things being equal) the same result would occur. If the deterministic component is zero, then results would be completely random. Reality obviously lies somewhere in-between those two extremes.

By definition, there's no way to predict the random component of the outcome. If we assume that we can predict the deterministic component perfectly, our performance is then limited by the magnitude of the random component. So what is the magnitude of the random component?

There are a couple of different ways to explore this question. One thought experiment is to imagine a game in which there is no random component except in the last possession of the game, which is completely random. Intuitively, that seems much less random than reality, so it provides a lower bound on estimating the magnitude of the random component. So how would that affect the final outcome?

On the last possession, the team with the ball can score 2 or 3 points (or even 4 points), or might turn the ball over leading to the other team scoring -- a potential swing of 6 or more points. So in this case, if we could predict the deterministic component of the game perfectly, we'd still have an average error of 3+ points.

Of course, in reality the last possession isn't entirely random. But more importantly, the first 120 possessions aren't entirely deterministic! This suggests that the best performance we can hope for is going to be significantly worse than +/- 3 points.

Another method to gain insight into this question is to look at repeat matchups of teams. Home and home conference matchups along with conference tournament matchups provide a lot of data that can be used to estimate the variability in college basketball games. For example, in the course of a month in 2011, Duke and UNC played home-and-home and an ACC tournament game with the following results (margin from Duke's perspective):

Site	Result
@Duke	+6
@UNC	-14
@Neutral	+17

This shows an enormous amount of variability. Of course, there's a systemic bias in these numbers -- the home court advantage. Sagarin estimates that at about 4 points for 2011. If we factor that out, the results are:

Site	Result
@Duke	+2
@UNC	-10
@Neutral	+17

which still suggests double-digit variability in game outcomes. Looking at all the home-and-home matchups for a season, [Sokol 2006] found that a team had to win by 21 points at home to have an even chance to win on the road. Part of that margin is due to home court advantage, but since most estimates of HCA are in the 4 point range, the rest of the margin is probably required to "overcome" significant variability.
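The Duke-UNC adjustment is simple enough to check by hand, but here it is as a short Python sketch. The `remove_hca` helper is my own naming; the margins and the 4 point HCA are the ones quoted above.

```python
HCA = 4  # Sagarin's approximate home court advantage for 2011, in points

# Margins from Duke's perspective.
results = [("@Duke", 6), ("@UNC", -14), ("@Neutral", 17)]


def remove_hca(site, margin, hca=HCA):
    """Shift the margin to what it would have been on a neutral court."""
    if site == "@Duke":
        return margin - hca  # Duke was helped by playing at home
    if site == "@UNC":
        return margin + hca  # Duke was hurt by playing on the road
    return margin            # neutral site: nothing to remove


adjusted = [remove_hca(site, margin) for site, margin in results]
spread = max(adjusted) - min(adjusted)
print(adjusted, spread)
```

Even with the home court advantage factored out, the best and worst results for Duke are 27 points apart.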

These sorts of analyses suggest that the random component in game outcomes is at least +/- 8 points. So what does that say about trying to predict the outcomes of college basketball games?

Looking at 185 tournament games from 2009 & 2010 (both NCAA and NIT), the average margin of victory was about 11 points. 40% of the games were decided by 8 points or less. If we look at just our first performance metric (picking the correct winner), we need only get the outcome correct (not the final margin). A predictor that accounts perfectly for everything except (say) 8 points of variability would get 60% of the predictions correct along with some portion (say 65%) of the remaining 40% -- for a final performance of ~85%.
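The back-of-the-envelope arithmetic looks like this in Python. The 65% hit rate on close games is the guess from the paragraph above, not a measured value, and the variable names are mine.

```python
close_share = 0.40     # games decided by 8 points or less
hit_rate_close = 0.65  # assumed success rate on those close games

# Games decided by more than 8 points are all assumed to be picked correctly.
upper_bound = (1 - close_share) * 1.0 + close_share * hit_rate_close
print(f"estimated ceiling: {upper_bound:.0%}")
```

Changing `hit_rate_close` shows how sensitive the ceiling is to that guess: at a coin-flip 50% on close games it drops to 80%.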

In reality, of course, our predictor won't be perfect on the deterministic component of games, either. Taken all together, this suggests that a realistic upper limit for picking the correct winner of a game is in the 70-80% range. Since [Sokol 2006] showed that RPI and other schemes are already predicting in the lower part of this range, our progress is likely to be very incremental. Improvements of 1% will be significant progress!

On our other performance metric (predicting the MOV) the story may be better, but that's an analysis for another day.

Tuesday, April 12, 2011

The 1-Bit Predictor

In the previous posting, we established a bottom threshold for predictor performance: 50% correct and about 14.5 points error in the MOV. Of course, that "predictor" doesn't use any information at all about the game, so it's mostly a curiosity. It's more interesting to ask how well we can do with a very simple predictor that actually uses information about the game. It turns out that with the smallest amount of extra information we can do much better.

Information Theory wonks define the smallest unit of information as a bit -- essentially the answer to one yes-no question. (Disclaimer: yes, I know the real definition is more complex :-) Let us suppose, then, that we can get the answer to one yes-no question about a basketball game. What question should we ask, and how much can we improve our prediction based upon that information?

Basketball aficionados will know that "home court advantage" (HCA) is a major factor in college basketball. We will examine HCA in more detail soon, but for now let us use our one bit of information to determine the home team. How much does knowing the answer to that question improve our prediction?

It turns out that in college basketball, the home team wins an astonishing 66% of the games, and outscores the visiting team by an average of 4.5 points. We can use this information to create our "1-Bit Predictor":

The home team will win by 4.5 points.

This predictor gets 66% of its games correct with an error of about 13.5 points over all games in the 2009-2011 seasons. So with that one bit of information we've improved our predictor by 32% on one performance metric! (But only by about 7% on the other metric, which will prove a tougher nut to crack.)

Predictor	% Correct	MOV Error
Naive	50%	14.5
1-Bit	62.6%	14.17

(For comparison purposes to later predictors, the performance I show in this table corresponds to a standard testing methodology, to be explained shortly.)
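For completeness, here is the 1-Bit Predictor as a Python sketch, along with the arithmetic behind the "32%" and "7%" improvements quoted above. The function names are my own, and the team names in the example call are arbitrary.

```python
def one_bit_predict(home_team, away_team):
    """Always pick the home team to win by 4.5 points."""
    return home_team, 4.5


def relative_improvement(baseline, new, lower_is_better=False):
    """Fractional improvement of `new` over `baseline`."""
    if lower_is_better:
        return (baseline - new) / baseline
    return (new - baseline) / baseline


pick, margin = one_bit_predict("Duke", "Wake Forest")

# All-games figures from the text: 50% -> 66% correct, 14.5 -> 13.5 error.
correct_gain = relative_improvement(0.50, 0.66)
error_gain = relative_improvement(14.5, 13.5, lower_is_better=True)

print(pick, margin)
print(f"% Correct improvement: {correct_gain:.0%}")
print(f"MOV Error improvement: {error_gain:.0%}")
```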

So the "1-Bit Predictor" provides a reasonable lower bound on prediction performance. This may seem like a trivial result, but consider that [Sokol 2006] compared four rating systems and the Las Vegas betting line over six seasons of tournament games and found that they picked the correct winner in 70-75% of the games. The 1-Bit Predictor is already within a few percentage points of these much more sophisticated systems. If nothing else, this suggests that improving prediction performance is going to be a difficult task.

Having established a reasonable lower bound, the obvious next question is "What is a reasonable upper bound for prediction performance?" That is the topic of the next posting.

Monday, April 11, 2011

You can't improve what you don't measure

My day job involves a lot of process improvement work, and one of our catch-phrases is "You can't improve what you don't measure." The oft-unstated corollary is that what you choose to measure determines what you'll improve. Measure the wrong thing and you'll find yourself optimizing the wrong thing. In our quest to create a good basketball predictor, what should we take as our measure of performance?

The Predictive Analytics Challenge (as well as countless office pools) takes as a measure accuracy in predicting the NCAA tournament. That makes for an interesting challenge (and interesting office discussions) but has a few problems as a metric. First, the sample size for testing is rather small -- only 63 (or so) games a year. Second, picking all the games before any have been played and scoring different rounds with different values introduces a host of strategic complications. Finally, unlike 95% of the college basketball games, the tournament is played on a neutral court.

For these reasons, I prefer to measure the predictor against individual regular season games. Obviously I'll also use it to try to predict the Tournament -- I just won't measure its performance against Tournament games.

So how should we measure the performance of our predictor? The obvious (and simplest) measure is whether it predicts the correct outcome. That's a good metric, but it does have some flaws. For one thing, predicting the correct outcome of many games is trivial. When Duke plays Wake Forest, it isn't too difficult to predict with some confidence that Duke will win. Secondly, it's really only useful for entering Tournament contests.

A second measure we can use is to try to predict the Margin of Victory (MOV) and measure how close we got. This measure makes predicting the Duke-Wake Forest matchup more interesting -- Duke is very likely to win, but by how much? It's also useful if we want to match our predictor against the Las Vegas bookmakers, who release a "line" on every game that represents their best prediction for Margin of Victory. Given the strong financial motivation the bookmakers have to be good predictors, they should be a good test of our predictor.

(Strictly speaking, the bookmakers may not set the line to their best prediction of MOV. They may set or move the line to equalize betting on both sides of the game to minimize their financial risk.)

I will use both measures to assess the performance of the predictor. I've assessed a number of prediction models with both metrics, and it's almost always the case that optimizing one measure tends to optimize the other. In some cases that may not be true, and I'll rather arbitrarily weigh the trade-off and pick one over the other.

Now that we've established our metrics of performance, let's think about how good our predictor can be. Actually, let's start off by thinking about how bad our predictor can be.

If we know absolutely nothing about a game and randomly choose one of the two teams to win, we will predict the correct team 50% of the time. And, as it happens, the average MOV is about 15 points. So that sets a lower bound on prediction:

Predictor	% Correct	MOV Error
Naive	50%	14.5

(If you have a predictor that does worse than that, take the opposite of its predictions and you'll have a better predictor :-)

Interestingly, with a tiny bit more information (and I mean that literally), we can do much better. That's a topic for the next posting.

Sunday, April 10, 2011

Introduction

As a lifelong college basketball fan and a student of Artificial Intelligence (AI), I was intrigued in 2010 when I saw Danny Tarlow's call for participants in a tournament prediction contest. I put together a program and managed to tie Danny in the Sweet Sixteen bracket.

The program I wrote in 2010 used a genetic algorithm to evolve a scoring equation based upon features such as RPI, strength of schedule, wins and losses, etc., and selected the equation that did the best job of predicting the same outcome as the games in the training set. I felt the key in winning a tournament picking contest was in guessing the upsets, so I added some features to the prediction model intended to identify and pick likely upsets.

After the tournament ended, I remained intrigued by the problem of predicting basketball games, and continued to work on my program. In 2011 I did a little better, winning both brackets in Danny's contest. During the end of the college basketball regular season, I also tested my program against the Bodog lines, and showed a significant profit over about six weeks of betting.

In the final days before the tournament this year, I discovered some glaring problems in my picking program (and my testing methodology). I wasn't able to address those problems before the deadline for Danny's contest this year, so I am in the process of re-writing my picking program and examining the various issues and possible approaches.

Shortly we'll begin an in-depth look at the "Ratings Percentage Index" (aka RPI), but first it will be helpful to discuss a few broader issues, such as the limits on prediction.