Wednesday, October 31, 2012

Recent Papers

I took some time out recently to read through some of the basketball prediction papers from this year's MIT Sloan Sports Analytic Conference.  Here are some thoughts...
Insights from the LRMC Method for NCAA Tournament Prediction
Mark Brown, Paul Kvam, George Nemhauser, Joel Sokol
MIT Sloan Sports Analytics Conference 2012

The latest paper from the LRMC researchers compares the performance of LRMC to over 100 other ranking systems as reported by Massey here.  The measure of performance used is correct prediction of the NCAA tournament games.  LRMC out-performs all of the other rankings, getting 75.5% correct over 9 years.  The next best predictor did 73.5%.  (I don't optimize my predictor on this metric, but it also gets about 73.5% correct.)

The LRMC work is always interesting and well done.  A couple of notes that pop to mind:

(1) The advantage LRMC has over the other models is not huge.  LRMC gets 75.5% correct; the 20th ranked model gets about 72% correct -- the difference is about 3 games per tournament.  That's certainly significant, but in a test set of only 600 games, it may not be that significant.  One very good year (or one very bad year) could move a rating significantly.  It would be interesting to see the year-to-year performance of the ratings, but the authors don't provide that information.

(2) The authors assume there is no home court advantage (HCA) in the NCAA tournament and simply predict that the higher-rated team will win.  In my testing, including an HCA for the higher-seeded team improves prediction performance.  For example, this paper reports the performance of RPI as about 70% in predicting tournament games.  In my testing, RPI with HCA predicted about 73% correctly.  So the results may be skewed depending upon how much effect HCA has on each prediction model.  (The authors don't use HCA for LRMC, so that model might do better as well.)

(3) In this paper, the authors test against all the matchups that actually occurred in the tournament -- that is, they do not "fill out a bracket" and commit to game predictions at the beginning of the tournament.  In 2011, LRMC was included in the March Madness Algorithm Challenge and finished quite poorly -- outscored by all but three of the other entrants.   (A similar result can be seen here.)  Taking a look at the LRMC bracket for 2012 (here), LRMC got 22 correct picks out of the initial 36 games -- and got only one of the three play-in games correct, missed all of the upsets, and predicted two upsets that did not occur.  Eight of the entries in the algorithm challenge picked more first-round games correctly.  In fact, LRMC's only correct predictions in the entire tournament were higher seeds over lower seeds.  And once again it would have lost the algorithms challenge.

(4) My own attempts to implement LRMC and use it to predict MOV (found here) have performed more poorly (around 72%) than the authors report in this paper.  It may be that my implementation of LRMC was faulty, or that LRMC happened to perform slightly worse on my test data than on the tournament games used in this paper.

Moving on from the performance of LRMC, there are a couple of other interesting results in this paper.  One is that  home court advantage does not vary substantially from team to team.  This confirms my own experiments.  (I don't think I've reported on those experiments -- perhaps I'll write them up.)  A second is that the natural variance in games is around 11 points, which matches closely what I've found.  The last is that the authors found that the cliche "good teams win close games" doesn't seem to have any validity.

Can Statistical Models Out-predict Human Judgment?: Comparing Statistical Models to the NCAA Selection Committee
Luke Stanke
MIT Sloan Sports Analytics Conference 2012
As with the LRMC paper, this paper looks at predicting tournament game outcomes. In this case, the author compares the NCAA committee seedings and RPI ratings to four different Bradley-Terry models.  For more on Bradley-Terry models, see here.

Stanke reports the results of testing these models against (approximately) the same games used in the LRMC paper:

The highest performing models are the Bradley-Terry models using only win/loss data. These two models correctly predicted approximately 89% of games in the NCAA tournament games from the past eight seasons. The next group of models is the Bradley-Terry Models using points a method for ranking teams. These models predicted over 82% of games correctly. The third group is the alternative models, the Committee Model, The RPI Model, and the Winning Percentage Model. These models range from 69.1% of games correctly picked to 72.9% of games correctly picked.
This is certainly an interesting result -- particularly in light of the claims of the LRMC paper.  According to the LRMC authors, LRMC's 75.5% success rate out-performed over 100 other rankings from Massey's page, and the Vegas line's success rate of ~77% is an upper-bound to performance. 

So what explains this disparity?  I didn't know -- so I sent off an email to the author.   Luke Stanke emailed me to say that the result was caused by a coding error, and that actual performance was around 72%.   (I know all about coding errors... :-)  So his results here are in line with the expected performance for Bradley-Terry type rating systems.  His conclusion remains unchanged -- that computer rating systems are better than the committee at selecting and seeding the tournament, and Bradley-Terry would be better than RPI.  I won't disagree with either conclusion! :-)

Using Cumulative Win Probabilities to Predict NCAA Basketball Performance
Mark Bashuk
MIT Sloan Sports Analytics Conference 2012
Bashuk lists his affiliation as "RaceTrac Petroleum" so like me he appears to be an interested amateur in game prediction.  In this paper he describes a system that uses play-by-play data to create "Cumulative Win Probabilities" for each team, and eventually, a rating.  He uses this rating to predict game outcomes, and for the 2011-2012 season correctly predicts 72.6%.  In comparison, Pomeroy predicts 77.7% correctly and the Vegas Opening Line 75.2%.

It is unclear to me after reading the paper exactly how CWP and the ratings are calculated.  However, unlike most authors, Bashuk has made his code available on the Web.  (URLs are provided in Appendix 1 of the paper.)  This is very welcome to anyone trying to reproduce his results.  Unfortunately for me, Bashuk's code is in SQL, which I don't understand well.  So poring through it and understanding his process may take some time.

Thursday, October 25, 2012

A Detour into RapidMiner

A chunk of visitors to this blog find it looking for RapidMiner, so I thought I'd take a detour to explain the RapidMiner process I'm using to explore early season performance.  This RapidMiner process uses training data to build a model, applies the model to separate test data, and then measures performance.  This is something of a sequel to the post I did for Danny Tarlow over at his blog.  Hopefully it will be useful to some folks as an example of how to put together a more complex RapidMiner process, as well as how to apply a model to test data, which wasn't covered in the previous post.

(Reminder: RapidMiner is a free data-mining tool that you can download here.)

The (unreadable) graphic above illustrates the entire process.  There are four parts to this process.  In Process 1, the training data is read in and processed.  In Process 2, the test data is read in and processed.  In Process 3, the training data is used to build a linear regression and then the model from that regression is applied to the test data.  In Process 4, the results are processed and performance measures calculated.  I'll now go into each process in detail.

The graphic above shows Process 1 in more detail. It's a straightforward linear flow starting at the upper left and ending at the lower right.  The steps are:
  1. Read CSV -- This operator reads in the training data, which is simply a large text file in comma-separated value (CSV) format, with one line for every game (record) in our training data set.
  2. Generate ID -- This operator adds a unique ID attribute to every record in our training data set.  (We'll see later why it is useful to have a unique ID on every record in the data set.)
  3. Rename by Replacing -- This operator is used to rename attributes in the data set.  In this case, I use it to replace every occurrence of a dash (-) with an underscore ( _ ).  Dashes in attribute names are problematic when you do arithmetic on the attributes, because they get mistaken for minus signs.
  4. Generate Attributes -- This operator generates new attributes based on the existing attributes.  In this case, I calculate a new attribute called "mov" (Margin of Victory) by subtracting the visiting team's score from the home team's score.
  5. Set Role -- Most attributes are "regular" but some have special roles.  For example, the ID attribute generated in step 2 has the "id" role.  Here I use the Set Role operator to set the role of the "mov" attribute to "label."  This role identifies the attribute that we are trying to predict.
  6. Read Constructions -- You can use the Generate Attributes operator to generate new attributes, but that's not convenient if you want to generate a lot of new attributes, or if you want to generate new attributes based on some external inputs.  In my case, I have generated and tested many derived statistics, and entering them manually into "Generate Attributes" was not feasible.  The Read Constructions operator reads formulas to generate new attributes from a file and creates them in the data set.  Using this, I was able to have a Lisp program create a (long) list of derived statistics to test, write them to a file, and then have the RapidMiner process construct them automatically.
  7. Replace Missing Values -- This is the first of several data cleanup operators.  There shouldn't be any missing values in my data sets, but if there is, this operator replaces the missing value with the average over the rest of the data.
  8. Replace Infinite Values -- Some of the constructions in Step 6 can result in "infinite" values if (for example) they cause a divide by zero.  These two operators replace positive infinite values with 250 and negative infinite values with -250.
  9. Select Attributes -- The last operator in this process removes some attributes from our data.  In particular, we don't want to leave the scores in the data -- because the predictive model will (rightfully) use those to predict the MOV.  (The MOV itself is not a problem, because it has the "label" role.)  We also remove a couple of other attributes (like the team names) that would cause other problems.
So at the end of this process, we have read in the training data, processed it to contain all the attributes we want and none that we don't want, and cleaned up any inconsistencies in the data.

Process 2 is exactly the same as Process 1, except it is applied to the test data.  It's important to ensure that both the training data and the test data are treated identically.  If they aren't, you'll get misleading results or cause an error later in the process.  (I should point out that RapidMiner can bundle up a process into a sub-process and reuse it in multiple places, and that's probably what I should do here.)

The graphic above shows Process 3.  I've left in the "Select Attributes" from the end of Process 1 and Process 2 for context.  Here are the steps in Process 3:
  1. Linear Regression -- RapidMiner offers a wide variety of classification models that can be used for prediction.  In this case we're using a linear regression.  The input to this operator is the training data, and the output is a model.  This operator trains itself to predict the "label" attribute (MOV in our case) from the regular attributes.  The model it produces is a linear equation based upon the regular attributes.  In my process here, I'm training the model every time I run the process.  It's also possible to train the model once, save it, and re-use it every time you want to test or predict.  In my case, I tweak the data and/or process almost continuously, so it's easiest just to re-train every time.  There are about 15K records in the training data set, and the Linear Regression takes a couple of minutes on my laptop.  Other classification operators are much slower, and re-training each time is not feasible.
  2. Apply Model -- This operator applies the model from step 1 to the testing data from Process 2 and "labels" it -- that is, it adds a "prediction(mov)" attribute that has a predicted Margin of Victory for the game.
  3. Join -- This operator "joins" two data sets.  To do this, it finds records in the two data sets that have the same ID and then merges the attributes into a single record.  (Now we see why we need a unique ID!)  The two data sets being merged here are (1) the labeled data from the model, and (2) the original data from the Select Attributes operator.  Recall that the Select Attributes operator is used to remove unwanted attributes from the data, including the team names and scores.  So the labeled data coming out of the model does not have that information.  However, to evaluate our predictive performance we need the scores (so we can compare the actual outcome to the predicted outcome) and it would be nice to have team names and dates on the data as well.  So this Join operator puts those attributes back into our data.  In general, this is a useful technique for temporarily removing attributes from a data set.
At this point, we have a data set which consists of our test data with an added attribute "prediction(mov)"containing the predicted margin of victory for each game.  Next we want to see how well our model performed.

The graphic above shows Process 4.  I've left in the "Join" from Process 3 to make it clear where it connects.  Here are the steps to Process 4:
  1. Rename -- The first step is to rename the "prediction(mov)" attribute to "pred".  The parentheses in this name can confuse some later processing, so it's best just to remove them.
  2. Generate Attributes -- Next we generate a new attribute called "correct".  This attribute is 1 if we've correctly predicted the winner of the game, and 0 if not.  RapidMiner provides a powerful syntax for defining new attributes.  In this case, "correct" is defined as "if(sgn(mov)==sgn(pred),1,0)" -- if the sign of our predicted MOV is the same as the sign of the actual MOV, then we correctly predicted the winner.
  3. Write Excel -- At this point, I save the results to an Excel spreadsheet for later reference and processing (e.g., to produce the graphs seen here).
  4. Multiply -- I like to look at two different measures of performance, so I create two copies of the test data.  This isn't strictly necessary in this case (I could chain the two Performance operators) but this is another example of a useful general technique.
  5. Performance -- RapidMiner provides a powerful operator for measuring performance that can assess many different measures of error and correlation.  In the top use of Performance in this process, I use the built-in "root mean squared error" measure and apply it to the predicted MOV to calculate the RMSE error.
  6. Aggregate / Performance -- The second measure of performance I like to look at is how often I predicted the correct winner.  (I might prefer a model that predicts the correct winner more often even if it increases the RMSE of the predicted MOV.)  I want to know this number over the entire data set, so the first step is to Aggregate the "correct" attribute.  This produces a new attribute "sum(correct)" which is the number of correct predictions over the whole data set (and has the same value for every record in the data set).  This is then reported by the Performance operator as a performance measure.  The Performance operator isn't strictly necessary in this situation -- I could just report out the "sum(correct)" value -- but in general marking this as a measure of performance allows me to (for example) use the value to drive an optimization process (e.g., selecting a subset of attributes that maximizes the number of correct predictions).
And that's "all" there is to it.  One of the advantages of RapidMiner is that the graphical interface for building processes let's you quickly lay out a process, as well as easily modify it (such as switching the Linear Regression to an SVM, for example). 

Wednesday, October 24, 2012

A Closer Look at Early Season Prediction Performance

In the previous post, I looked at predicting early season games using my standard predictive model and found that performance was (understandably) much worse for the early season games where teams had no history of performance than in late season games, where we had the whole season's history to help guide the prediction.  I also looked at using the previous season's games to "prime the pump" and found that improved performance considerably.  In this post, I'll take a closer look at those two cases.

The graph above plots the prediction error for a moving twenty game window throughout the first 1000 games of the season.  (Note #1: The twenty game window is arbitrary -- but the data looks the same for other window sizes.  Note #2: This drops the first game for every team.  The model predicts a visiting team win by 224 points for those games, which greatly distorts the data.)  The green line is a linear regression to the data.  The prediction error starts out high (15+) and drops steadily throughout the 1000 games until at the end, it is close to the performance of the model for the rest of the season.

(There are some interesting aspects to this graph.  For example, much of the error seems to be driven by a few games.  For example, the peak at around 225 games is driven largely by two matchups: Georgetown vs. NC Greensborough and Colorado State vs. SMU.  In both cases, the predictor has an unrealistic estimate of the strength of one or both of the teams.  So it might be that we could greatly improve prediction by identifying those sorts of games and applying some correction.  A possible topic for another day.)

A logarithmic regression suggests that much of the error is eliminated during the first 500 games:

If nothing else, this plot suggests that even with no other measures, our predictions should be pretty good after about the 500th game.  Now let's take a look at a similar plot for predictions where the teams have been primed with the earlier season's games:

Huh!  The use of the previous season's games pins the predictive performance to about 12 RMSE.  It's easy to understand why.  The previous season's performance has decent predictive power -- certainly better than no data at all -- but swamps the current season's performance, preventing the predictor from improving.  Even by the end of the 1000 game period, most teams have only played 5 or 6 games.  The previous season's 30+ games simply out-weigh this season's games too much to let the performance improve.

We can plot the two trendlines to see where it stops paying off to use the primed data predictions:

The cutoff is around 800 games (if we include the first game for every team).  We can combine these two into a predictor that gradually switches over from one predictor to the other over the first 800 games.  That predicts games with about the same error rate as using the previous season's data -- the last 200 games are predicted better, but not enough to substantially move the average.

More to come.

(Incidentally, this is the 100th blog posting!)

Thursday, October 18, 2012

Early Season Predictions, Part 2

As mentioned previously, I'm using this time before the college basketball season gets going thinking about how to predict early season games.  In the early season, we're missing two elements needed for good predictions:  (1) a meaningful statistical description of each team, and (2) a model that uses those statistics to predict game outcomes.  By the end of the season we have both things - a good statistical characterization of each team as well as a model that has been trained on the season's outcomes.  So how do we replace those two elements in the early season?

Replacing the model turns out to be fairly easy, because the factors that determine whether teams win or lose don't change drastically from season to season.  When you try to predict the tournament games at the end of the season, a model trained on the previous season's games does nearly as well as a model trained on the current season's games.  Of course, if the current year happens to be the year when the NCAA introduces the 3 point shot, all bets are off.  Still, in my testing the best performing models are the ones trained on several previous years of data.  So in the early season we can expect the model from the previous season to perform well.

(You might argue that early season predictions could be more accurate with a model specifically trained for early season games.  There's some merit to this argument and I may look at this in the future.)

Replacing the team data is not so easy.  The problem here is that teams have played so few games (none at all for the first game of the season) that we don't have an accurate characterization of their strengths and weaknesses.  Even worse, many of the comparative statistics (like RPI) rely on teams having the same opponents to determine the relative strength of teams.  In the early season, the teams don't "connect up" and in some cases, play few or no strong opponents.  So how bad is it?  I tested it on games from the 2011-2012 season:

  Predictor    % Correct    MOV Error  
Late Season Prediction72.3%11.10
Early Season Prediction71.3%15.06

So, pretty bad.  It adds 4 points of error to our predictions.  Since we've been groveling to pick up a tenth of a point here and there, that's a lot!

The obvious proxy for the team data is to use the team data from the previous season.   Clearly this has problems -- in college basketball team performance is highly variable season to season -- but it's at least worth examining to see whether it does improve performance.  In this experiment, I used the entire previous season's data to "prime the pump" for the next season.  In effect, I treated the early season games as if they were being played by the previous year's team at the end of the previous season.  Here are the results:

  Predictor    % Correct    MOV Error  
Early Season 71.3%15.06
Early Season (w previous season) 75.5%12.18

A fairly significant improvement.  Is there anything we can do to improve the previous season's data as a proxy for this season?  We'll investigate some possibilities next time.

Thursday, October 11, 2012

2012-2013 Schedule Now Available

A quick FYI for the other college basketball predictors out there:  Yahoo Sports has now posted the schedule of upcoming games for next year.  As reported last time, the scores from past seasons remain broken, but at least the upcoming games are now available.

I've updated the Data page to include links to the 2012-2013 schedule as well as the current conference affiliations.