Wednesday, August 31, 2011

Prediction by Similarity

Well, I've been all over the map with my blog postings lately and this one is no exception.  I'm going to post today about what I call "prediction by similarity."  I don't know what it's really called by the machine learning community -- back in the day when I did artificial intelligence we would have called it case-based reasoning and for all I know that may still be perfectly fine nomenclature.

The basic idea (as it applies to college basketball) is that if I want to know how Illinois @ Michigan State is going to turn out, I can look at historical examples of similar games and see how they turned out.  Hopefully they'll be a good guide for predicting the Illinois @ Michigan State matchup.

The first challenge in this method is to find similar games.  The easy solution would be if Illinois had already played (say) five games at Michigan State this season.  Presumably those results would be an excellent guide to how the current game would turn out.  Of course, it's never the case that we have five previous matchups to look at.  Even when the two teams have played before, it's usually the other end of a home-and-home series, and because of the impact of home court advantage, it's hard to use even that.  On the other hand, there are 15K games in my corpus -- and about twice that if I include the 2006-2008 seasons as additional historical examples.  Out of 30,000 games, I ought to be able to find a few pretty similar to Illinois @ Michigan State.

About our only option for measuring the similarity of two games is to compare the season statistics (including things like ratings) of the four teams.  If the two home teams have very similar statistics, and the two visiting teams have very similar statistics, we might have some hope that the past game can be used to predict the current game.  So what statistics are important?  I have no idea (and as far as I know, no one has looked at the question) but we can make some educated guesses.  The relative strengths of the teams is probably important -- that is, we want to look at past games where the home team was about as strong as MSU and the visiting team as strong as Illinois.  The absolute strengths might be important, too.  If we use (say) TrueSkill as our strength rating, a matchup between an 850 team and an 800 team might be quite different than a matchup between a 450 team and a 400 team.  (Or maybe not.)  Pace of play might be important -- a fast 850 team playing a fast 800 team might have a different result than a fast 850 team playing a slow 800 team. In fact, I can probably make an argument for just about any statistic as being important.

As usual, I intend to compensate for my lack of knowledge with persistence and processing power.  I'll try different combinations until I find one that works well or convince myself that further search won't be fruitful.  Unfortunately, this search will take a lot more processing power.  By the end of the 2011 season, I have to rate and sort 30K games for each predicted game.  That goes a lot slower than (say) just updating RPI.  It takes several hours to make one run.

Let's take a look at what sort of "similar" games we find with a similarity function that uses TrueSkill, average points per possession (PPP), average points per possession allowed (PPPA), and average number of possessions (POSS):

SimilarityDateHomeTrueSkillPPPPPPAPossAwayTrueSkillPPPPPPAPossMOV
0.001/24/2011Pittsburgh104.71.210.9266.4Notre Dame70.611.130.9866.7-5
0.183/25/2007Florida93.461.190.9168.0Oregon71.661.120.9866.88
0.203/21/2009Connecticut104.31.230.9268.7Texas A&M68.391.070.9866.926
0.221/29/2009UCLA91.31.160.9365.8California70.501.120.9767.815

Our current game (the one we're trying to predict) is in the first row: the 1/24 conference game between Pittsburgh and Notre Dame.  The next three lines are the closest matches found.  In the closest match, Oregon's statistics are amazingly close to Notre Dame's.  Florida's aren't quite as close a match to Pittsburgh -- notably, Florida's strength (as measured by TrueSkill) is significantly less.  In all of the similar games, the home team wins handily -- we would predict Pittsburgh by 16 points based upon these games.  And that probably wouldn't be a bad prediction: Notre Dame's victory was considered a significant upset.

So how well does this work as a predictor?  So far, not very well.  About the best performance I've found is a MOV Error of 12.5 and a % Correct of about 68%.  There are a couple of positives: I'm still working on the code, so I may find some bugs or improvements.  And secondly, the predictions from this model are only correlated about 60% with (say) TrueSkill, which may make it useful as part of an ensemble model.

1. Scott,

I stumbled on your blog last week, and I've enjoyed the accounts of your journey to better NCAAB handicapping. Keep up the great work!

I plan to post several comments, but I'll start here....

How did you compute your similarity scores? As you note, it's generally a hard problem to pick the appropriate feature vectors and distance metric.... I'm sure you've tried k nearest neighbors and related similarity search algorithms (if so, please consider adding a blog). Any luck there? Any luck with different distance metrics?

I'd suggest viewing this as a clustering problem, and I'd take a look at some manifold learning approaches. Email me if you'd like some suggestions for reference papers, and maybe some helpful code.

Thanks,
Andy

2. Andy -- Thanks for the comment.

I actually started out with my own implementation of this, doing a similarity measure as shown above. I later figured out that I could use k-NN to achieve the same results and switched over to that. I've actually used it mostly for what I'm calling statistical prediction. I have some semi-interesting results there -- I'll try to do a posting about it... (I'm a little swamped for time recently.)