Friday, January 20, 2012

The Continued (Slow) Pursuit of Statistical Prediction

When we last met on this topic, I was inspired by the Four Factors to look at derived statistics created from the ratio of two existing statistics, e.g.,

Offensive Balance = (# 3-Pt Attempts) / (# FG Attempts)

My previous work in this area has convinced me of the value of looking at all possibilities, no matter how non-intuitive, so my approach was to look exhaustively at all the possible ratios between the ~35 base statistics.  That leads to some crazy statistics such as:

(Average # of fouls per possession by the home team's previous opponents) / 
    (Average offensive rebounds per game by the home team in previous games)

There turn out to be a number of difficulties with this approach.  (Perhaps surprisingly, crazy nonsensical statistics are not one of them.)

First, it's a lot of work just to generate the 1060 derived statistics.  (Only 1060 because I avoided inverse ratios, and avoided "cross-ratios" between the two teams.)  Initially I was generating a subset of these from within the Lisp code that pre-processes the game data.  That was painful to set up and slow to execute.  Eventually I discovered a way to generate the derived statistics within RapidMiner. I was able to drive this from a data file, so I wrote a small Lisp program to generate the data file that RapidMiner could use to construct all 1060 derived statistics.
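As a rough illustration of the enumeration described above, here is a minimal Python sketch (not the Lisp or RapidMiner setup actually used). The stat names and both helper functions are hypothetical; the only ideas taken from the post are that each pair of base statistics yields one ratio (no inverses) and that ratios never mix the two teams.

```python
from itertools import combinations

# Hypothetical base statistic names; the real set has ~35 of them.
BASE_STATS = ["three_pt_att", "fg_att", "off_reb", "fouls"]


def derived_ratio_specs(base_stats, teams=("home", "away")):
    """List (name, numerator, denominator) for every within-team ratio.

    combinations() yields each unordered pair once, so a/b is generated
    but its inverse b/a is not, and a team's stats are only ever paired
    with that same team's stats (no cross-ratios).
    """
    specs = []
    for team in teams:
        for num, den in combinations(base_stats, 2):
            numerator, denominator = f"{team}_{num}", f"{team}_{den}"
            specs.append((f"{numerator}_per_{denominator}", numerator, denominator))
    return specs


def add_derived_stats(game, specs):
    """Return a copy of one game's attributes with the ratios added."""
    row = dict(game)
    for name, numerator, denominator in specs:
        # Skip a ratio whose denominator is zero rather than dividing by it.
        if row[denominator] != 0:
            row[name] = row[numerator] / row[denominator]
    return row


specs = derived_ratio_specs(BASE_STATS)
# Made-up numbers for a single game, purely for illustration.
game = {"home_three_pt_att": 18, "home_fg_att": 60, "home_off_reb": 12, "home_fouls": 20,
        "away_three_pt_att": 25, "away_fg_att": 50, "away_off_reb": 0, "away_fouls": 15}
row = add_derived_stats(game, specs)
print(len(specs), row["home_three_pt_att_per_home_fg_att"])
```

With the stats listed in this order, the first ratio generated for each team is the "Offensive Balance" statistic from the top of the post.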

Second, this amount of data tended to overwhelm my tools.  With the derived attributes, each game has about 1200 attributes total.  My training corpus has about 12K games.  The combination tended to break most of the data modeling features of RapidMiner, usually by overwhelming the memory capacity of Java.  Even when the software was capable of handling the data volume, operations like a linear regression might take hours (or days!) to complete, so testing and experimenting was laborious at best.

One way to reduce this problem is to thin the dataset, by testing on (say) a tenth of the full corpus.  But that introduced a new problem: overfitting.  If I used (say) a tenth of the data, I had about 1200 games in my test set -- just about the same number of test games as attributes.  The result of that is almost invariably a very specific model that does extremely well on the test data and very poorly on any other data.

Another approach is to thin the attributes.  This is the feature selection problem.  The idea is to select the best (or at least some reasonably good) set of features for a model.  The stupid (but foolproof) way to do this is to try every possible combination of features, and select the best combination.  But of course that's infeasible in many cases (such as mine), so a variety of alternative approaches have been created.  RapidMiner has some built-in capabilities for feature selection, and there's a nice extension to add more feature selection capabilities here.
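To see why the try-everything approach is infeasible here, a small Python sketch (hypothetical helper names and a made-up scoring function, nothing from RapidMiner) counts the candidate subsets and runs the brute-force search on a toy problem.

```python
from itertools import combinations


def subset_count(n_features):
    """Number of non-empty feature subsets: 2^n - 1."""
    return 2 ** n_features - 1


def best_subset(features, score):
    """Exhaustive search: score every non-empty subset, keep the best."""
    best, best_score = None, float("-inf")
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            current = score(subset)
            if current > best_score:
                best, best_score = subset, current
    return best, best_score


# Toy scoring function (an assumption for illustration): each feature has a
# made-up usefulness, and every feature used costs a small complexity penalty.
USEFULNESS = {"a": 3.0, "b": 1.0, "c": -2.0}


def toy_score(subset):
    return sum(USEFULNESS[f] for f in subset) - 0.5 * len(subset)


print(best_subset(["a", "b", "c"], toy_score))
# The count for 1060 derived statistics is a number hundreds of digits long.
print(len(str(subset_count(1060))), "digits")
```

Three features mean only 7 subsets to score, which is why the toy search finishes instantly. Each additional feature doubles the work.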

I experimented with a variety of different feature selection approaches.  I was hopeful that different approaches would show overlap and help identify derived attributes that were important, but for the most part that did not happen.  However, taking all the attributes recommended by any of the feature selection approaches did give me a more reasonably sized population of derived statistics to test.

More on this topic next time.

Tuesday, January 17, 2012

Basketball Season Underway

I spent the last few days scraping game data, dusting off code and generally getting the basketball predictor back online.  The current version of the predictor uses an average of 4 linear regressions.  These models are based upon: (1) the Govan rating, (2) the TrueSkill rating, (3) a Batch Gradient Descent (BGD) rating, and (4) a rating based on a wide variety of statistical measures (such as "offensive rebounds per possession").  Individually, each of these models has an RMSE of less than 11 on my test corpus.  Unfortunately, they're all highly correlated, so the combined model doesn't do any better than the best of the underlying models.  Currently it has an RMSE of 10.79 on my test corpus.
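The effect of correlation on an averaged model can be sketched in a few lines of Python. This is only an illustration with made-up numbers, not the predictor's code: `rmse` and `average_predictions` are hypothetical helpers, and the three "models" are hand-picked so that two share identical errors and one has offsetting errors.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual margins."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def average_predictions(per_model_predictions):
    """Average several models' predictions game by game."""
    return [sum(game) / len(game) for game in zip(*per_model_predictions)]


# Made-up margins of victory for four games, for illustration only.
actual = [3.0, 6.0, 8.0, -34.0]

# Two models that make the same errors on every game (fully correlated).
model_a = [5.0, 4.0, 10.0, -30.0]
model_b = [5.0, 4.0, 10.0, -30.0]
correlated = average_predictions([model_a, model_b])

# A model whose errors are the mirror image of model_a's.
model_c = [1.0, 8.0, 6.0, -38.0]
offsetting = average_predictions([model_a, model_c])

print(rmse(model_a, actual), rmse(correlated, actual), rmse(offsetting, actual))
```

Averaging the two correlated models leaves the error exactly where it was, while averaging models with offsetting errors cancels it. Real models fall somewhere in between, and highly correlated ones sit close to the first case.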

During the season I compare the model predictions against the line and "bet" games where the prediction differs significantly from the line.  "Significantly" is a relative term.  When I first started doing this, my model often differed from the line by 10 points or more.  As the model has improved, those differences have narrowed considerably.  (As would be expected.  The line is usually the best predictor.)  In my testing so far this year, I've only seen a difference of more than 5 points once.  There is some good mathematical work on sizing wagers based upon bankroll, perceived advantage, etc., but I've gone to a simple approach of betting $10 with an advantage of less than 5 points and $20 with an advantage of more than 5 points.  (Adopted after the 1/14 games shown below.)
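The two-tier rule is simple enough to write down as a Python sketch. The function name and parameters are hypothetical. Two things here are my assumptions rather than statements from the post: an advantage of exactly 5 points gets the smaller bet, and the side is chosen from the sign of (prediction - line), with both numbers expressed from the home team's point of view.

```python
def pick(prediction, line, small=10, large=20, threshold=5.0):
    """Return (side, dollars) for one game under the simple two-tier rule.

    Assumptions (not stated in the post): a positive prediction - line
    means backing the home team against the line, a negative one means
    backing the away team, and exactly `threshold` points gets `small`.
    """
    advantage = prediction - line
    side = "home" if advantage > 0 else "away"
    dollars = large if abs(advantage) > threshold else small
    return side, dollars


# Predicted home margin 17.3 against a line of 13.5: a 3.8-point edge.
print(pick(17.3, 13.5))
# Predicted home margin 8.8 against a line of 16: a 7.2-point edge the other way.
print(pick(8.8, 16))
```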

Here are the games the model has "bet" so far (no real money was harmed):

Date  Home           Score  Away              Score  MOV  Line  Pred  Adv   Risk  Win    Result  Won  v.Line
1/14  Tennessee St.  52     SIU Edwardsville  49     3    16    8.8   -7.2  20    17.39  17.39   1    1
1/14  LA Lafayette   87     Florida Intl.     81     6    10    5.1   -4.9  20    19.05  19.05   1    1
1/14  Murray St.     81     Tennessee Tech    73     8    12    16.5  4.5   20    18.18  -20     1    0
1/14  Houston        55     Memphis           89     -34  -8.5  -4.2  4.3   20    17.39  -20     1    0
1/15  Ohio St.       80     Indiana           63     17   13.5  9.1   -4.4  10    9.09   -10     1    0
1/15  Bradley        78     Northern Iowa     67     11   -10   -7.2  2.8   10    8.70   8.70    0    1
1/15  USC            47     UCLA              66     -19  2     1.5   -0.5  10    9.09   9.09    0    1
1/16  Syracuse       71     Pittsburgh        63     8    13.5  17.3  3.8   10    9.09   -10     1    0

So far this season the model is 50% against the line (and consequently down about $6) and 75% picking the correct outcome.  The (evolving) model picked 38 games last year, and over the two seasons so far is at a 63% win percentage and 60% versus the line (+$133).  Both are probably short-term aberrations -- the model has a 74% win percentage when tested against my corpus of 12K games.
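For anyone who wants to check the arithmetic, this short Python sketch tallies this season's summary from the Risk, Result, Won and v.Line columns of the table above. The tuple layout and the `summarize` helper are just one hypothetical way to organize it.

```python
# (risk, result, won, v_line) for each of the eight games in the table.
BETS = [
    (20, 17.39, 1, 1),   # Tennessee St. vs. SIU Edwardsville
    (20, 19.05, 1, 1),   # LA Lafayette vs. Florida Intl.
    (20, -20.00, 1, 0),  # Murray St. vs. Tennessee Tech
    (20, -20.00, 1, 0),  # Houston vs. Memphis
    (10, -10.00, 1, 0),  # Ohio St. vs. Indiana
    (10, 8.70, 0, 1),    # Bradley vs. Northern Iowa
    (10, 9.09, 0, 1),    # USC vs. UCLA
    (10, -10.00, 1, 0),  # Syracuse vs. Pittsburgh
]


def summarize(bets):
    """Net dollars plus straight-up and against-the-line percentages."""
    games = len(bets)
    return {
        "net": round(sum(result for _, result, _, _ in bets), 2),
        "win_pct": 100 * sum(won for _, _, won, _ in bets) / games,
        "line_pct": 100 * sum(v_line for _, _, _, v_line in bets) / games,
    }


print(summarize(BETS))
```

The Result column nets out to a loss of a little under $6, with 6 of 8 winners picked correctly and 4 of 8 bets won against the line.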

I won't generally be posting predictions, but I will try to summarize the model's performance a few times during the season, as I'm sure it makes for interesting reading :-).

Wednesday, January 11, 2012

Football Wrap-Up

A quick wrap-up of my performance predicting NCAA football.

This experiment started around the beginning of October, when some friends challenged me to use my program to predict football against a couple of other guys.  In addition to predicting games, we would be "betting" against the line.  We could use any betting strategy we desired to allocate $40 per week.  The default strategy was to bet the biggest differences between the prediction and the line, allocating bets of $10, $8, $6... etc.  My own betting strategy was a bit more complex.  I allocated money according to the formula:

$$ = 80*ABS(Prediction-Line)/(100+5*ABS(Line))

The idea is to scale the bet to the relative magnitude of the difference between the prediction and the line.  A difference of 3 points is much more significant when the line is 3 than when the line is 27.
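Reading the formula as 80*|Prediction - Line| / (100 + 5*|Line|), a minimal Python sketch shows the scaling at work on the two cases just mentioned. The function name and the example predictions are made up for illustration.

```python
def bet_size(prediction, line):
    """Dollars to allocate: 80*|prediction - line| / (100 + 5*|line|)."""
    return 80 * abs(prediction - line) / (100 + 5 * abs(line))


# The same 3-point disagreement with the line, at two different lines.
small_line = bet_size(prediction=6, line=3)    # 240 / 115
large_line = bet_size(prediction=30, line=27)  # 240 / 235

print(round(small_line, 2), round(large_line, 2))
```

The 3-point difference earns roughly twice the stake when the line is 3 as when the line is 27, which is the intended effect.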

I predicted games from Oct 16 through the end of the bowl season.  My program doesn't account for neutral site games, so the bowl games were treated as home games for the higher-ranked team.  (This works well in practice on the basketball side for the NCAA tournament.)  I predicted a total of 224 games.  The results:

Correct game winner            73%
Correct pick against the line  56%
Betting result                 +$29

Overall, better results than I expected.  56% against the line is sufficient to be a winning bettor (if it can be maintained).
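The "sufficient to be a winning bettor" claim can be checked with a little arithmetic, sketched here in Python. The pricing is my assumption, not something stated in the post: a typical bet against the spread risks $110 to win $100. The helper names are hypothetical.

```python
def break_even_rate(risk, win):
    """Fraction of bets that must win for the bettor to break even."""
    return risk / (risk + win)


def expected_profit_per_dollar(hit_rate, risk, win):
    """Average profit per dollar risked at a given winning percentage."""
    return hit_rate * (win / risk) - (1 - hit_rate)


# Assumed pricing: risk 110 to win 100 on each bet against the line.
RISK, WIN = 110, 100

print(round(100 * break_even_rate(RISK, WIN), 2))            # break-even %
print(round(expected_profit_per_dollar(0.56, RISK, WIN), 4))  # edge at 56%
```

Under that assumption the break-even point is a bit above 52%, so a sustained 56% clears it with a small positive expected return per dollar risked.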

Thursday, January 5, 2012

A Call from Bill Hancock

(I promise to get back to the prediction stuff soon -- after a dalliance with NCAA football I've started to ramp back up for basketball.  In the meantime, this imagined scenario from last night, inspired by my earlier comment that the NCAA was only allowing ten players on defense to spice up the bowl season.)

Phone call at spacious Turner Mansion last night:

(Ring, Ring)

Me:  Hello?  Oh, hello Mr. Hancock.  How is your job as BCS Executive Director going?

Hancock: (mumble mumble mumble)

Me: Well, you're welcome.  I'm glad my suggestion to only play ten players on defense has worked out so well.

Hancock: (mumble mumble mumble)

Me: *Nine* on defense?  No, I'm not sure that's a good idea.  We've been counting on the fact that most sports writers can't count past ten.   So far they haven't noticed.  But you put nine players out there and someone is going to write about it.  And where does it all end?  Eight players?  Seven players?

Hancock: (mumble mumble mumble)

Me: No, sir, that was a joke.  I'm not recommending seven players on defense.  Listen, I don't think this is a good idea.  Baylor just obliterated the points scoring record for a bowl game.  This is Baylor, the doormat of the Big 12, a university whose only men's championship is in *tennis*.  And then you had Wisconsin -- Wisconsin of all teams! -- throwing the ball all over the field and scoring 38 points.  That's more than the Wisconsin basketball team scored last season.  I realize you want to turn it up to eleven for the Orange Bowl, but this is not a good idea.

Hancock: (mumble mumble mumble)

Me: You're worried about Clemson's defense?  With all due respect, sir, Clemson is an ACC team.  The last time the ACC won a meaningful bowl game it was actually played for a bowl.  If you gave the ACC space lasers they couldn't defend Fort Knox against a Boy Scout troop.

Hancock: (mumble mumble mumble)

Me: True, it is West Virginia.

Hancock: (mumble mumble mumble)

Me: No, sir, West Virginia is part of the United States.

Hancock: (mumble mumble mumble)

Me: No apology necessary.  It's a common misconception.

Hancock: (mumble mumble mumble)

Me: Well, you do what you have to do, sir.  Personally, I'm a traditionalist.  Just tell the officials the result and let them take care of it.  That's worked for Duke basketball for decades and no one's the wiser.  Do they have a "charging" call in football?  I can't remember.  But I'm sure you'll make a good decision.

Hancock: (mumble mumble mumble)

Me:  "Bet the over"?  Ha, ha, good one, sir.