## Friday, January 20, 2012

### The Continued (Slow) Pursuit of Statistical Prediction

When we last met on this topic, I was inspired by the Four Factors to look at derived statistics created from the ratio of two existing statistics, e.g.,

Offensive Balance = (# 3-Pt Attempts) / (# FG Attempts)

Because my previous work in this area has convinced me of the value of looking at all possibilities, no matter how non-intuitive, my approach was to look exhaustively at all the possible ratios between the ~35 base statistics.  That leads to some crazy statistics such as:

(Average # of fouls per possession by the home team's previous opponents) /
(Average offensive rebounds per game by the home team in previous games)

There turn out to be a number of difficulties with this approach.  (Perhaps surprisingly, crazy nonsensical statistics are not one of them.)

First, it's a lot of work just to generate the 1060 derived statistics.  (Only 1060 because I avoided inverse ratios, and avoided "cross-ratios" between the two teams.)  Initially I was generating a subset of these from within the Lisp code that pre-processes the game data.  That was painful to set up and slow to execute.  Eventually I discovered a way to generate the derived statistics within RapidMiner. I was able to drive this from a data file, so I wrote a small Lisp program to generate the data file that RapidMiner could use to construct all 1060 derived statistics.
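For illustration, here is a minimal Python sketch of the "every ratio, but no inverses" idea. It is not the Lisp program or the RapidMiner setup described above; the statistic names, `ratio_definitions`, and `derive` are all hypothetical, and the sketch makes no attempt to reproduce the exact count of 1060.

```python
from itertools import combinations

# Hypothetical base statistic names; the real set has about 35.
BASE_STATS = ["three_pt_attempts", "fg_attempts", "fouls", "off_rebounds"]

def ratio_definitions(stats):
    """Return (name, numerator, denominator) for each unordered pair.

    combinations() yields each pair exactly once, so a/b is produced
    but the inverse ratio b/a is not.
    """
    return [(f"{num}_per_{den}", num, den) for num, den in combinations(stats, 2)]

def derive(game, definitions):
    """Compute the ratio values for one game (a dict of base statistics)."""
    derived = {}
    for name, num, den in definitions:
        # Guard against division by zero; None marks a missing value.
        derived[name] = game[num] / game[den] if game[den] else None
    return derived

definitions = ratio_definitions(BASE_STATS)

# Made-up numbers for a single game, purely to exercise the code.
example_game = {"three_pt_attempts": 18, "fg_attempts": 60, "fouls": 20, "off_rebounds": 0}
example_ratios = derive(example_game, definitions)
```

With `n` base statistics this produces `n * (n - 1) / 2` ratios, which is why the list of definitions is worth generating from a program rather than writing out by hand.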

Second, this amount of data tended to overwhelm my tools.  With the derived attributes, each game has about 1200 attributes total.  My training corpus has about 12K games.  The combination tended to break most of the data modeling features of RapidMiner, usually by overwhelming the memory capacity of Java.  Even when the software was capable of handling the data volume, operations like a linear regression might take hours (or days!) to complete, so testing and experimenting was laborious at best.

One way to reduce this problem is to thin the dataset, by working with (say) a tenth of the full corpus.  But that introduced a new problem: overfitting.  If I used (say) a tenth of the data, I had about 1200 games in my sample -- just about the same number of games as attributes.  The result of that is almost invariably a very specific model that does extremely well on the games it was built from and very poorly on any other data.
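The effect is easy to reproduce on synthetic data. The sketch below assumes nothing about the real game data: it draws purely random "attributes" and a purely random "outcome", uses as many games as attributes, and fits a linear model exactly. All names (`solve`, `make_games`, and so on) are made up for this example, and the sizes are shrunk from ~1200 to 30 to keep it quick.

```python
import random

def solve(a, b):
    """Solve the square system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def mean_abs_error(rows, targets, weights):
    """Average absolute difference between linear predictions and targets."""
    errors = [abs(sum(w * v for w, v in zip(weights, row)) - t)
              for row, t in zip(rows, targets)]
    return sum(errors) / len(errors)

rng = random.Random(0)
N_ATTRIBUTES = 30  # stand-in for ~1200 attributes

def make_games(count):
    """Random attributes and random outcomes: there is nothing real to learn."""
    rows = [[rng.gauss(0, 1) for _ in range(N_ATTRIBUTES)] for _ in range(count)]
    targets = [rng.gauss(0, 10) for _ in range(count)]
    return rows, targets

fit_rows, fit_targets = make_games(N_ATTRIBUTES)  # as many games as attributes
new_rows, new_targets = make_games(N_ATTRIBUTES)

weights = solve(fit_rows, fit_targets)
fit_error = mean_abs_error(fit_rows, fit_targets, weights)
new_error = mean_abs_error(new_rows, new_targets, weights)
```

Because there are as many free weights as games, `fit_error` comes out essentially zero even though the outcomes are pure noise, while `new_error` on fresh games is large. The model has memorized the games rather than learned anything.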

Another approach is to thin the attributes.  This is the feature selection problem.  The idea is to select the best (or at least some reasonably good) set of features for a model.  The stupid (but foolproof) way to do this is to try every possible combination of features, and select the best combination.  But of course that's infeasible in many cases (such as mine), so a variety of alternative approaches have been created.  RapidMiner has some built-in capabilities for feature selection, and there's a nice extension to add more feature selection capabilities here.
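To put a number on "infeasible": with 1060 derived statistics there are 2^1060 possible subsets of features. One of the simplest alternatives is a filter that ranks each attribute on its own and keeps the top few. The sketch below ranks by absolute Pearson correlation with the target on synthetic data. It is only one of many possible approaches and is not a description of what RapidMiner or its extensions do; `pearson`, `top_k_by_correlation`, and the column names are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def top_k_by_correlation(columns, target, k):
    """Return the k attribute names with the largest |correlation| to the target.

    columns maps an attribute name to its list of per-game values.
    """
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], target)),
                    reverse=True)
    return ranked[:k]

# Synthetic data: 20 pure-noise attributes plus one that tracks the target.
rng = random.Random(1)
target = [rng.gauss(0, 10) for _ in range(200)]
columns = {f"noise_{i}": [rng.gauss(0, 1) for _ in range(200)] for i in range(20)}
columns["signal"] = [t + rng.gauss(0, 5) for t in target]

selected = top_k_by_correlation(columns, target, 3)
```

A filter like this scores each attribute exactly once, so its cost grows linearly with the number of attributes. The trade-off is that it judges each attribute in isolation and will miss attributes that only matter in combination.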

I experimented with a variety of different feature selection approaches.  I was hopeful that different approaches would show overlap and help identify derived attributes that were important, but for the most part that did not happen.  However, taking all the attributes recommended by any of the feature selection approaches did give me a more reasonably sized population of derived statistics to test.
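The bookkeeping for this is just set arithmetic. A small sketch, with placeholder method and attribute names rather than real results:

```python
from collections import Counter

# Placeholder output of three feature selection runs (not real results).
selections = {
    "method_a": {"ratio_01", "ratio_07", "ratio_12"},
    "method_b": {"ratio_03", "ratio_07", "ratio_20"},
    "method_c": {"ratio_05", "ratio_12", "ratio_31"},
}

# Attributes recommended by at least one method: the reduced population to test.
recommended_by_any = set().union(*selections.values())

# Attributes every method agrees on: the overlap I was hoping for.
recommended_by_all = set.intersection(*selections.values())

# How many methods picked each attribute, for a softer notion of agreement.
votes = Counter(name for chosen in selections.values() for name in chosen)
```

In this made-up example the intersection is empty while the union is a manageable handful, which mirrors the situation described above.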

More on this topic next time.