## Wednesday, March 9, 2016

### That's Not Really A Number

Suppose that you're competing in the Kaggle competition and you're using team win ratios and average scoring for the season to predict who is going to win a game.  Your input to your model might look something like this:

| Team | Win Ratio | Ave. Score | Team | Win Ratio | Ave. Score |
|------|-----------|------------|------|-----------|------------|
| Michigan | 0.75 | 86.5 | UCLA | 0.73 | 81 |

Your results are mediocre, so you decide to improve your model by adding more information about each team.  The seeding of the team -- the NCAA Tournament committee's assessment of team strength -- seems like it would be useful for prediction, so you add each team's seeds to your inputs:

| Team | Win Ratio | Ave. Score | Seed | Team | Win Ratio | Ave. Score | Seed |
|------|-----------|------------|------|------|-----------|------------|------|
| Michigan | 0.75 | 86.5 | 5 | UCLA | 0.73 | 81 | 14 |

You've just made a mistake.  Do you see what it is?

The way you've added the seeding information, many machine learning tools and models are going to treat the seed as a number, no different from the Win Ratio or the Average Score.  And that's a problem, because the seed is not really a number.  It's actually what statisticians would call a categorical variable, because it takes one value out of a fixed set of arbitrary values.  (Machine learning types might be more likely to call it a categorical feature.)  If you're not convinced, imagine replacing each seed with a letter -- the #1 seed becomes the A seed, the #2 seed becomes the B seed, and so on.  This works perfectly well -- we could still talk about the A seeds being the favorites, we'd know that in the first round the A seeds play the P seeds, and so on.  It wouldn't make any sense to do that with a genuinely numeric feature like (for instance) the Average Score.

Another difference between real numeric features and categorical features that merely look like numbers is scale: numeric features have a fixed scale, while categorical features have an arbitrary one.  The difference between an Ave. Score of 86.5 and an Ave. Score of 81 is 5.5 points, and that's exactly the same amount as the difference between an Ave. Score of 78.8 and one of 73.3.  But the difference in strength between a 16 seed and a 15 seed might be quite different from the difference between a 1 seed and a 2 seed.  (Not to mention that the gap between a particular 16 seed and a particular 15 seed might be quite different from the gap between that same 16 seed and a *different* 15 seed!)

So if I've convinced you that team seeds are not really numbers but just look like numbers, then how should you represent them in your model?

The basic approach is something called "one hot encoding" (the name derives from a type of digital circuit design).  The idea behind one hot encoding is to represent each possible value of the categorical feature as a different binary feature (with value 0 or 1).  For any particular value of the categorical feature, one of these binary features will have the value 1 and the rest will be zero.  (Hence "one hot".)  To represent the seeds, we need 16 binary features:

| Team | Win Ratio | Ave. Score | Seed_1 | Seed_2 | Seed_3 | Seed_4 | Seed_5 | ... | Seed_16 | Team | Win Ratio | Ave. Score | Seed_1 | ... |
|------|-----------|------------|--------|--------|--------|--------|--------|-----|---------|------|-----------|------------|--------|-----|
| Michigan | 0.75 | 86.5 | 0 | 0 | 0 | 0 | 1 | ... | 0 | UCLA | 0.73 | 81 | 0 | ... |
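The encoding itself is simple enough to write by hand.  Here's a minimal sketch in Python (the function name is my own, not from any library):

```python
def one_hot_seed(seed, num_seeds=16):
    """Return a list of 16 binary flags, one per possible seed.

    Seed k maps to a vector with a 1 in position k-1 and 0s
    everywhere else -- exactly one flag is "hot".
    """
    if not 1 <= seed <= num_seeds:
        raise ValueError("seed must be between 1 and %d" % num_seeds)
    return [1 if s == seed else 0 for s in range(1, num_seeds + 1)]

# Michigan is a 5 seed, so Seed_5 is hot and everything else is 0.
print(one_hot_seed(5))
# → [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```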

This representation gives your model much more flexibility to learn the differences between the seeds.  For example, it could learn that being a 5 seed is every bit as valuable as being a 4 seed.  If you represented the seed as a number, it would be impossible to learn that.
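In practice you rarely hand-roll the encoding; most data tools will do it for you.  For example, pandas' `get_dummies` expands a categorical column into one binary column per value in a single call (the DataFrame below and its column names are my own illustration):

```python
import pandas as pd

# A toy input frame with the raw seed stored as a "number".
df = pd.DataFrame({
    "Team": ["Michigan", "UCLA"],
    "WinRatio": [0.75, 0.73],
    "AveScore": [86.5, 81.0],
    "Seed": [5, 14],
})

# get_dummies drops the Seed column and adds Seed_5, Seed_14, ...
# Note that it only creates columns for values that actually appear
# in the data, so with more teams you'd get all 16 Seed_* columns.
encoded = pd.get_dummies(df, columns=["Seed"])
print(encoded.columns.tolist())
```

One caveat: because only the values present in your data get columns, it's worth encoding over the full data set (or fixing the category list up front) so your training and test rows end up with the same columns.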