Tuesday, November 15, 2011

k-NN Prediction

"Andywocky" commented not too long ago on my Prediction by Similarity posting asking whether I'd looked at k-nearest neighbors (k-NN) algorithms.  At the time I made the original posting I hadn't, but shortly thereafter I had a "D'oh" moment and realized that what I was doing was re-creating k-NN.  So I re-created some of the work I'd done using RapidMiner's k-NN operator.

The basic idea behind k-NN is that we predict the outcome of a new game by finding some number of similar past games, and then use those (say by averaging) to create a prediction for the new game.  The "k" in "k-NN" refers to the "some number" of similar past games -- k might be 5 or 50, indicating that we were using the five most similar, or 50 most similar past games.  "Nearest Neighbor" is just another way of saying similar.  If we think of the games living in a multi-dimensional space -- say a dimension for each statistical value for the game (e.g., rebounds per minute, free throw percentage, etc.) -- then the most similar games are the ones that are the nearest neighbors in this multidimensional space.

There are some subtleties in how this works.  For example, team free throw percentage might vary from (say) 50% to 100%, while rebounds per minute might vary from 0.00 to 0.056.  If we don't normalize those dimensions, one or the other is likely to be far more important in determining the nearest neighbor than the other. But a reasonable starting approach is to characterize each game with as many statistical properties as we have, normalize those to similar scales, and then predict MOV by averaging the MOVs of k nearest-neighbors.

Here's the result of doing that with k=10.  For comparison, I include the performance of the best linear regression predictor based upon the same statistical properties.

  Predictor    % Correct    MOV Error  
Best Statistical Predictor72.3%11.04
k-NN, k=1059.7%11.65

This isn't tremendous performance, but we have a few tweaks we can perform.  First, we can try varying k to see if some different number of neighbors provides better performance.  Some searching around produces the best performance in this case when k=41:

  Predictor    % Correct    MOV Error  
Best Statistical Predictor72.3%11.04
k-NN, k=1059.7%11.65
k-NN, k=4171.2%11.44

Interestingly, this shows a lot of improvement in games correct with only modest improvement in MOV error.

Another tweak we can look at is weighting our results.  Instead of doing a flat average of the 41 nearest neighbors, we can weight each neighbor's contribution to the answer by how close it is to the new game.  We can also try eliminating some of our dimensions to see if accuracy improves.  This provides some further improvement:

  Predictor    % Correct    MOV Error  
Best Statistical Predictor72.3%11.04
k-NN, k=1059.7%11.65
k-NN, k=4171.2%11.44
k-NN, k=44, weighted subset72.4%11.17

With this tweak k-NN is competitive with the best linear regression.  (Although they both trail the best predictors.)

I'm inclined to draw a couple of conclusions from these experiments.  First, 40+ neighbors is a large number, suggesting that while games between statistically similar may be broadly comparable, there's not a strong relationship.  Conversely, the improvement gained by using weighting suggests that closer is still better.  It would seem that good performance with this approach requires a moderate amount of generalization to help "wash out" the random component in game outcomes.

No comments:

Post a Comment