Tuesday, September 29, 2015

Missed It By *That* Much!

After five years of working on predicting college basketball, I've honed a decent intuition for the problem.  That applies doubly to my own predictor.  When I add a feature I have a pretty good idea of how it will affect performance.  Anything too significant and I immediately suspect that I've had data leakage or some other error.  In fact, this happened to me enough in the first year or so that I eventually re-architected my predictor specifically to create separation between the processing of the season's games and the prediction of scheduled games, and I carried that architecture over into the Python rewrite.

So I was a little dubious when I added a new feature to the predictor and saw a significant increase in performance.  The increase wasn't enormous (although with tuning it did become even better) but it was enough to trigger my "Spidey sense".  I started carefully inspecting my code, but I couldn't find any leaks.  The code was parallel to a similar feature that I was confident was sound, and I hadn't done anything obvious to violate my separation architecture, so it was a bit puzzling.

My ability to test the problem directly was limited (because portions of the architecture aren't yet built out in Python), but what testing I could do seemed to confirm there was a problem.  Without many other options, I broke out the debugger and started stepping through the program while monitoring the data structures, which is a laborious task when you're processing 32,000 games and your model has 500+ features.

Eventually I found the problem.  At one point in the code where I thought I was copying a value out of an array to save it elsewhere, I was actually getting a 1x1 array -- i.e., an array containing just one value.  Python is so flexible with data structures that this actually worked fine throughout the rest of the code.  It turns out that a 1x1 array of a value is just about interchangeable with the value itself.
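
To make that concrete, here is a minimal sketch of how a 1x1 array can stand in for a plain value.  It assumes NumPy arrays (the post doesn't name the array library), and `ratings` is a made-up stand-in, not anything from the actual predictor.

```python
import numpy as np

# Hypothetical stand-in for one of the predictor's data arrays.
ratings = np.array([[71.5, 64.0, 80.25]])

# Integer indexing pulls out a single value...
scalar = ratings[0, 1]

# ...but slicing the same cell gives back a 1x1 array instead.
one_by_one = ratings[0:1, 1:2]

# Both work in ordinary arithmetic and comparisons, so nothing
# downstream complains about the difference.
total_from_scalar = scalar + 10.0
total_from_array = one_by_one + 10.0   # still a 1x1 array
is_big = bool(one_by_one > 60.0)       # a one-element array has a truth value
```

The only visible difference is the shape: `one_by_one.shape` is `(1, 1)`, while `scalar` has no dimensions at all.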

The rub is that Python doesn't copy arrays unless forced (with NumPy-style arrays, at least, a slice is a view rather than a copy).  So when I grabbed that 1x1 array, I was really just getting a view into the array it came from.  When the value in the original array changed, so did the value in my little 1x1 array.  The end result was that new values in the original array would leak back into what I thought were saved values.
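
Here is a small sketch of that kind of leak and two ways to avoid it.  It assumes NumPy arrays, and the names (`ratings`, `saved_view`, and so on) are invented for illustration; this isn't the predictor's actual code.

```python
import numpy as np

# Hypothetical stand-in for an array that gets updated as games are processed.
ratings = np.array([[71.5, 64.0, 80.25]])

# What I thought was a saved value: a 1x1 slice, which is a view
# sharing memory with `ratings`.
saved_view = ratings[0:1, 1:2]

# Two ways to really save the value.
saved_copy = ratings[0:1, 1:2].copy()   # explicit copy of the 1x1 array
saved_value = ratings[0, 1].item()      # plain Python float

# Later processing updates the original array...
ratings[0, 1] = 99.0

# ...and the update shows up in the view, but not in the copies.
leaked = saved_view[0, 0]
shares = np.shares_memory(saved_view, ratings)
```

After the update, `saved_view` holds the new value while `saved_copy` and `saved_value` still hold the old one.  `np.shares_memory` is a handy check when you suspect two arrays are aliased.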

This sort of thing leads a lot of people to hate untyped scripting languages like Python and JavaScript.  The flexibility is great, but it can lead to difficult-to-diagnose errors (and also poor performance).  I'm a little more pragmatic.  All the choices in programming languages have plusses and minuses, and in my experience they tend to largely balance out.

So in the end I didn't discover the secret of college basketball, but at least I found my bug.  So that's something.


  1. Is this really an "untyped" problem, or a problem with inconsistency in aliasing?

  2. Well, I'm not sure exactly how I'd classify the bug. I was just saying that this would have been caught in a statically typed language. You wouldn't be allowed to assign an array to a scalar.