A few miscellaneous things I've experimented with recently.
At some point over the past few years I snagged a dataset which had the locations (latitude and longitude) of many of the Division 1 arenas. I filled out the dataset and added in the dates the arenas were constructed. I also started grabbing the attendance numbers for games. With the locations of all the team arenas, I can calculate the travel distance for Away teams. (For Neutral site games, I just use a generic 500 mile travel distance for both teams.) I was then able to feed all this data into the predictor.
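For the curious, the travel distance between two arenas can be computed with the standard haversine (great-circle) formula. Here's a minimal sketch; the example coordinates are rough values for College Park, MD and Durham, NC that I'm using for illustration, not numbers pulled from my dataset:

```python
import math

def travel_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points,
    computed with the haversine formula."""
    radius = 3959  # Earth's mean radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))

# Roughly College Park, MD to Durham, NC -- a bit over 200 miles
travel_distance(38.9875, -76.9378, 36.0014, -78.9382)
```

For a model feature, accuracy down to the mile hardly matters; the formula just needs to separate the short bus trips from the cross-country flights.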
There's a small correlation between travel distance and performance; teams that travel farther generally perform more poorly. But the effect is small, and the impact on prediction accuracy is not significant.
Attendance numbers correlate positively with performance -- teams that draw lots of fans tend to do well. Cause and effect are probably reversed here -- fans turn out to see good teams. Again, there's little benefit to prediction accuracy from the attendance numbers, probably because that information is already captured in the other statistics that describe how good the team is. (If attendance were a good predictor it would be problematic, because we usually don't know the attendance numbers until the game is over!)
There's also a weak positive correlation between the date an arena was built and how well the team performs. This is probably because good teams get new arenas. For example, after winning the National Championship, the Maryland basketball team moved out of the rickety on-campus gym and into a shiny new arena.
On a different note, I also started calculating the variance of team statistics and using it for prediction. Variance measures how much a value fluctuates around its average. Imagine two teams that both average 10 offensive rebounds per game. Team A has gotten exactly 10 offensive rebounds in every game. Team B has gotten 0 offensive rebounds in half their games, and 20 in the other half. Their averages are the same, but Team B has a much higher variance.
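In code, the Team A / Team B example looks like this (using Python's statistics module; I'm using the population variance here just to keep the numbers round):

```python
import statistics

team_a = [10] * 10       # exactly ten offensive rebounds every game
team_b = [0, 20] * 5     # zero in half the games, twenty in the other half

statistics.mean(team_a)       # 10
statistics.mean(team_b)       # 10
statistics.pvariance(team_a)  # 0
statistics.pvariance(team_b)  # 100
```

Identical averages, wildly different variances -- which is exactly the information the average alone throws away.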
One of the obvious ways to use variance is to quantify "confidence". We might be much more confident in our prediction about a team that has very little variance (i.e., performs very consistently) than our prediction about a team that has a lot of variance. I've had some success with this (particularly for predicting upsets in the NCAA tournament) but it's not an area where I've yet done a lot of experimentation.
But interestingly enough, it turns out that variance has some value as a direct predictor of performance. In some cases, variance is a good thing -- a team with high variance in some statistic does better than a team with low variance. In other cases, it's the opposite. I'd noticed this in the past with some statistics (like Trueskill) that produce a variance as part of their calculation, but I decided to calculate variance consistently for all the statistics and test it for predictive value.
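"Calculating variance consistently for all the statistics" amounts to something like the sketch below. The function name and the data layout (per-team lists of per-game stat dictionaries) are just for illustration, not how my predictor actually stores things:

```python
import statistics
from collections import defaultdict

def variance_features(game_logs):
    """Given {team: [{stat_name: value, ...}, ...]} per-game logs,
    return {team: {stat_mean, stat_var, ...}} feature rows."""
    features = {}
    for team, games in game_logs.items():
        # Collect each statistic's per-game values
        per_stat = defaultdict(list)
        for game in games:
            for name, value in game.items():
                per_stat[name].append(value)
        # Emit a mean and a (population) variance feature for every statistic
        row = {}
        for name, values in per_stat.items():
            row[name + "_mean"] = statistics.mean(values)
            row[name + "_var"] = statistics.pvariance(values)
        features[team] = row
    return features
```

The model then gets a `_var` feature alongside every `_mean` feature and can pick out whichever variances turn out to carry signal.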
The results were mixed. Some of the variance calculations were statistically significant and were selected by the model. On the other hand, they didn't significantly improve accuracy. I ended up keeping a number of the variance statistics anyway, because they capture a different aspect of the data, and I hope that means they'll make the model more consistent overall.