Imagine that you have a model that predicts the final margin of victory (MOV) of a college basketball game. If you ran this model on a bunch of training examples, you could calculate the error for each example (the actual MOV minus the predicted MOV). You could then try to train a second model to predict that error. And if you could do that, your combined prediction (the predicted MOV from the first model + the predicted error from the second model) would be better than your first model. This idea is known as gradient boosting (or additive regression).
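The residual-correction idea can be sketched in a few lines. This is a toy illustration (not the Pain Machine's actual model), assuming scikit-learn is available and using made-up data in place of real game results:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a nonlinear relationship a single weak model can't fully capture.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=200)

# First model: a depth-1 tree (a deliberately weak base model).
model1 = DecisionTreeRegressor(max_depth=1).fit(X, y)

# Second model: trained to predict the first model's errors (residuals).
residuals = y - model1.predict(X)
model2 = DecisionTreeRegressor(max_depth=1).fit(X, residuals)

# Combined prediction = first model's prediction + predicted error.
combined = model1.predict(X) + model2.predict(X)

mse1 = np.mean((y - model1.predict(X)) ** 2)
mse2 = np.mean((y - combined) ** 2)
```

On the training data, the combined prediction has a smaller error than the first model alone, because the second model has soaked up some of the structure left in the residuals.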

However, to be useful, gradient boosting has to find some structure in the errors left by the first model. To understand why this matters, consider using gradient boosting with linear regression. The linear regression algorithm produces the *best* linear solution to the data: there is no other linear solution that yields smaller errors. So if you fit a second linear regression to the errors that are left behind, you get no improvement over the original model, because no improvement is possible.

There are two ways to overcome this limitation. The first is to use a weaker base model. That may seem non-intuitive, but it has the nice property of automatically turning a weak model into a stronger one. (And in fact gradient boosting is usually applied to tree models for just this reason.) The second approach is to use a different type of model for the second model. For example, we could build a linear regression in the first step, and then a tree model in the second step. The tree model could pick up some non-linear information in the errors that the linear regression would miss.
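The second approach (mixing model types) might look like this. Again a hedged sketch assuming scikit-learn and synthetic data: a linear trend with a nonlinear wiggle that the linear regression alone will miss:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = 2 * X[:, 0] + np.sin(3 * X[:, 0])  # linear trend + nonlinear wiggle

# Step 1: linear regression captures the linear trend.
lin = LinearRegression().fit(X, y)
resid = y - lin.predict(X)

# Step 2: a tree model picks up the nonlinear structure
# left behind in the residuals.
tree = DecisionTreeRegressor(max_depth=3).fit(X, resid)

combined = lin.predict(X) + tree.predict(X)
mse_lin = np.mean((y - lin.predict(X)) ** 2)
mse_combined = np.mean((y - combined) ** 2)
```

Here the tree can only help: the linear regression has already removed the best linear fit, so anything the tree finds in the residuals is information the first model could not express.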

Gradient boosting is just one form of boosting. In general, "boosting" methods try to improve accuracy by boosting the importance of training examples that have errors. One approach is to weight the training examples, train a base classifier, find the examples it gets wrong, increase the weight of those examples, and train a new classifier on the reweighted data. The idea is that by forcing the base classifier to pay more attention to the incorrect examples, it will become more accurate on those examples. A popular version of this approach is called adaptive boosting, or AdaBoost, and has proven to be very effective for a wide range of problems.
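The reweighting loop is compact enough to write out directly. This is a bare-bones AdaBoost sketch on toy data (assuming scikit-learn for the decision stumps; the number of rounds and the data are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary labels

n = len(y)
w = np.full(n, 1.0 / n)  # start with uniform example weights
models, alphas = [], []

for _ in range(5):
    # Train a weak classifier (a depth-1 "stump") on the weighted data.
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # this model's vote
    # Increase the weight of misclassified examples, decrease the rest.
    w *= np.exp(alpha * np.where(pred == y, -1, 1))
    w /= w.sum()
    models.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote over the stumps (labels mapped to +/-1).
score = sum(a * (2 * m.predict(X) - 1) for a, m in zip(alphas, models))
ensemble_pred = (score > 0).astype(int)
```

Each round, the examples the previous stump got wrong carry more weight, so the next stump is pushed toward fixing them; the final vote weights each stump by how accurate it was.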

Machine learning toolkits like WEKA and RapidMiner have various forms of boosting built in, including both additive regression and AdaBoost, but the idea is straightforward enough that it's not difficult to implement yourself if you need more control over the approach. In the case of the Pain Machine, I haven't found a boosting approach that improves the accuracy of the base linear regression, but I have found that some tree approaches can be boosted to an accuracy close to that of the linear regression. I suspect that boosting is less useful for problem domains with large amounts of noisy data, or for models that are already close to the best possible performance, but that's just speculation on my part.