In two previous postings, I took a look at Massey ratings and showed how to calculate the rating in Python. In this posting, we'll create a more complex version of the Massey rating and show how to calculate it.

Recall that the basic notion of the Massey rating is that the expected outcome of a game (in terms of margin of victory) is equal to the difference in ratings between the two teams:

$e(MOV_{ij}) = R_i - R_j$

So in some sense the rating $R$ is the "strength" of a team, and the stronger team is expected to win by about the difference in ratings between the two teams. An obvious extension of this approach is to think of teams as having not just one strength rating, but two: one for their offensive strength and one for their defensive strength. We expect a team's score to be it's offensive rating minus the defensive rating of its opponent:

$e(Score_{ij}) = O_i - D_j$

There's one more wrinkle to add before we jump into modeling this rating. Since we're now looking at scores instead of margin of victory, we have to account for the home court advantage. To do this, we'll use two equations, one for the home team and one for the away team:

$$e(Score_h) = O_h - D_a + HCA$$

$$e(Score_a) = O_a - D_h$$

The home team's score is expected to be boosted by the $HCA$ while the away team is not.

Given some game outcomes, how can we calculate the team ratings? Let's start with some game data.

games = [[0, 3, 83, 80], [1, 4, 91, 70], [1, 5, 80, 85], [2, 0, 73, 60], [2, 3, 73, 60],
         [3, 0, 85, 60], [4, 5, 70, 78], [4, 3, 85, 70], [5, 1, 91, 70], [5, 4, 60, 68]]

Here I've invented data for ten games between six teams. Each game is represented by four elements:

$$[Team_h, Team_a, Score_h, Score_a]$$

From each of these games we can generate two equations corresponding to the equations above. For example, for the first game:

$$83 = O_0 - D_3 + HCA$$$$80 = O_3 - D_0$$

Altogether we'll have twenty equations. To represent each equation, we'll use a row that looks like this:

$$[O_0, O_1, O_2, ... D_0, D_1, D_2, ... HCA]$$

To represent an equation, we place a 1 in the spot for the Offensive rating, a -1 in the spot for a Defensive rating, and a 1 in the HCA spot if this is the home team's score. Everything else is a zero. So the first game we get the following two rows:

$$[ 1, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 1.]$$$$[ 0, 0, 0, 1, 0, 0, -1, 0, 0, 0, 0, 0, 0.]$$

Do this for all the games and we end up with a 20x13 matrix. Here's the code to create that matrix:

import numpy as np
def buildGamesMatrix2(games, num_teams):
    l = len(games)
    M = np.zeros([l*2, num_teams*2+1])
    for i in xrange(l):
        g = games[i]
        M[i, g[0]] += 1
        M[i, g[1]+num_teams] += -1
        M[i, num_teams*2] += 1
        M[l+i, g[1]] += 1
        M[l+i, g[0]+num_teams] += -1
    return M

M2 = buildGamesMatrix2(games,6)
print M2

[[ 1.  0.  0.  0.  0.  0.  0.  0.  0. -1.  0.  0.  1.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0. -1.  0.  1.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  1.]
 [ 0.  0.  1.  0.  0.  0. -1.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  1.  0.  0.  0.  0.  0.  0. -1.  0.  0.  1.]
 [ 0.  0.  0.  1.  0.  0. -1.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0. -1.  1.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0. -1.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  1.  0. -1.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0. -1.  0.  1.]
 [ 0.  0.  0.  1.  0.  0. -1.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0. -1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  0. -1.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0. -1.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.  0. -1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0. -1.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0. -1.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0. -1.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0. -1.  0.]]

The way I built this array all the home teams are first, but the order is irrelevant. The next thing we need is a corresponding array with the 20 scores. Here's the code to build that:

def buildOutcomes2(games, n):
    l = len(games)
    E = np.zeros([l*2])
    for i in xrange(l):
        g = games[i]
        E[i] += g[2]
        E[l+i] += g[3]
    return E

E2 = buildOutcomes2(games, 6)
print E2

[ 83.  91.  80.  73.  73.  85.  70.  85.  91.  60.  80.  70.  85.  60.  60.
  60.  78.  70.  70.  68.]

Again, this looks like a row, but Python is happy to treat it as a column.

The full set of equations can be written as the matrix equation:

$$M * R = E$$

The problem is to solve this equation for $R$, the vector of team ratings. Since this is a set of linear equations, we can use Numpy least squares solver to find the answer:

R = np.linalg.lstsq(M2,E2)[0]
print R

[ 29.45226244  47.04230769  24.77895928  36.20520362  35.44954751
  38.06945701 -42.32149321 -42.49615385 -27.17126697 -40.27443439
 -31.4418552  -27.29253394   6.92307692]

The vector is the team ratings. For example, Team 0's offensive rating is 29.45226244 and its defensive rating is -42.32149321. The last term is the HCA. Let's make a little pretty print function to make it easier to examine:

def ppResults(r, n, M, E):
    for i in xrange(n):
        print "Team[%d]: (%0.2f, %0.2f)" % (i, r[i], r[i+n])
    print "HCA: %0.2f RMSE: %0.2f MAE: %0.2f" % (r[n*2], np.mean((np.dot(M,r)-E)**2)**0.5, np.mean(np.abs(np.dot(M,r)-E)))
ppResults(R,6, M2, E2)

Team[0]: (29.45, -42.32)
Team[1]: (47.04, -42.50)
Team[2]: (24.78, -27.17)
Team[3]: (36.21, -40.27)
Team[4]: (35.45, -31.44)
Team[5]: (38.07, -27.29)
HCA: 6.92 RMSE: 5.87 MAE: 4.45

I've also shown a couple of error measures, based upon the difference between the actual results ($E$) and the predicted results ($M*R$). You'll also note that the defensive ratings are a little strange. Since they're negative, they're actually mean something like "the number of points this team adds to the other team's score." The math doesn't care, but if it bothers us we can adjust the scores so that the defensive ratings are all positive:

R = np.linalg.lstsq(M2,E2)[0]
correction = np.min(R[6:12])
R[0:12] -= correction
ppResults(R, 6, M2, E2)

Team[0]: (71.95, 0.17)
Team[1]: (89.54, 0.00)
Team[2]: (67.28, 15.32)
Team[3]: (78.70, 2.22)
Team[4]: (77.95, 11.05)
Team[5]: (80.57, 15.20)
HCA: 6.92 RMSE: 5.87 MAE: 4.45

(Note that we have to be careful not to "correct" the HCA term in R[12]!) Now it is more apparent that Teams 2 and 5 play very good defense, and Teams 0 and 1 not so much.

We can also solve this problem using the LinearRegression function from SciKit Learn:

from sklearn import linear_model
clf = linear_model.LinearRegression(fit_intercept=False)
clf.fit(M2,E2)
correction = np.min(clf.coef_[6:12])
clf.coef_[0:12] -= correction
ppResults(clf.coef_, 6, M2, E2)

Team[0]: (71.95, 0.17)
Team[1]: (89.54, 0.00)
Team[2]: (67.28, 15.32)
Team[3]: (78.70, 2.22)
Team[4]: (77.95, 11.05)
Team[5]: (80.57, 15.20)
HCA: 6.92 RMSE: 5.87 MAE: 4.45

This generates exactly the same answer. Which shouldn't be surprising, because under the covers the LinearRegression function calls the least squares solver. The advantage of doing it this way is that we can easily swap out the LinearRegression for a different model, for example, a Ridge regression:

from sklearn import linear_model
clf = linear_model.Ridge(fit_intercept=False, alpha=0.1)
clf.fit(M2,E2)
correction = np.min(clf.coef_[6:12])
clf.coef_[0:12] -= correction
ppResults(clf.coef_, 6, M2, E2)

Team[0]: (71.63, 2.81)
Team[1]: (85.36, 0.00)
Team[2]: (65.39, 16.40)
Team[3]: (78.62, 4.20)
Team[4]: (75.55, 11.07)
Team[5]: (78.16, 14.12)
HCA: 9.95 RMSE: 6.10 MAE: 4.69

As you can see, this returns slightly different results. You can read about Ridge regression in detail elsewhere, but the gist is that it addresses some problems in least squares regression that often occur in real-world data. It's not important in this toy example, but if you actually want to calculate this rating for NCAA teams, it can be useful.

There's yet another approach we can use to solve this problem: stochastic gradient descent (SGD). SGD is a method for finding the minimum of a function. For example, consider the following function (taken from here):

from scipy import optimize
import pylab as plt
%matplotlib inline
import numpy as np
def f(x):
    return x**2 + 10*np.sin(x)
x = np.arange(-10, 10, 0.1)
plt.plot(x, f(x)) 
plt.show()

The minimum point of this function is around -1.3. How could we figure that out? Imagine that we could place a ball bearing on the graph of this function. When we let go, it will roll downhill, back and forth, and eventually settle to the lowest point. That's essentially how SGD works. It picks a point somewhere on the graph, measures the slope at that point, steps off a bit in the "downhill" direction, and then repeats. And because this is done numerically, it works for any function. Let's try it:

print "Minimum found at: %0.2f" % optimize.fmin_bfgs(f, 0)[0]

Optimization terminated successfully.
         Current function value: -7.945823
         Iterations: 5
         Function evaluations: 24
         Gradient evaluations: 8
Minimum found at: -1.31

That's pretty straightforward, so let's try this for our ratings. First we have to write the function that we're trying to minimize. If you think about it a moment, you'll realize that what we're trying to do is minimize the error in our ratings. That's just the term from our pretty-print function, so we can pull it out and create an objective function:

def objf(BH, M2, E2):
    return np.mean((np.dot(M2,BH)-E2)**2)**0.5

We also need to provide a first guess at values for BH. We'll take a simple way out and guess all ones. Then we can stick it all into the optimize function and see what happens.

first_guess = np.ones(13)
answer = optimize.fmin_bfgs(objf, first_guess, args=(M2, E2))
correction = np.min(answer[6:12])
answer[0:12] -= correction
ppResults(answer, 6, M2, E2)

Optimization terminated successfully.
         Current function value: 5.873501
         Iterations: 74
         Function evaluations: 1245
         Gradient evaluations: 83
Team[0]: (71.95, 0.17)
Team[1]: (89.54, 0.00)
Team[2]: (67.28, 15.32)
Team[3]: (78.70, 2.22)
Team[4]: (77.95, 11.05)
Team[5]: (80.57, 15.20)
HCA: 6.92 RMSE: 5.87 MAE: 4.45

And sure enough, it finds the same answer.

There are several drawbacks to solving our ratings this way. First of all, it can be slower (possibly much slower) than the linear least squares solution. Secondly, it can sometimes fail to find the true minimum of the function. It can get "trapped" into what is called a local minimum. Recall the graph of the example function:

plt.plot(x, f(x)) 
plt.show()

See the little valley around 4? That's a local minimum. If we happen to start in the wrong place, SGD can roll down into that minimum and get stuck there.

print "Minimum found at: %0.2f" % optimize.fmin_bfgs(f, 3, disp=0)[0]

Minimum found at: 3.84

Oops. There are ways to address this problem, but I won't discuss them here.

Despite these drawbacks, SGD has a couple of big advantages over linear regression. First, linear regression always optimizes over mean square error (that's the RMSE metric), and that may not be what you want. For example, The Prediction Tracker shows both mean square error and mean absolute error (that's MAE). My predictor was often better at RMSE and worse at MAE, which irked me. If it had bothered me enough, I could have used SGD to optimize on MAE instead of RMSE:

def ourf(BH, M2, E2):
    # Note the different return value here
    return np.mean(np.abs(np.dot(M2,BH)-E2))
first_guess = np.ones(13)
answer = optimize.fmin_bfgs(ourf, first_guess, args=(M2, E2))
correction = np.min(answer[6:12])
answer[0:12] -= correction
ppResults(answer, 6, M2, E2)

Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 3.200011
         Iterations: 94
         Function evaluations: 2802
         Gradient evaluations: 186
Team[0]: (76.79, 0.00)
Team[1]: (88.79, 2.00)
Team[2]: (66.79, 18.32)
Team[3]: (78.47, 0.32)
Team[4]: (78.79, 8.47)
Team[5]: (86.47, 15.32)
HCA: 6.53 RMSE: 6.98 MAE: 3.20

This solution has a better MAE (and a worse RMSE). In general, you can choose any objective function to optimize through SGD, so that provides a lot of flexibility.

The second advantage of SGD is that it isn't limited to linear functions. You could use it to minimize a polynomial or other complex function. That isn't directly relevant for this rating system, but it could be for a different system.

Net Prophet

Friday, October 2, 2015

Calculating the Massey Rating, Part 3