I've gotten consumed in the last few days with a bug in the predictor, and I forgot I was intending to post the second part of this notebook. The good news is that in the meantime I figured out how to make these notebooks play better with Blogger.
In this posting I'll expand the basic Massey rating with a few additions. First, let's recover the game list and the definitions from the previous posting.
games = [[0, 3, 3], [1, 4, 21], [1, 5, 5], [2, 0, 13], [2, 3, 13],
         [3, 0, 25], [4, 5, 8], [4, 3, 15], [5, 1, 21], [5, 4, 8]]
import numpy as np
def buildGamesMatrix(games, num_teams):
    M = np.zeros([len(games), num_teams])
    row = 0
    for g in games:
        M[row, g[0]] = 1
        M[row, g[1]] = -1
        row += 1
    return M
M = buildGamesMatrix(games,6)
def buildOutcomes(games):
    E = np.zeros([len(games)])
    row = 0
    for g in games:
        E[row] = g[2]
        row += 1
    return E
E = buildOutcomes(games)
bh = np.linalg.lstsq(M,E)[0]
print bh
One point that Massey makes in his thesis is that it is sometimes impossible to solve this set of equations. This occurs when the teams are divided into two or more subsets that have never played each other. (In math terms, we can only solve the equations when the graph of teams is connected.) In practice this isn't a problem for me, because I don't start calculating ratings until about 800 games into the season, and by that time the graph is connected. But it can be a problem in a sport like football, so how do we address this?
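If you want to test that condition before trying to solve, a simple union-find over the team indices does the trick. (This little isConnected helper is my own illustration; it isn't in Massey's thesis or the previous posting.)
def isConnected(games, num_teams):
    # Union-find over team indices: two teams end up with the same root
    # if there's a chain of games connecting them.
    parent = range(num_teams)
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for g in games:
        a, b = find(g[0]), find(g[1])
        if a != b:
            parent[a] = b
    return len(set(find(t) for t in range(num_teams))) == 1
print isConnected(games, 6)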
Massey recommends forcing the graph to be connected by arbitrarily changing one row of $M$ to be all ones, and the corresponding element of $E$ to zero. For example:
print "\nBefore forcing connectivity:\n"
print M
print
print E
M[9,:] = 1.0
E[9] = 0.0
print "\nAfter forcing connectivity: \n"
print M
print
print E
Now we can re-run our ratings with this new $M$.
bh = np.linalg.lstsq(M,E)[0]
print bh
As you can see, this has (not surprisingly) changed the ratings. In addition to making the graph connected, this change also tries to make the mean of the ratings equal to 0.
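You can check that second effect directly. The all-ones row acts like one more "equation" asking the ratings to sum to zero, so the fitted ratings get pulled toward a zero mean. (This little check is my addition, not part of the original code.)
# With the all-ones row in the system, the mean of the ratings is pulled toward zero.
print np.mean(bh)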
I prefer a variant approach that adds the connectivity fix as an extra "game" instead of overwriting the last real game. That way no actual game data is lost, and the fix has little or no impact on the ratings.
M = buildGamesMatrix(games,6)
E = buildOutcomes(games)
M = np.vstack((M, [1, 1, 1, 1, 1, 1]))
print M
E = np.append(E, 0)
print E
And now we'll calculate the ratings again.
bh = np.linalg.lstsq(M,E)[0]
print bh
Now the connectivity fix doesn't affect the ratings. (In this case, anyway. If the graph weren't connected, or the ratings didn't already sum to zero, it would still have an effect.)
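Here's a quick way to convince yourself why (my check, not something from the original post): the extra row only asks that the ratings sum to zero, and the least-squares solution essentially satisfies that already, so the added "game" has nothing to pull against.
# The appended row's residual is just the sum of the ratings, which is ~0 here.
print bh.sum()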
Moving on, you may have noticed in the previous notebook about the Massey rating that there was no mention of whether teams were playing at home or away. Games were expressed as a winner, a loser and a margin of victory. So let's add home & away. To start with, we'll modify our game representation so that instead of

$$[\mathit{Winner},\ \mathit{Loser},\ \mathit{Margin}]$$

we have

$$[\mathit{Home},\ \mathit{Away},\ \mathit{MOV}]$$

where $MOV$ is the more familiar definition of the score of the home team minus the score of the away team. I've modified the games array from the previous posting to make a few games wins by the away team. Since I chose to put Home first in the new representation and changed Margin to MOV, nothing actually changes in the code. So let's calculate the ratings for this new set of games.
games = [[0, 3, 3], [1, 4, 21], [1, 5, -5], [2, 0, 13], [2, 3, 13],
         [3, 0, 25], [4, 5, -8], [4, 3, 15], [5, 1, 21], [5, 4, -8]]
M = buildGamesMatrix(games,6)
E = buildOutcomes(games)
M = np.vstack((M, [1, 1, 1, 1, 1, 1]))
E = np.append(E, 0)
bh = np.linalg.lstsq(M,E)[0]
print bh
So we now have Home/Away represented and we have forced connectivity. Another issue we might want to address is that the Massey rating system doesn't have the notion of a "home court advantage". You can see that it predicts the same game outcome for $Team_i$ versus $Team_j$ regardless of whether $Team_i$ is Home or $Team_j$ is Home.
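To see that concretely, here's a tiny prediction helper. (It's my own illustration, not something from the original posting, but it follows directly from the model.) Swapping which team is at home just flips the sign of the predicted margin:
def predictMOV(bh, home, away):
    # Basic Massey prediction: home team's rating minus away team's rating.
    return bh[home] - bh[away]

print predictMOV(bh, 0, 3)
print predictMOV(bh, 3, 0)   # same matchup, venue swapped -- just the negation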
One way we can try to address that shortcoming is to add an HCA constant to the expected outcome equation, like so:

$$MOV = R_{Home} - R_{Away} + C_{HCA}$$

(writing $R_{Home}$ and $R_{Away}$ for the home and away teams' ratings).
Here $C_{HCA}$ is a single constant shared by all games. We can represent it as a new column in $M$ of all 1s (because it affects every game), except in the connectivity row. (Do you see why? $C_{HCA}$ captures a bias in the scoring data, so we don't want to force it to sum to zero along with the ratings.)
M = buildGamesMatrix(games,6)
E = buildOutcomes(games)
# Adjust for connectivity
M = np.vstack((M, [1, 1, 1, 1, 1, 1]))
E = np.append(E, 0)
# Calculate rating & error before HCA
bh = np.linalg.lstsq(M,E)[0]
print "Error rating without HCA:\n"
print sum(map(abs, M.dot(bh)-E))/10
# Adjust for HCA
M = np.hstack((M, [[1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [0]]))
print "\nM with HCA column:\n"
print M
bh = np.linalg.lstsq(M,E)[0]
print "\nRatings with new HCA:\n"
print bh
$C_{HCA}$ is the last term in $bh$ and you can see that in this example it is quite substantial. Did adding this make our rating system any more accurate? Let's calculate the error with $C_{HCA}$:
sum(map(abs, M.dot(bh)-E))/10
So yes, in this case adding $C_{HCA}$ to our model resulted in an improvement of about 0.8 in our MAE.
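As a usage sketch (again my own addition; it assumes the HCA column is the last column of $M$, as built above), prediction now adds $C_{HCA}$ on top of the rating difference, so the two venues no longer give mirror-image predictions:
def predictMOVWithHCA(bh, home, away):
    # bh[-1] is C_HCA, because the HCA column was appended last.
    return bh[home] - bh[away] + bh[-1]

print predictMOVWithHCA(bh, 0, 3)
print predictMOVWithHCA(bh, 3, 0)   # no longer just the negation of the line above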
As a last note, if you read Massey's thesis you'll see he's actually solving a set of equations that looks like this:

$$M^T M\, b = M^T E$$

which is to say that he's multiplied both sides of $Mb = E$ by $M^T$ (the transpose of $M$). This has the effect of reducing the number of equations from the number of games down to the number of unknowns (here, the six team ratings plus $C_{HCA}$), as you can see here:
print "M is 11 rows long:\n"
print M
print "\nMt*M is 6 rows long:\n"
print M.T.dot(M)
Multiplying both sides by $M^T$ like this gives what are known as the normal equations, and their solution is exactly the least-squares solution of the original system, so you get the same result if you solve the new formulation.
bh = np.linalg.lstsq(M.T.dot(M),M.T.dot(E))[0]
print "\nRatings with new formulation:\n"
print bh
The advantage (at least in 1997, when Massey wrote his thesis) was that it was easier and faster to solve the smaller set of equations. I'm not sure that's still true, and it may be that NumPy's least-squares solver is smart enough to handle this efficiently on its own. But you'll get the same result either way.
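For what it's worth, once you have the square $M^TM$ system you can also hand it to an ordinary linear solver instead of the least-squares routine. This is just a sketch of that idea, and it assumes $M^TM$ is non-singular (which the connectivity row takes care of here):
# Solving the square "normal equations" directly gives the same ratings.
A = M.T.dot(M)
b = M.T.dot(E)
print np.linalg.solve(A, b)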
Look at you and your fancy Jupyter notebook self! (Note: the cool kids call it Jupyter these days, not IPython.) Looks great on my end; I think you got the layout sorted. Glad to see the Python conversion is going well! I'm curious how you find it vs your original Lisp code? I figured between Pandas and NumPy you have to be pretty happy with the ease of a lot of things (though admittedly I know very little about Lisp, and what similar functionality might exist). Did you track down the bug in your predictor yet?
Thanks, Brandon! I edited the CSS code to take out all the fixed "width:" attributes and that fixed the formatting problem, although I noticed that it also wiped out the header at the top of the blog (!). Oh well, good enough.
As far as the bug goes, it's really puzzling. I add a new attribute to my model that's just two existing attributes multiplied together. I build a model with that and the error for certain years shoots through the roof, i.e., it goes from something like 11 to (say) 4000. But inspection shows nothing particularly wonky with the new attribute. Normalizing the inputs to the model mostly fixes the problem, but I can't see why since the new attribute is in the same range as the existing attributes.
At any rate it was more a curiosity than anything and I've moved on :-)