She also gave me a pointer to her thesis, which is online here. I haven't yet had a chance to read it, but judging by the contents it has a lot of interesting material. It includes Matlab code as well, for those of you who are Matlabbers. (My own thesis is much less interesting :-)

As most people are probably aware, the PageRank algorithm was developed by Larry Page and Sergey Brin as a method for making enormous amounts of money. It also, coincidentally, turned out to be a great way to determine which search results are most relevant. The basic idea is that every link to a page acts as a "vote" for the relevancy of that page. The value of that vote depends upon the relevancy of the voting page, and so on ad infinitum. As you might guess if you have been following this blog, this is essentially an iterative algorithm, and in fact ends up in a formulation very similar to the matrix computation in LRMC.

In Govan's application of PageRank to rating sports teams, we construct an NxN matrix "A", where N is equal to the number of teams. When Team J beats Team I by MOV points, we set A[I,J]=MOV. This is a "vote" from Team I to Team J -- and the strength of this vote is proportional to the margin of victory.
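Building this matrix is straightforward. Here is a minimal sketch, assuming NumPy and a hypothetical list of game results (the team indices and margins below are purely illustrative):

```python
import numpy as np

# Hypothetical game results: (winner, loser, margin_of_victory),
# with teams indexed 0..N-1. These values are illustrative only.
games = [(1, 0, 7), (2, 0, 3), (2, 1, 10)]
N = 3

# When Team J beats Team I by MOV points, set A[I, J] = MOV:
# a "vote" from the loser to the winner, weighted by the margin.
A = np.zeros((N, N))
for winner, loser, mov in games:
    A[loser, winner] += mov  # accumulate if the teams meet more than once
```

Note that team 2 (undefeated in this toy data) ends up with a row of all zeros, which is exactly the problem addressed next.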

Next we adjust this matrix in several ways. First, we have to fix undefeated teams. Since they haven't lost a game, they aren't voting for anyone else -- their row in A is all zeros. This causes a problem when trying to solve for the ratings (a zero row can't be scaled to sum to 1, so the matrix isn't row-stochastic and the iteration won't converge to a meaningful rating). There are many possible fixes, but the one Govan chooses is to set every value in an undefeated team's row to 1/N -- in other words, the team splits its vote equally amongst all the teams. Second, we scale each row so that it sums to 1. Finally, we mix this matrix A with a "personalization" matrix V. This personalization matrix can be anything at all, but Govan chooses a simple matrix with every value set to 1/N. The mix is done like this:

G = 0.85 * A + 0.15 * V
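These adjustments can be sketched as follows, again assuming NumPy and the small illustrative matrix from the toy data above (team 2 is undefeated, so its row starts as all zeros):

```python
import numpy as np

N = 3
# A as built from the (illustrative) game results.
A = np.array([[0.0, 7.0, 3.0],
              [0.0, 0.0, 10.0],
              [0.0, 0.0, 0.0]])

# Fix undefeated teams: replace each all-zero row with a uniform 1/N row,
# splitting that team's vote equally amongst all teams.
zero_rows = A.sum(axis=1) == 0
A[zero_rows] = 1.0 / N

# Scale each row so it sums to 1 (now safe: no zero rows remain).
A = A / A.sum(axis=1, keepdims=True)

# Mix with the uniform "personalization" matrix V.
alpha = 0.85
V = np.full((N, N), 1.0 / N)
G = alpha * A + (1 - alpha) * V
```

Every row of G now sums to 1 and every entry is strictly positive, which is what guarantees the iteration below converges to a unique ratings vector.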

The constant 0.85 is called alpha, and is another somewhat arbitrary choice (the mix is alpha * A + (1 - alpha) * V). Having done all that, we solve for the dominant eigenvector of G using the Power Method (as we did with LRMC), by iteratively solving the equation:

π_{k+1} = π_{k} G

The resulting vector π represents the team ratings.
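The Power Method itself is just a few lines. Here is a sketch, assuming NumPy and a small example G built as above (the values are illustrative, with rows summing to 1):

```python
import numpy as np

N = 3
# An illustrative mixed matrix G (each row sums to 1, all entries positive).
G = np.array([[0.05, 0.60, 0.35],
              [0.05, 0.05, 0.90],
              [1/3,  1/3,  1/3]])

# Power Method: repeatedly apply pi_{k+1} = pi_k G from a uniform start
# until the vector stops changing.
pi = np.full(N, 1.0 / N)
for _ in range(1000):
    nxt = pi @ G
    if np.abs(nxt - pi).max() < 1e-12:
        break
    pi = nxt

# pi now approximates the dominant (stationary) eigenvector: the ratings.
```

In this toy example team 2 collected the biggest "votes", so it ends up with the highest rating.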

So we go off and merrily code this up and test it, and we find this performance:

Predictor | % Correct | MOV Error
---|---|---
TrueSkill + iRPI | 72.9% | 11.01
Govan (best) | 73.5% | 10.80
PageRank | 66.4% | 12.86

Sadly, not very impressive performance. There are a number of ways to tweak the PageRank algorithm. The most interesting turns out to be the factor "alpha" that controls the mix of the matrix A (based on game results) and matrix V (all teams ranked equally). The best-performing mix turns out to be when alpha is around 0.10:

Predictor | % Correct | MOV Error
---|---|---
TrueSkill + iRPI | 72.9% | 11.01
Govan (best) | 73.5% | 10.80
PageRank (alpha=0.85) | 66.4% | 12.86
PageRank (alpha=0.10) | 70.8% | 11.79

PageRank still isn't competitive with the best in breed of our algorithms, but it does improve considerably with this tweak. What's fascinating is that this balance would seem to almost completely discount the game results. I have no intuition for why such lopsided discounting of the game results should perform so well. It would be interesting to explore whether a similar approach could improve the performance of LRMC (which has a formulation very similar to PageRank's).

Other tweaks that I have tried (primarily tweaking the correction for undefeated teams) did not improve performance. Several variants suggest themselves to me, so I may try a few other experiments as time permits.
