Friday, February 27, 2015

Five Mistakes Kaggle Competitors Should Avoid

#1 -- Don't think you're going to win the competition.

One of the results that came out of the analysis of last year's contest is that the winner was essentially random:  at least the top 200 entries could have credibly won the contest.  Why?  Evidence from the best predictors suggests that there is about 8 points or so of unpredictability in college basketball games.  That's a lot of randomness.  Last year, 32 of the 64 games in the Tournament were decided by 8 points or less.  So even if you have the most accurate predictor in the contest, you're almost certain to be beaten by someone who made a worse prediction and got lucky when it came true.  It's the same reason why the ESPN pool is usually won by an octopus or someone who picked based on mascot fashions. On the other hand, maybe this year you'll be the guy who gets lucky.  It could happen.

#2 -- Don't use the data from before the 2008-2009 season.

Isn't it nice of the Kaggle administrators to provide data back to 1985?

If you're not familiar with college basketball, you might not realize that the college game underwent a radical change at the beginning of the 2008-2009 season when the NCAA instituted the three-point shot at a consistent distance of 20' 9".  The three-point shot created whole new game strategies, and data from before that season is probably not easily applicable to today's game.

#3 -- The Tournament data is not enough for training or testing.

More like March Sadness

At 64 games a year, the Tournament just doesn't provide enough data for training or even testing a predictor with any reliability.  You may think you're being smart to build your model specifically for the Tournament -- imagine the advantage you'll have over all the other competitors that don't understand how different the Tournament is from the regular season.  Ha!

But actually you're just overfitting your model.   My own predictor needs about 15,000 training examples for best performance.  Your mileage may vary -- maybe you only need 14,000 training examples -- but there just isn't enough information in the Tournament games alone to do accurate prediction.  Particularly since you shouldn't use the games from before 2008 (see #2).  Of course, you can do all that and you might still win the contest (see #1).

#4 -- Beware of leakage!

Guess what?  It turns out that you can do a really good job of predicting the Tournament if you know the results ahead of time.  Who knew?

Now that's not a big problem in the real contest because (short of psychic powers) no one knows the results ahead of time.  But if the forums from last year and this year are any indication, it's a big problem for many Kagglers as they build and test their predictors.  Knowledge from the games they're testing creeps into the model and results in unrealistically good performance.

A First-Time Kaggle Competitor
There are three major ways this happens.

The first and most obvious way this happens is that a model is trained and tested on the same data.  In some cases you can get away with doing this -- particularly if you have a lot of data and a model without many degrees of freedom.  But that isn't the case for most of the Kaggle models.  If you train your model on the Tournament data and then test it on the same data (or a subset of the data), it's probably going to perform unreasonably well.  You address this by setting aside the test data so that it's not part of the training data.  For example, you could train on the Tournament data from 2008 to 2013 and then test on the 2014 Tournament.  (Although see #3 above about using just the Tournament data.)  Cross-validation is another, more robust approach to avoiding this problem.

The second way this often happens is that you unwittingly use input data that contains information about the test games.  A lot of Kagglers use data like Sagarin's ratings without understanding how these statistics are created.  (I'm looking at you, Team Harvard.)  Unless you are careful this can result in information about the test games leaking back into your model.  The most common error is using ratings or statistics from the end of the season to train a model for games earlier in the season.  For example, Sagarin's final ratings are based upon all the games played that season -- including the Tournament games -- so if you use those ratings, they already include information about the Tournament games.  But there are more subtle leaks as well, particularly if you're calculating your own statistics.

The third and least obvious way this happens is when you tune your model.  Imagine that you are building your model, taking care to separate out your test data and avoid using tainted ratings.  You test your model on the 2014 Tournament and get mediocre results.  So you tweak one of your model parameters and test your model again, and your results have improved.  That's great!  Or is it?  In fact, what you've done is leak information about the 2014 Tournament back into your model.  (This can also be seen as a type of overfitting to your test data.)  This problem is more difficult to avoid, because tuning is an important part of the model building process.  One hedge is to use robust cross-validation rather than a single test set.  This helps keep your tuning more general.

How can you tell when you're suffering from leakage?  Your performance can provide an indicator.  Last year's winner had a log-loss score of 0.52, and the median score was around 0.58.  If your predictor is getting performance significantly better than those numbers, then you're either (1) a genius, or (2) have a problem.  It's up to you to decide which.

#5 -- A Miscellany of Important Notes
  • College basketball has a significant home court advantage (HCA).  (And yes, there may be a home court advantage in Tournament games!) Your model needs to account for the HCA and how it differs for neutral court and Tournament games.  If your model doesn't distinguish home and away, you've got a problem.
  • College teams change significantly from season to season.  You can't use a team's performance in one season to predict its performance in another season.  (This seems obvious, but last year's Harvard team seems to have made this mistake.  On the other hand, they got a journal publication out of it, so if you're an academic this might work for you too.)
  • Entering your best predictions might not be the best way to win the contest.  Since the contest has a large random element (see #1 above) your best strategy might be to skew your predictions in some way to distinguish yourself from similar entries, i.e., you should think about meta-strategy.

Wednesday, February 25, 2015

JQAS Paper Reviews

Some of you who participated in last year's Kaggle contest may remember that the Journal of Quantitative Analysis in Sports (JQAS) solicited papers on the methods contestants used to predict basketball game outcomes in the NCAA tournament as part of the Kaggle contest.  The next issue of JQAS will contain five papers that resulted from this solicitation and the publisher has made the papers freely downloadable for a month after the issue is published as well as while they are posted in the "Ahead of Print" section on the JQAS site.  (I have also added them to my Papers archive.)  Below are short reviews of the five papers.

Michael J. Lopez  and Gregory J. Matthews, "Building an NCAA men's basketball predictive model and quantifying its success."

Lopez and Matthews won the 2014 Kaggle Contest.  The paper describes their approach as well as an analysis of how "lucky" their win was.

Lopez and Matthews employed a two-pronged prediction approach based upon (1) point spreads and (2) efficiency ratings (from Pomeroy).  They built separate regression models for points spreads and the efficiency ratings and combined them in a weighted average for their two submissions:  One that weighted point spreads at 75% and efficiency ratings at 25%, and one vice versa.  Since point spreads were only available for the first 32 games, Lopez & Matthews estimated the point spreads for the remaining games using an undescribed "proprietary" model.

Lopez & Matthews also analyzed how "lucky" they were to win the contest.  Their analysis suggests that luck is by far the biggest factor in the competition.  For example, they found that about 80% of the entries could have won the contest under different reasonable outcomes, and the true probability of their entry being the best was less than 2%.
Commentary:  While I appreciate that Lopez & Matthews took the time to write up their experience, I find myself disappointed that this approach ended up winning; it brings nothing novel or interesting to the problem.  Their analysis in the second part of the paper is interesting -- it confirmed my belief that the Kaggle contest was essentially a random lottery amongst the top few hundred entries.
Ajay Andrew Gupta, "A new approach to bracket prediction in the NCAA Men’s Basketball Tournament based on a dual proportion likelihood"

In this paper, Gupta describes his approach to predicting the Tournament games and also does some analysis of bracket strategy under traditional (1, 2, 4, 8, 16, 32) scoring.

Gupta's prediction approach is complex.  It involves team ratings based upon maximum likelihood and what Gupta terms a "dual proportion" model.  I won't attempt to summarize the math here -- it requires several appendices in the paper itself to describe -- the interested reader should consult the paper.

In the second half of the paper, Gupta addresses how to compose a tournament selection to do well in a traditional bracket competition.  His conclusion is to pick a high-probability upset for one to three late round games.
Commentary:  This paper is poorly written and confusing from start to finish.  I'm frankly very surprised that it was chosen for publication. 

One of the major problems is that uninteresting or unoriginal ideas are inflated with confusing descriptions.  For example, the paper presents the "dual proportion model" as a novel new approach.  So what is the "dual proportion model"?  "Each of the two teams in a game has a probability of winning the game, and these must add up to 100%."  That's hardly worthy of mention, much less to be held up as a new insight. 
Another major problem is the long list of unsupported assumptions throughout the model:  a scaling parameter `beta` "that applies to big wins, meaning at least 10 points" (Why scale big wins?  Why is 10 points a big win?), "However, [log-likelihood's] shape is better for bracket prediction." (Why is it better?)  "Some wins are more indicative of future wins than others are."  (Really?  What wins?  Why?)  "Point differences can also be deceptive..."  (What is your proof of this?)  "The strength-of-schedule adjustment works by reducing the strengths of the non-tournament teams in a weak conference."  (Why?)  There are many more examples.  None of these various assumptions are given any more than a vague explanation, and worse, none are tested in any way.  The result is a pastiche of unexplained, untested ideas that likely have little or no value.
One final nitpick is that this paper doesn't seem to have anything to do with the Kaggle competition, and all of its analysis is based upon the more standard pool scoring methods.
Andrew Hoegh, Marcos Carzolio, Ian Crandell, Xinran Hu, Lucas Roberts, Yuhyun Song and
Scotland C. Leman, "Nearest-neighbor matchup effects: accounting for team matchups for predicting March Madness"

In this paper, Hoegh (et al) augment a standard strength rating-based predictive system with relative adjustments based upon how each team in the matchup has performed in similar past matchups.  So, for example, if a team is playing a very tall, good rebounding team, the model will look at the team's past performances against very tall, good rebounding teams and see if they played particularly well (or particularly poorly) against these sorts of teams in the past, and then apply that adjustment to predicting the current game.
Commentary:  This paper is well-written and presents an interesting and novel idea.  The notion of adjusting a general prediction to account for a particular matchup is at least intuitively appealing, and their approach is straightforward and easily defensible.  There are a couple of interesting issues to think about in their scheme.

First of all, how should you find past games for calculating the matchup adjustment?  Since you're trying to improve a generic strength measurement, I'd argue that ideally you'd like to find past games using some factors that aren't already reflected in the strength metric.  (Otherwise you're likely to just reinforce the information already in the strength metric.)  In this case, the authors find similar past games using a nearest-neighbor distance metric based upon twenty-one of Pomeroy's team statistics.  Some of these statistics do seem orthogonal to the strength metric (e.g., Effective Height, Adjusted Tempo) but others seem as if they would be highly correlated with the strength metric (e.g., FG percentage).  I would be interested to see some feature selection work on these statistics to see what statistics perform best on finding past games.

Second of all, testing this scheme is problematic.  The authors note that the scheme can really only be applied to the Tournament (or at least late in the season) when teams have played enough games that there's a reasonable chance to find similar past matchups.  In this case the authors have tested the scheme using Tournament games but only (if my reading is correct) looking in detail at the 2014 results.  That shows some positive benefits of the scheme, but 65 games is just too small a sample size to draw any conclusions.

Overall, I'm a little dubious that individual matchup effects exist, and that you can detect them and exploit them.  For one thing, if this were true I'd expect to see some obvious evidence of that in conference play, where teams play home-and-home.  For example, you might expect that if Team A has a matchup advantage over Team B that it would outperform expectations in both the home and away against Team B.  I haven't seen any evidence for that sort of pattern.  I've also looked at individual team adjustments a number of times.  For example, you might think that teams have individual home court advantages -- i.e., that Team A has a really good home crowd and does relatively better at home than other teams.  But I've never been able to find individual team adjustments with predictive value.  Sometimes teams do appear to have an unusually good home court advantage -- I recall a season when Denver was greatly outperforming expectations at home for the first part of the season.  But it (almost?) always turns out to be random noise in the data -- Denver's great home performance in the first part of the season evaporated in the second half of the season.
So this paper would have benefited from some more rigorous attempts to verify the existence and value of matchup effects, but it nonetheless presents and interesting idea and approach.
Lo-Hua Yuan, Anthony Liu, Alec Yeh, Aaron Kaufman, Andrew Reece, Peter Bull, Alex Franks, Sherrie Wang, Dmitri Illushin and Luke Bornn, "A mixture-of-modelers approach to forecasting NCAA tournament outcomes."

This paper discusses a number of predictive models created at Harvard for the 2014 Kaggle competition.  The final models included three logistic regressions, a stochastic gradient descent model, and a neural network.  Inputs to the models were team-level statistics from Pomeroy, Sonny Moore, Massey, ESPN and RPI.  The models were also used to build ensemble predictors.
Commentary:  This paper presents a very ordinary, not very interesting approach.  (I suspect that the Kaggle competition was used as an exercise in an undergraduate statistics course and this paper is a write-up of that experience.)  The approach uses standard models (logistic regression, SGD, neural networks) on standard inputs.  The model performances are also unusually bad.  None of the models performed as well as the baseline "predict every game at 50%" model.  Even a very naive model should easily outperform the baseline 0.5 predictor.  That none of these models did suggests very strongly that there is a fundamental underlying problem in this work.

The paper also spends an inordinate amount of time on "data decontamination" -- by which the authors mean you can't use data which includes the Tournament to predict the Tournament.  I realize that many Kaggle participants trying to use canned, off-the-shelf statistics like Pomeroy fell into this trap, but it's frankly a very basic mistake that doesn't warrant a journal publication.  The paper also makes the mistake of trying to train and test using only Tournament data.  The authors acknowledge that there isn't enough data in Tournament games for this approach to work, but persist in using it anyway.
Francisco J. R. Ruiz and Fernando Perez-Cruz, "A generative model for predicting outcomes in college basketball."

This paper extends a model previously used for soccer to NCAA basketball.  Teams have attack and defense coefficients, and the expected score for a team in a game is the attack coefficient of the team multiplied by the defense coefficient of the opponent team.  This basic model is extended first by representing each team as a vector of attack and defense coefficients, and secondly representing conferences as vectors of attack and defense coefficients as well.  The resulting model finished 39th in the 2014 Kaggle competition.  The authors also assess the statistical significance of the results of the 2014 Kaggle competition and conclude that 198 out of the 248 participants are statistically indistinguishable.  This agrees with the similar analysis in the Lopez paper.

Commentary: The approach used in this paper is similar to the one used by Danny Tarlow in the 2009 March Madness contest, although with a more mathematically sophisticated basis.  (Whether that results in better performance is unclear.)  The authors give an intuitively appealing rationale for using vectors of coefficients (instead of a single coefficient) to represent teams: "Each coefficient may represent a particular tactic or strategy, so that teams can be good at defending some tactics but worse at defending others (the same applies for attacking)."  It would have been helpful to have a direct comparison between a model with one coefficient and multiple coefficients to see if this in fact has any value.  Similarly, the idea of explicitly representing conferences has some appeal (although it's hard to imagine what reality that captures) but without some test of the value of that idea it remains simply an interesting notion.  Although the basic ideas of this paper are interesting, the lack of any validation is a major weakness.

Friday, February 20, 2015

Kaggle Competition: From Point Spreads to Win Percentage

The Kaggle Competition asks competitors to estimate the win probabilities for all the possible Tournament games.  But for reasons that I'm sure have nothing to do with gambling, many systems -- including mine -- predict Margin of Victory (MOV) rather than probability of winning.  So how does one convert a predicted MOV to a win probability?

I started by creating a histogram of predicted MOV versus win probability for 31K games in my training set.  (I used predictions from my own system, but you could also do this with the opening or closing Vegas lines.)  I binned at 1 point intervals to get the following graph:

This shows that no team predicted to lose by 18 or more points ever won a game, that teams predicted to win by 0 points won half the time, and that teams predicted to win by 25 or more points won every time.  That seems pretty reasonable, and it's reassuring that the graph crosses zero at about 50 percent.

I could use this data directly to translate from predicted MOV to win probabilities.  If I predict a team is going to win by 10 points I could use this chart to see that its win probability is 83.1% and use that in my Kaggle entry.

There are a couple of minor problems with this.  First, even with 31K games there's some obvious noise in the data, particularly at the tail ends of the ranges.  Second, I'm allowed two Kaggle entries, so I might want to create my second entry by tweaking this curve.  That will be hard to do working with the raw data like this.  For these reasons, I'd like a formula for mapping from predicted MOV to win probability.

A simple solution is to do a linear regression on the middle part of the graph.

The result is a pretty good fit.  This also reveals there's a little bias in my predictions.  If the predictions were perfectly unbiased the constant term in this equation would be 0.50.

However, there's a pretty obvious S shape to this curve that the linear equation is not capturing.  I could fit with higher order polynomial (a fourth order equation fits almost perfectly) but there's good reason to believe that what we're really seeing here is a cumulative normal distribution.  (That's the familiar bell-shaped curve -- if I were to plot this as a difference between the predicted MOV and the actual MOV that's exactly what we'd see.)

So let's try fitting a cumulative normal distribution to the data.

That's a pretty good-looking match.  I did this just by eyeballing the data and picking 0 as the mean and 10 as the standard deviation, but if you want more precision you can do some more complex analysis to get a better fit.  (Not unsurprisingly, this corresponds very closely with my mean bias and RMSE for this season.)

Whether you use a simple linear equation or something more complex, you can now tweak your equation to create a new strategy.  For example, if I tweak my normal distribution to have a standard deviation of only 6, I get a new curve:

The effect of using this curve is to increase the confidence in my picks -- essentially to gamble that I'm going to be right more than I have been in the past. 

Monday, February 16, 2015

Guest Post: Todd Beck (

Today is the second in what I hope will be a series of guest postings.  I started submitting picks this year to The Prediction Tracker, and I became curious about Todd Beck, who runs the site, so I sent him a series of questions and I'm pleased to present his responses!  -- Scott

I grew up in a small town south of Dallas Texas.   I attended Texas A&M University where I got a undergrad degree in Math and a graduate degree in Statistics.    For some reason I had a interest in trying to predict football games even way back when I was a kid.  I used to have a notebook where I guessed at every NFL game for years.  So that life-long interest and the fact that our statistics department had a college football pool got me started in coming up with an automated way to make predictions for any game.

For the 19 years since leaving college I have been a statistician at a hospital/medical school in Chicago, primarily in the research area of Alzheimer's disease and other aging related issues.

Most of the people who get involved in sports predictions are sports fans.  Do you have a rooting interest in any particular teams?  What's your favorite sport?

I was definitely a sports fan growing up.  Being close to Dallas I grew up a fan of the Cowboys (Boo! -- Scott), Rangers, and Mavericks.   Even though I have lived in Chicago for a long time now I have never been able to get into any of there teams and haven't been able to follow baseball or basketball too much, but I remain a diehard Cowboys fan.

My favorites sports to play was always baseball for some reason.   These days I really only follow pro and college football. provides a tremendous service to prediction community, but it seems like it must be a ton of work.  What's your workflow like and how much time do you spend each week on it?  I know you run ads on your site -- does that recoup your costs?  How about your donations button -- does that bring in much support?

Because a majority of the things on are automated it doesn't require a lot of actual hands on time. The most time is always in the first couple of weeks of a season as I look to see which systems are going to continue on and potentially adding any new ones.    I have basketball completely automated so it really takes no time at all once it is running.  Football is a little different because it is a weekly event.  Collecting the data is automated but I still manually run the program each day.  So Monday usually needs a little more time to set up the new week of college games and Tuesday more time to set up for the NFL.

Initially I was never in favor of having ads.  But I started on a free web server but I was soon hitting the bandwidth limits.  So I switched to a different service and before long I started hitting limits there too.  So if I wanted to continue the site I had to get a permanent domain name and start paying for the server.  My traffic has continued to grow ever year and so the ad revenue has grown with it.

Originally with ads plus donations I was making just about the same amount as it was costing me, only now with a better ad service and with a larger audience am I bringing in more.  But it is definitely not enough to quit the day job.

You don't seem to have a predictor or ranking system of your own.  Do you use the information you gather on yourself?  How successful has that been?

I do not have any basketball systems of my own but I do several for football.  My completely original system is the PerformanZ Ratings.   But I also maintain Elo and Pythagorean ratings that I have modified for football.  Plus there are all of the regression based rankings: Least squares regression, least absolute value regression and logistic regression.  After so many years of doing this I still maintain that a simple least squares model is as good as anything out there.  There is a scoring efficiency algorithm that I learned a long time from a book in my public library written by a professional gambler.  And also a simple points per game comparison which doesn't do that bad.

But in some ways I guess you could say that the 'system' that I am most known for is the composite mean.  Calculating the mean or median of all, or most, of the ratings, especially when you have 70+ of them, is always going to end up performing better than most individual systems.

I don't do any real gambling, any simulated gambling I have experimented with has almost always been using the system average. On my blog page I generally post probabilities (for football) that are based off the system average.  Tracking the top picks of the week over the season almost always gives results in the 55-60% range for at least one of pro or college football.  The NFL picks have been 62% each of the last two years.  A terrible season pops up every few years for some reason I haven't figured out yet. Like this past year the NCAA picks were only 35%.  I experimented with the same idea in college basketball last season and had huge success in the first half of the season before giving it all back once conference play started.

Friday, February 13, 2015

Machine Madness Update

Monte McNair has kindly offered to host this year's Machine Madness contest on the Ultimate Bracket Challenge website.  If you're thinking about competing, head over there to create an account and familiarize yourself with the site!

Using Ultimate Bracket Challenge will also give us the opportunity to modify the scoring.  The "traditional" scoring approach of 1-2-4-8-16-32 is heavily weighted to the late rounds.  In some ways that can be good -- previous MM contests have gone down to the last game -- but in some ways it's bad -- you can be out of the contest fairly early.  The Kaggle Competition is at the other end of the spectrum.  All games are treated equally, and the competitors predict all possible games, so no one is ever "out of it".  That's interesting, too, but since Kaggle has that covered I don't want to swing too far in that direction. 

Ultimate Bracket Challenge also has the option of a market-type approach, where points are divided up between the number of people that correctly picked the winner for that game.  That might be an option for us as well.

If you have any thoughts on how to score this year's Machine Madness leave a comment or send me an email.

I'm also thinking about prizes, so if you know of a company who'd like to offer a prize as part of the contest in exchange for the massive marketing value of being mentioned on this blog :-), please let me know that as well!

Wednesday, February 4, 2015

Strength of Schedule & Adjusted Statistics (Part 5)

The method for calculating adjusted statistics I outlined in previous posts turns out to be moderately useful.  It's better for prediction than raw statistics, but not as useful as some other approaches.

One problem is that my approach doesn't explicitly provide a measure for defense against a statistic.  We get an adjusted statistic for (say) 3 pt % that tells us how good a team is relative to other teams for that statistic, but we don't get a measure of how good that team is at defending the 3 pt %.  We can remedy this with a slightly different approach.

In this new approach, we assume that the performance for a team in a particular game is a function of both the team's offensive strength and its opponent's defensive strength.

`S_"ij" = (O_i)/(D_j)`

`S_"ij"` represents the value of the statistic in a game between the offensive team (i) and the defensive team (j).  `O_i` then represents the offensive team's adjusted strength at this statistic, and `D_j` represents the defensive team's adjusted strength at defending this statistic.

As an example of this approach, assume we have the following schedule of games and performances:


This yields six equations for the Offensive and Defensive strength ratings:

`0.43 = O_g / D_s`            `0.30 = O_s / D_g`
`0.35 = O_g / D_b`            `0.23 = O_b / D_g`
`0.28 = O_s / D_b`             `0.26 = O_b / D_s`

and solving these equations yields

`O_g = 0.50`     `D_g = 0.75`
`O_s = 0.40`     `D_s = 0.85`
`O_b = 0.30`     `D_b = 0.70`

In general we'll have many more games than teams, and there won't be an exact solution for this system of equations.  Solving the system can be done by an iterative approach similar to the one described for the previous system; assign some values to O and D and alternately recompute O and D until the values converge.

This approach is similar to a ranking method described in [Govan 2009].  (You can get this paper from the Paper Archive.)  Govan describes the conditions under which the ratings will converge (hint: they will for NCAA basketball after a few hundred games) and  a method of calculation.  Govan's approach isn't exactly the same as described here, I leave it to the reader to work out the differences and how to address them.

In the next posting I'll talk about yet another alternative model for adjusted statistics and how that can be calculated.

Monday, February 2, 2015

Kaggle Competition & Machine Madness

The second annual Kaggle competition has been announced:
Kaggle & HP are hosting our second annual NCAA March Mania prediction competition. All methods, flavors, credos, and styles of prediction are legal, guessing included. Here's how it works:
  • You're given team-level data going back three decades, and are welcome to use whatever data you want
  • We have a historical "warm up" stage so you can get used to the data format and have some motivation to work on the problem before March
  • On Selection Sunday (when teams are announced), we wipe the leaderboard and collect 2015 predictions
  • Entries are judged on the log loss of predicted probabilities for every possible matchup in the tournament.
  • We update scores as the games unfold
Have fun!
The home page for the competition is here.

I'll also remind everyone that I'll be hosting the annual Machine Madness Competition. Machine Madness has been going on since 2010. Unlike the Kaggle competition, you make one entry into a Yahoo pool and compete under the usual pool rules. So the strategy is very different from the Kaggle competition.  And you're probably not going to win either competition, so it's worth it to take a few extra minutes and join the Machine Madness competition as well :-)

Sunday, February 1, 2015

Strength of Schedule & Adjusted Statistics (Part 4)

In the first three parts of this series, I've explained the problem with comparing statistics between teams (because they don't all play the same opponents), described how to measure the strength of a team's schedule, and then how to use that to calculate an "adjusted" statistic that provides a more accurate comparison between teams.  In this post, I'll discuss how to implement this method.

Many metrics of this sort that use symmetric statistics (such as winning percentage) can be expressed as a system of linear equations, which can be solved exactly by easily available software.  However, for asymmetric statistics, the strength of schedule ends up in the denominator of the adjusted statistic, resulting in a non-linear equation, which cannot generally be solved exactly.  We must use a numerical method, such as the iterative method.

The basic idea of the iterative method is to repeatedly recalculate the adjusted statistic using the current values of the adjusted statistic until the numbers stop changing (i.e., the solution converges on an answer).  If the new value of the adjusted statistic is `S'_"adj"` and the old values are `S_"adj"` then the iterative equation is:

`S'_"adj"(T) = (S(T))/ ((1/(n*m)) sum_(i="opponents"(T))^n sum_(j="opponents"(i))^m (j ne T) S_"adj"(j))`

(If this is confusing, see the example in my previous post.)  So now let's talk about how we would implement this iterative calculation.

The term in the numerator of this equation -- `S(T)` -- is a constant.  It's just the unadjusted value of the statistic S for team T.  The work of the calculation is in the denominator, which represents the average adjusted statistic for all of team T's opponents' opponents.  So how do we calculate this?  The first step is to figure out team T's opponents' opponents.

It turns out that this problem is related to graph theory.  We can represent a season of games as a graph where each node is a team and each edge represents a game.  Suppose that we have a league with four teams: Gold, Blue, Silver and Green.  Each of these will be represented as a node in our graph.  Gold has played Blue, Blue has also played Silver, and Silver has played Green twice.  Each of these games is represented as an edge:

To find all the opponents of a team, we start at the node for the team and traverse one step along every path leading out of the node.  Starting at Gold, our only choice is to walk to Blue -- and that is Gold's only opponent so far.  If we want to find the opponents' opponents, we walk two steps.  For Gold, the first step takes us to Blue and then we have two choices for our second step:  to Silver or back to Gold.  And indeed, Blue (Gold's only opponent) two opponents have been Gold and Silver. 

In graph theory, a graph of the sort pictured above is represented by an adjacency matrix.  For a graph with N nodes, this is simply an NxN matrix where the (i,j) entry is 1 if there is an edge between those two nodes, and a 0 otherwise.  If we let Gold=0, Blue=1, Silver=2 and Green = 3, the adjacency matrix for the graph above is:


Silver has played Green twice, so we put a 2 in those spots.  Notice that this matrix is symmetric (i.e., `a_"ij" = a_"ji"`).  For our problem, the matrix will always be symmetric, because if Team A has played Team B, then naturally Team B has played Team A.

An interesting thing about representing a graph this way is that if we multiply the adjacency matrix by itself, the result is a matrix that tells us how many two step paths connect each pair of nodes.  (And if we multiply it three times, it gives us the three step paths, and so on.)  So for our example:

`[[0,1,0,0],[1,0,1,0],[0,1,0,2],[0,0,2,0]] * [[0,1,0,0],[1,0,1,0],[0,1,0,2],[0,0,2,0]] =[[1,0,1,0],[0,2,0,2],[1,0,5,0],[0,2,0,4]]`

The first row of the result shows us that Gold has had itself and Blue as it's two opponents' opponents, just as we saw above.  If you look at the second row, you'll see that Blue has had Green as an opponents' opponent twice, because there are two edges from Silver to Green.

This provides a straightforward way to calculate any team's opponents' opponents.  We keep a matrix showing which teams have played each other, and then multiply it by itself to get the opponents' opponents.

If you look back at our equation for the adjusted statistic, you'll see that there's a slight twist in the denominator.  We're ignoring any case where a team's opponents' opponent is the team itself:

`S'_"adj"(T) = (S(T))/ ((1/(n*m)) sum_(i="opponents"(T))^n sum_(j="opponents"(i))^m (j ne T) S_"adj"(j))`

This is easy to do in our matrix formulation.  These are the diagonal entries in the answer matrix, and we can ignore them by setting them to zero.  For our example above:


I'll going to call this the Modified Opponents' Opponents Matrix (MOOM).

The next step in our calculation is to sum up the adjusted statistic for all the opponents' opponents.  It turns out there's an easy way to do this.  We simply multiply the MOOM by a matrix listing the adjusted statistics of all the teams.  So if these are the unadjusted statistic values that we will use as our first guess:

Team UnAdj

Then our matrix multiplication is:

`[[0,0,1,0],[0,0,0,2],[1,0,0,0],[0,2,0,0]]*[[0.34],[0.24],[0.30],[0.15]] = [[0.30],[0.30],[0.34],[0.48]]`

The resulting matrix provides the sum of the opponents' opponents adjusted statistics for each team.  For example, the first row (0.30) is the sum of all of Gold's opponents' opponents adjusted statistic.  Since Gold's only opponent's opponent was Silver, this is just Silver's adjusted statistic.

Now we have to average each of these values by dividing them by the number of opponents' opponents.  There's an easy way to get this number as well.  We multiply the MOOM by a column of ones:

`[[0,0,1,0],[0,0,0,2],[1,0,0,0],[0,2,0,0]]*[[1],[1],[1],[1]] = [[1],[2],[1],[2]]`

The resulting matrix provides the number of opponents' opponents for each team.  Now dividing each of the sums by each of the counts gives us the Strength of Schedule for each team (the denominator in the original equation).

`[[0.30/1],[0.30/2],[0.34/1],[0.48/2]] = [[0.30],[0.15],[0.34],[0.24]]`

And we can use that to calculate new values for the adjusted statistic:


We then repeat that process until the adjusted statistics converge.

(NOTE:  As it happens, the above example doesn't actually converge.  As noted in the previous post, this method seems to work for NCAA basketball after about 500 games in the season, but I have not proven that it will converge in all cases.)

Finally, notice that the MOOM and the count of the opponents' opponents do not change during the iteration process.  The only things we need to calculate during the iteration process is the new sum of the adjusted statistics, the average, and then the new adjusted statistics.

In the next post I'll talk about how well this works in practice and some alternative approaches.