Monday, March 23, 2015

Machine Madness Update

I'm just back from watching UCLA win two games in Louisville and am not yet caught up, but here's a quick update from the Machine Madness side of the competition.

Perhaps not surprisingly, Monte McNair leads the competition with 56 points, and I suspect he will win if Kentucky wins out.  (Monte's in the Top Ten in the Kaggle competition right now.)  BlueFool is in second place just a point behind Monte and is the only competitor with Duke as champion, so she'll likely win if that happens.  Jason Sumpter is in third place but has Kentucky as champion, so he'll need some breaks to beat out Monte -- specifically, I think he needs Xavier to beat Wisconsin.

Nothing But Neural Net (great name, btw) is the only competitor with Wisconsin as champion.  Likewise I'm the only competitor with Arizona, so obviously we'll be rooting for those teams to win out. 

Thursday, March 19, 2015

Good Luck!

I'm off in Louisville to watch the first round games, so updating the blog is difficult, but I wanted to wish good luck to everyone in both the Kaggle and the Machine Madness contests.  Enjoy the games!

Monday, March 9, 2015

The Silliness of Simulation

When the NCAA Tournament rolls around there's an inevitable flurry of blog posts and news articles about some fellow or another who has predicted the Tournament outcome by running a Tournament simulation a million times!  Now that's impressive!

Or maybe not.

These simulations are nothing more than taking someone's win probabilities (usually Pomeroy or Sagarin, since these are available with little effort) and then rolling a die against those probabilities for each of the 63 games.  On a modern computer you can do this a million times in a second with no real strain.

More importantly, though, does running this sort of simulation a million times actually reveal anything interesting?

Imagine that we decided to do this for just the title game.  In our little thought experiment, the title game this year has (most improbably) come down to Duke versus Furman, thanks in no small part to Furman's huge upset of the University of Kentucky in their opening round game.

(Furman -- one of the worst teams in the nation, with only 5 wins in the lowly Southern Conference -- has somehow won through to the conference title game and actually does have a chance to get to the Tournament.  If this happens, they'll undoubtedly be the worst 16 seed and matched up against UK in Louisville.  So this is totally a plausible scenario.)

We look up the probability of Duke beating Furman in our table of Jeff Sagarin's strengths (or Ken Pomeroy, whomever it was)  and we see that Duke is favored to win that game 87% of the time.  So now we're ready to run our simulation.

We run our simulation a million times.  No, wait.  We want to be as accurate as possible for the Championship game, so we run it ten million times.

(We have plenty of time to do this while Jim Nantz narrates a twenty minute piece on the unlikely Furman Paladins and their quixotic quest to win the National Championship.  This includes a long interview with a frankly baffled Coach Calipari.)

We anxiously watch the results tally as our simulation progresses.  (Or rather we don't, because the whole thing finishes before we can blink, but I'm using some dramatic license here.)  Finally our simulation is complete, and we proudly announce that in ten million simulated games, Duke won 8,700,012 of the games!  Whoo hoo!

But wait.

The sharp-eyed amongst you might have noticed that Duke's 8,700,012 wins out of 10,000,000 is almost the same percentage as the winning probability that we originally borrowed from Ken Pomeroy.  (Or Jeff Sagarin, whomever it was.)  Well, no kidding.  It had better be, or our random number generator is seriously broken.

Welcome to the Law of Large Numbers.  To quote Wikipedia:  "[T]he average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed."  The more times we run this "simulation" the closer we'll get to exactly 87%.
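
Here's a minimal sketch of the thought experiment in Python. The 0.87 is the made-up Duke-over-Furman probability from above; the function name and the seed are just my choices for illustration.

```python
import random

def simulate_title_game(p_duke, n_trials, seed=2015):
    """Roll a die against p_duke n_trials times; return Duke's win fraction."""
    rng = random.Random(seed)
    duke_wins = sum(1 for _ in range(n_trials) if rng.random() < p_duke)
    return duke_wins / n_trials

# The simulated fraction drifts back toward the 0.87 we started with.
for n in (10, 1_000, 100_000):
    print(n, simulate_title_game(0.87, n))
```

The only input is the probability, so the only thing the simulation can hand back is (approximately) that same probability.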

This is why the whole notion of "simulating" the tournament this way is silly.  The point of doing a large number of trials (simulations) is to reveal the expected value.  But we already know the expected value:  it's the winning probability we stole from Jeff Sagarin.  (Or Ken Pomeroy, whomever it was.)  It's just a waste of perfectly good random numbers to get us back to the place we started.

To be fair, there's one reason that it makes some sense to do this for the entire Tournament.  If for some reason you want to know before the Tournament the chances of a particular team winning the whole thing, then this sort of simulation is a feasible way to calculate that result.  (Or if you're Ed Feng you create this thing.)  And if that's your goal, I give you a pass.
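
To make that concrete, here's a rough sketch for a four-team bracket. The teams and the pairwise probabilities are invented, and the bracket layout (A plays D, B plays C) is my assumption. It estimates each team's title chances by simulation and, for team A, checks the estimate against the direct calculation.

```python
import random

# Hypothetical four-team bracket; P[(a, b)] is the chance that a beats b.
# These numbers are made up for illustration.
P = {
    ("A", "B"): 0.70, ("A", "C"): 0.60, ("A", "D"): 0.80,
    ("B", "C"): 0.45, ("B", "D"): 0.65, ("C", "D"): 0.75,
}

def win_prob(a, b):
    """Probability that a beats b, looking the pair up in either order."""
    return P[(a, b)] if (a, b) in P else 1.0 - P[(b, a)]

def play(a, b, rng):
    """Simulate one game and return the winner."""
    return a if rng.random() < win_prob(a, b) else b

def simulate_titles(n_trials, seed=1):
    """Play the bracket (A-D, B-C, then the final) n_trials times."""
    rng = random.Random(seed)
    titles = {team: 0 for team in "ABCD"}
    for _ in range(n_trials):
        champ = play(play("A", "D", rng), play("B", "C", rng), rng)
        titles[champ] += 1
    return {team: count / n_trials for team, count in titles.items()}

def exact_title_prob_A():
    """The same answer for team A by direct calculation, no dice needed."""
    reach_final = win_prob("A", "D")
    beat_other = (win_prob("B", "C") * win_prob("A", "B")
                  + win_prob("C", "B") * win_prob("A", "C"))
    return reach_final * beat_other

print(simulate_titles(100_000)["A"], exact_title_prob_A())
```

With four teams the direct calculation is easy. With a full field the bookkeeping grows, which is why simulation is a reasonable shortcut for this particular question.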

On the other hand, if you're doing all this simulation to fill out a bracket for (say) the Machine Madness competition, then it makes more sense to run your simulation for a small number of trials.  The number of trials is essentially a sliding control that runs from Very Random (1 trial) at one end to Very Boring (1 billion trials) at the other.  Arguably it is good meta-strategy in pool competitions not to predict the favorite in every game, so by lowering the number of trials you can inject some randomness into your entry.  (I don't think this is necessarily a good approach, but at least it is rational.)
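
A small sketch of the sliding control, assuming you pick whichever team wins the majority of the simulated games. The 65% favorite and the function names are hypothetical.

```python
import random

def pick_by_simulation(p_favorite, n_trials, rng):
    """Pick whichever team wins the most of n_trials simulated games."""
    favorite_wins = sum(1 for _ in range(n_trials) if rng.random() < p_favorite)
    return "favorite" if 2 * favorite_wins > n_trials else "underdog"

def upset_pick_rate(p_favorite, n_trials, n_brackets=2_000, seed=7):
    """Fraction of brackets in which this procedure picks the underdog."""
    rng = random.Random(seed)
    picks = [pick_by_simulation(p_favorite, n_trials, rng) for _ in range(n_brackets)]
    return picks.count("underdog") / n_brackets

# Odd trial counts avoid ties. A 65% favorite is an assumed example.
for n in (1, 11, 101):
    print(n, upset_pick_rate(0.65, n))
```

With one trial the underdog gets picked about as often as it would actually win; with a hundred or so trials it almost never does.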

Now I'm off to root for Furman in the Southern Conference title game.

Wednesday, March 4, 2015

So What About Me?

I've recently put up a few posts about the Kaggle competition including one about reasonable limits to performance in the contest.  So it's natural to wonder how I'm doing / have done in the Kaggle competition.

Fair enough.

Last year, my entry ended up finishing at 60th on the Kaggle leaderboard, with a score of 0.57.  At one point that was exactly at the median benchmark, but apparently the post-contest cleanup of DQed entries changed that slightly.  2014 wasn't a particularly good year for my predictor.   Here are the scores for the other seasons since 2009:


2014 was my worst year since 2011.  (2011 was the Chinese Year of the Upset, with a Final Four of a #3, #4, #8 and #11 seed.)  Ironically, I won the Machine Madness contest in 2011 because my strategy in that contest includes predicting some upsets; this led to correctly predicting Connecticut as the champion.

My predictor isn't intended specifically for the Tournament.  It's optimized for predicting Margin of Victory (MOV) for all NCAA games.  This includes the Tournament, but those games are such a small fraction of the overall set of games that they don't particularly influence the model.  There are some things I could do to (hypothetically) improve the performance of my predictor for the Kaggle competition.  For one thing, I could build a model that tries to predict win percentages directly, rather than translating from predicted MOV to win percentage.  Secondly, since my underlying model is a linear regression, I implicitly optimize RMSE.  I think it's likely that a model that optimizes on mean absolute error would do better1 but I haven't yet found a machine learning approach that can create a model optimized on mean absolute error with performance equaling linear regression.
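
To illustrate the difference between the two error measures, here's a toy sketch with invented margins. It tries every whole-number prediction and reports which one each metric prefers; it says nothing about how my actual predictor works.

```python
from statistics import mean, median

# Hypothetical margins of victory, including one blowout.
margins = [2, 3, 5, 6, 40]

def rmse(pred, values):
    """Root mean squared error of a single constant prediction."""
    return (sum((v - pred) ** 2 for v in values) / len(values)) ** 0.5

def mae(pred, values):
    """Mean absolute error of a single constant prediction."""
    return sum(abs(v - pred) for v in values) / len(values)

# Try every whole-number prediction and see which one each metric prefers.
candidates = range(0, 41)
best_for_rmse = min(candidates, key=lambda c: rmse(c, margins))
best_for_mae = min(candidates, key=lambda c: mae(c, margins))

print(best_for_rmse, mean(margins))    # RMSE is pulled toward the blowout
print(best_for_mae, median(margins))   # MAE sits at the middle value
```

The single 40-point game drags the RMSE-optimal prediction well above the typical margin, while the MAE-optimal prediction ignores how extreme the blowout is.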

I haven't put much effort into building a "Tournament optimized" predictor because (as I have pointed out previously) there is a large random element to the Tournament performance.  Any small gains I might make by building a Tournament-specific model would be swamped by the random fluctuations in the actual outcomes.

1 I say this because RMSE weights outliers more heavily.  Although there are a few matchups in the Tournament between teams with very different strengths (i.e., the 1-16 and 2-15 matchups in particular) in general you might suppose that there are fewer matchups of this sort than in the regular season, and that being slightly more wrong on these matchups won't hurt you much if you're also slightly more correct on the rest of the Tournament games.  That's just speculation on my part, though.

Monday, March 2, 2015

Reasonable Kaggle Performance

The first stage of the Kaggle competition involves Kagglers testing out their models against data from the past few basketball seasons, and these scores appear on the first stage leaderboard.  Invariably new Kagglers make some fundamental mistakes and end up submitting entries with unreasonably good performance.  The administrators of the contest have taken to removing these entries to avoid discouraging other competitors.  The line for removing entries is somewhat fuzzy, and it begs the question1 "What is a reasonable long-term performance for a Tournament predictor?" There are probably many ways to answer this question,2 but here's one approach that I think is reasonable:  Calculate the performance of the best possible predictor over an infinite number of Tournaments.

I am reminded at this point of an old joke.
A man is sitting in a bar complaining to his friend -- who happens to be a physicist -- about his awful luck at the racing track, and wishing he had some better way to know what horse was going to win each race.  

"Well, that strikes me as a rather simple physics problem," his friend says.  "I'm sure I could build a model to predict the outcome."

"Really?" says the man, visibly excited.  "That's fantastic.  We'll both get rich!"

So the physicist goes off to build his model.  After a week, the man has still heard nothing, so he calls his friend.  "How are you doing on the model?" he asks.

"Well," says the physicist.  "I admit that it is turning out to be a bit more complicated than I imagined.  But I'm very close."

"Great," says the man.  "I can't wait!"

But another week goes by and the man hears nothing, so he calls again.

"Don't bother me," snaps the physicist.  "I've been working on this day and night.  I'm very close to a breakthrough!"

So the man leaves his friend alone.  Weeks pass, when suddenly the man is awakened in the middle of the night by a furious pounding on his front door.  He opens the door and sees his friend the physicist.  He looks terrible -- gaunt and strained, his hair a mess -- and he is clutching a sheaf of crumpled papers.  "I have it!" he shouts as the door opens.  "With this model we can predict the winner of any horse race!"

The man's face lights up.  "I can't believe you did it," he says.  "Tell me how it works."

"First of all," says the physicist, "we assume that the horses are perfect spheres racing in a vacuum..."

Like the physicist, we face a couple of difficulties.  For one thing, we don't have the best possible predictor.  For another, we don't have an infinite set of Tournaments.  No matter, we shall push on.

We don't have the best possible predictor (or even know what its performance would be) but we do have some data from the best known predictors and we can use that as a substitute.  The Vegas opening line is generally acknowledged to be the best known predictor (although a few predictors do manage to consistently beat the closing line, albeit by small margins).  The Vegas opening line predicts around 74% of the games correctly "straight up" (which is what the Kaggle contest requires). I'm personally dubious that anyone can improve upon this figure significantly3 but for the sake of this analysis let's assume that the best possible predictor can predict an average game4 correctly 80% of the time.

We also don't have an infinite number of Tournaments to predict, but we can assume that the average score on an infinite number of Tournament games will tend towards the score on an average Tournament game.  For the log-loss scoring function, the best score in the long run comes from predicting our actual confidence (the 80% from above).  If we predict an infinite number of games at 80% and get 80% of those games correct, our score is:

`-(0.80*log(0.80) + (1-0.80)*log(1-0.80))`

which turns out (fortuitously) to be just about 0.50.  (If we use a performance of 74%, the score is about 0.57.)
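
Log-loss is reported as the negative of the average log-likelihood (natural log), so lower is better. Here's the arithmetic as a small Python sketch; the function name is mine.

```python
from math import log

def expected_log_loss(p):
    """Long-run log-loss when we predict p and are right a fraction p of the time."""
    return -(p * log(p) + (1 - p) * log(1 - p))

for p in (0.80, 0.74):
    print(p, round(expected_log_loss(p), 3))
```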

This analysis suggests that the theoretical best score we can expect predicting a large number of Tournament games is around 0.50 (and probably closer to 0.57).  This agrees well with last year's results -- the winner had a score of about 0.52 and the median score was about 0.58.

As far as "administrative removal" goes, there are 252 scored games in the Kaggle stage one test set.  That's not an infinite set of games, but it is enough to exert a strong regression towards the mean.  The Kaggle administrators are probably justified in removing any entry with a score below 0.45.
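
For a rough sense of how much a score can wander over 252 games, here's a back-of-the-envelope sketch. It assumes every game is predicted at the same confidence p, is won with probability p, and is independent of the others, none of which is exactly true of real Tournament games.

```python
from math import log, sqrt

def score_mean_and_spread(p, n_games):
    """Mean and standard deviation of the average log-loss over n_games,
    assuming every game is predicted at p and won with probability p."""
    loss_right = -log(p)        # loss when the pick comes through
    loss_wrong = -log(1 - p)    # loss when it doesn't
    mean = p * loss_right + (1 - p) * loss_wrong
    # Variance of a two-valued loss: p * (1 - p) * (gap between the values)^2
    variance = p * (1 - p) * (loss_wrong - loss_right) ** 2
    return mean, sqrt(variance / n_games)

for p in (0.80, 0.74):
    mean, sd = score_mean_and_spread(p, 252)
    print(p, round(mean, 3), round(sd, 3), round((mean - 0.45) / sd, 1))
```

Under these assumptions a score of 0.45 sits roughly 1.4 standard deviations below the expected score for an 80% predictor, and more than 4 below it for a 74% predictor.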

On a practical level, if your predictor is performing significantly better than about 0.55 for the test data, it strongly suggests that you have a problem.  The most likely problems are that you are leaking information into your solution or that you are overfitting your model to the test data.

Or, you know, you could be a genius.  That's always a possibility.

1 Yes, I know I'm misusing  "beg the question". 
2 I suspect that a better approach treats the games within the Tournament as a normal distribution and sums over the distribution to find the average performance, but that's rather too much work for me to attempt.
3 If for no other reason than Vegas has a huge financial incentive to improve this number if they could.  
4 The performance of the Vegas line is an average over many games.  Some games (like huge mismatches) the Vegas line predicts better than 74%; some (like very close matchups) it predicts closer to 50%.  I'm making the simplifying assumption here that the average over all the games corresponds to the performance on an average game.  Later on I make the implicit assumption that the distribution of Tournament games is the same as the distribution of games for which we have a Vegas line.  You can quarrel with either of these assumptions if you'd like.  A quick analysis of the Tournament games since 2006 shows that the Vegas line is only right 68% of the time, suggesting that Tournament games may be harder to predict than the average game.

Friday, February 27, 2015

Five Mistakes Kaggle Competitors Should Avoid

#1 -- Don't think you're going to win the competition.

One of the results that came out of the analysis of last year's contest is that the winner was essentially random:  at least the top 200 entries could have credibly won the contest.  Why?  Evidence from the best predictors suggests that there are about 8 points or so of unpredictability in college basketball games.  That's a lot of randomness.  Last year, 32 of the 64 games in the Tournament were decided by 8 points or fewer.  So even if you have the most accurate predictor in the contest, you're almost certain to be beaten by someone who made a worse prediction and got lucky when it came true.  It's the same reason why the ESPN pool is usually won by an octopus or someone who picked based on mascot fashions.  On the other hand, maybe this year you'll be the guy who gets lucky.  It could happen.
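
If you'd like to see the effect for yourself, here's a toy simulation. Everything in it is an assumption of mine: one entry knows the true win probabilities, the other entries are that truth plus random noise, and the field size, noise level and number of games are invented.

```python
import random
from math import log

def log_loss(preds, outcomes):
    """Average log-loss of predictions against True/False outcomes."""
    return -sum(log(p if won else 1 - p) for p, won in zip(preds, outcomes)) / len(preds)

def true_model_win_rate(n_entries=30, n_games=63, n_contests=300, noise=0.10, seed=3):
    """How often does the entry that knows the true probabilities finish first?"""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_contests):
        truth = [rng.uniform(0.5, 0.95) for _ in range(n_games)]
        outcomes = [rng.random() < p for p in truth]
        best_score = log_loss(truth, outcomes)
        beaten = False
        for _ in range(n_entries - 1):
            # A rival is the truth plus noise, clipped away from 0 and 1.
            rival = [min(0.99, max(0.01, p + rng.gauss(0, noise))) for p in truth]
            if log_loss(rival, outcomes) <= best_score:
                beaten = True
                break
        if not beaten:
            wins += 1
    return wins / n_contests

print(true_model_win_rate())
```

Every rival here is strictly worse on average than the entry with the true probabilities, yet over a single Tournament's worth of games one of them usually comes out ahead.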

#2 -- Don't use the data from before the 2008-2009 season.

Isn't it nice of the Kaggle administrators to provide data back to 1985?

If you're not familiar with college basketball, you might not realize that the college game underwent a significant change at the beginning of the 2008-2009 season, when the NCAA moved the three-point line back a foot to 20' 9".  The longer shot changed game strategies, and data from before that season is probably not easily applicable to today's game.

#3 -- The Tournament data is not enough for training or testing.

[Image caption: More like March Sadness]

At 64 games a year, the Tournament just doesn't provide enough data for training or even testing a predictor with any reliability.  You may think you're being smart to build your model specifically for the Tournament -- imagine the advantage you'll have over all the other competitors that don't understand how different the Tournament is from the regular season.  Ha!

But actually you're just overfitting your model.   My own predictor needs about 15,000 training examples for best performance.  Your mileage may vary -- maybe you only need 14,000 training examples -- but there just isn't enough information in the Tournament games alone to do accurate prediction.  Particularly since you shouldn't use the games from before 2008 (see #2).  Of course, you can do all that and you might still win the contest (see #1).

#4 -- Beware of leakage!

Guess what?  It turns out that you can do a really good job of predicting the Tournament if you know the results ahead of time.  Who knew?

Now that's not a big problem in the real contest because (short of psychic powers) no one knows the results ahead of time.  But if the forums from last year and this year are any indication, it's a big problem for many Kagglers as they build and test their predictors.  Knowledge from the games they're testing creeps into the model and results in unrealistically good performance.

[Image caption: A First-Time Kaggle Competitor]

There are three major ways this happens.

The first and most obvious way this happens is that a model is trained and tested on the same data.  In some cases you can get away with doing this -- particularly if you have a lot of data and a model without many degrees of freedom.  But that isn't the case for most of the Kaggle models.  If you train your model on the Tournament data and then test it on the same data (or a subset of the data), it's probably going to perform unreasonably well.  You address this by setting aside the test data so that it's not part of the training data.  For example, you could train on the Tournament data from 2008 to 2013 and then test on the 2014 Tournament.  (Although see #3 above about using just the Tournament data.)  Cross-validation is another, more robust approach to avoiding this problem.
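
As a sketch of what "setting aside the test data" looks like, here's a leave-one-season-out split in Python. The record layout (season, features, margin) and the numbers are hypothetical, and I'm treating seasons as independent of each other.

```python
# Hypothetical game records: (season, features, margin). The layout is assumed.
games = [
    (2009, [1.0], 5), (2010, [0.5], -3), (2011, [2.0], 9),
    (2012, [1.5], 4), (2013, [0.2], -1), (2014, [1.1], 6),
]

def split_by_season(games, test_season):
    """Hold out one season entirely; train on everything else."""
    train = [g for g in games if g[0] != test_season]
    test = [g for g in games if g[0] == test_season]
    return train, test

def leave_one_season_out(games):
    """Yield a (season, train, test) fold for each season in the data."""
    for season in sorted({g[0] for g in games}):
        train, test = split_by_season(games, season)
        yield season, train, test

for season, train, test in leave_one_season_out(games):
    # No game from the held-out season may appear in the training set.
    assert all(g[0] != season for g in train)
    print(season, len(train), len(test))
```

You would fit the model on `train` and score it on `test` inside the loop, then average the scores across folds.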

The second way this often happens is that you unwittingly use input data that contains information about the test games.  A lot of Kagglers use data like Sagarin's ratings without understanding how these statistics are created.  (I'm looking at you, Team Harvard.)  Unless you are careful this can result in information about the test games leaking back into your model.  The most common error is using ratings or statistics from the end of the season to train a model for games earlier in the season.  For example, Sagarin's final ratings are based upon all the games played that season -- including the Tournament games -- so if you use those ratings, they already include information about the Tournament games.  But there are more subtle leaks as well, particularly if you're calculating your own statistics.
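
One way to guard against the end-of-season ratings trap is to look up, for each game, only the rating that was available before that game was played. Here's a minimal sketch; the snapshot dates and rating values are invented, and I'm assuming you have dated snapshots to work from.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical rating snapshots for one team, sorted by date: (published, rating).
snapshots = [
    (date(2014, 1, 6), 80.1),
    (date(2014, 2, 3), 82.4),
    (date(2014, 3, 16), 83.0),   # before the Tournament starts
    (date(2014, 4, 8), 85.2),    # final rating, includes Tournament games
]

def rating_before(snapshots, game_date):
    """Latest rating published strictly before game_date, or None if there is none."""
    dates = [d for d, _ in snapshots]
    i = bisect_left(dates, game_date)   # index of first snapshot on/after game_date
    return snapshots[i - 1][1] if i > 0 else None

print(rating_before(snapshots, date(2014, 3, 20)))  # uses the March 16 snapshot
```

Using the final April rating for that March 20 game would quietly hand your model information about games that hadn't been played yet.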

The third and least obvious way this happens is when you tune your model.  Imagine that you are building your model, taking care to separate out your test data and avoid using tainted ratings.  You test your model on the 2014 Tournament and get mediocre results.  So you tweak one of your model parameters and test your model again, and your results have improved.  That's great!  Or is it?  In fact, what you've done is leak information about the 2014 Tournament back into your model.  (This can also be seen as a type of overfitting to your test data.)  This problem is more difficult to avoid, because tuning is an important part of the model building process.  One hedge is to use robust cross-validation rather than a single test set.  This helps keep your tuning more general.
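
A related hedge (a different technique from cross-validation) is to keep one season that your tuning never touches and score against it exactly once at the end. A toy sketch, where `toy_score` is a made-up stand-in for "train with parameter c and score on season s":

```python
def tune_then_test(seasons, candidates, score):
    """Pick the parameter with the best (lowest) average score on the tuning
    seasons, then score it once on the final season, which tuning never saw."""
    *tuning, final = sorted(seasons)
    best = min(candidates, key=lambda c: sum(score(c, s) for s in tuning) / len(tuning))
    return best, score(best, final)

# Toy stand-in for training and scoring a model.
# The numbers are invented purely so the example runs.
def toy_score(c, season):
    sweet_spot = {2011: 0.3, 2012: 0.5, 2013: 0.4, 2014: 0.6}[season]
    return 0.55 + (c - sweet_spot) ** 2

best, final_score = tune_then_test([2011, 2012, 2013, 2014], [0.2, 0.4, 0.6], toy_score)
print(best, round(final_score, 3))
```

The catch is discipline: the moment you go back and re-tune because the final number disappointed you, that season has leaked too.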

How can you tell when you're suffering from leakage?  Your performance can provide an indicator.  Last year's winner had a log-loss score of 0.52, and the median score was around 0.58.  If your predictor is getting performance significantly better than those numbers, then either (1) you're a genius, or (2) you have a problem.  It's up to you to decide which.

#5 -- A Miscellany of Important Notes
  • College basketball has a significant home court advantage (HCA).  (And yes, there may be a home court advantage in Tournament games!) Your model needs to account for the HCA and how it differs for neutral court and Tournament games.  If your model doesn't distinguish home and away, you've got a problem.
  • College teams change significantly from season to season.  You can't use a team's performance in one season to predict its performance in another season.  (This seems obvious, but last year's Harvard team seems to have made this mistake.  On the other hand, they got a journal publication out of it, so if you're an academic this might work for you too.)
  • Entering your best predictions might not be the best way to win the contest.  Since the contest has a large random element (see #1 above) your best strategy might be to skew your predictions in some way to distinguish yourself from similar entries, i.e., you should think about meta-strategy.

Wednesday, February 25, 2015

JQAS Paper Reviews

Some of you who participated in last year's Kaggle contest may remember that the Journal of Quantitative Analysis in Sports (JQAS) solicited papers on the methods contestants used to predict basketball game outcomes in the NCAA tournament as part of the Kaggle contest.  The next issue of JQAS will contain five papers that resulted from this solicitation and the publisher has made the papers freely downloadable for a month after the issue is published as well as while they are posted in the "Ahead of Print" section on the JQAS site.  (I have also added them to my Papers archive.)  Below are short reviews of the five papers.

Michael J. Lopez  and Gregory J. Matthews, "Building an NCAA men's basketball predictive model and quantifying its success."

Lopez and Matthews won the 2014 Kaggle Contest.  The paper describes their approach as well as an analysis of how "lucky" their win was.

Lopez and Matthews employed a two-pronged prediction approach based upon (1) point spreads and (2) efficiency ratings (from Pomeroy).  They built separate regression models for the point spreads and the efficiency ratings and combined them in a weighted average for their two submissions:  one that weighted point spreads at 75% and efficiency ratings at 25%, and one vice versa.  Since point spreads were only available for the first 32 games, Lopez & Matthews estimated the point spreads for the remaining games using an undescribed "proprietary" model.

Lopez & Matthews also analyzed how "lucky" they were to win the contest.  Their analysis suggests that luck is by far the biggest factor in the competition.  For example, they found that about 80% of the entries could have won the contest under different reasonable outcomes, and the true probability of their entry being the best was less than 2%.

Commentary:  While I appreciate that Lopez & Matthews took the time to write up their experience, I find myself disappointed that this approach ended up winning; it brings nothing novel or interesting to the problem.  Their analysis in the second part of the paper is interesting -- it confirmed my belief that the Kaggle contest was essentially a random lottery amongst the top few hundred entries.

Ajay Andrew Gupta, "A new approach to bracket prediction in the NCAA Men’s Basketball Tournament based on a dual proportion likelihood"

In this paper, Gupta describes his approach to predicting the Tournament games and also does some analysis of bracket strategy under traditional (1, 2, 4, 8, 16, 32) scoring.

Gupta's prediction approach is complex.  It involves team ratings based upon maximum likelihood and what Gupta terms a "dual proportion" model.  I won't attempt to summarize the math here (it requires several appendices in the paper itself to describe); the interested reader should consult the paper.

In the second half of the paper, Gupta addresses how to compose a tournament selection to do well in a traditional bracket competition.  His conclusion is to pick a high-probability upset for one to three late round games.

Commentary:  This paper is poorly written and confusing from start to finish.  I'm frankly very surprised that it was chosen for publication.

One of the major problems is that uninteresting or unoriginal ideas are inflated with confusing descriptions.  For example, the paper presents the "dual proportion model" as a novel approach.  So what is the "dual proportion model"?  "Each of the two teams in a game has a probability of winning the game, and these must add up to 100%."  That's hardly worthy of mention, much less to be held up as a new insight.

Another major problem is the long list of unsupported assumptions throughout the model:  a scaling parameter `beta` "that applies to big wins, meaning at least 10 points" (Why scale big wins?  Why is 10 points a big win?), "However, [log-likelihood's] shape is better for bracket prediction." (Why is it better?)  "Some wins are more indicative of future wins than others are."  (Really?  What wins?  Why?)  "Point differences can also be deceptive..."  (What is your proof of this?)  "The strength-of-schedule adjustment works by reducing the strengths of the non-tournament teams in a weak conference."  (Why?)  There are many more examples.  None of these various assumptions are given any more than a vague explanation, and worse, none are tested in any way.  The result is a pastiche of unexplained, untested ideas that likely have little or no value.

One final nitpick is that this paper doesn't seem to have anything to do with the Kaggle competition, and all of its analysis is based upon the more standard pool scoring methods.

Andrew Hoegh, Marcos Carzolio, Ian Crandell, Xinran Hu, Lucas Roberts, Yuhyun Song and Scotland C. Leman, "Nearest-neighbor matchup effects: accounting for team matchups for predicting March Madness"

In this paper, Hoegh et al. augment a standard strength rating-based predictive system with relative adjustments based upon how each team in the matchup has performed in similar past matchups.  So, for example, if a team is playing a very tall, good rebounding team, the model will look at the team's past performances against very tall, good rebounding teams and see if they played particularly well (or particularly poorly) against these sorts of teams in the past, and then apply that adjustment to predicting the current game.
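
Here's my own rough sketch of the idea, not the authors' implementation. It assumes each past game is stored with the opponent's (standardized) stats plus the predicted and actual margins, and that the adjustment is simply the average over- or under-performance in the k most similar past games. The stats and numbers are invented.

```python
from math import dist

def matchup_adjustment(past_games, opponent_stats, k=3):
    """Average over/under-performance in the k past games whose opponents
    looked most like the upcoming opponent.

    past_games: list of (opponent_stats, predicted_margin, actual_margin).
    """
    nearest = sorted(past_games, key=lambda g: dist(g[0], opponent_stats))[:k]
    residuals = [actual - predicted for _, predicted, actual in nearest]
    return sum(residuals) / len(residuals)

# Hypothetical opponent profiles: (effective height, rebounding rate), standardized.
past_games = [
    ((1.2, 1.0), 4.0, 9.0),    # tall, good rebounding opponent: beat expectations
    ((1.0, 1.3), -2.0, 3.0),
    ((1.1, 0.9), 6.0, 8.0),
    ((-1.0, -0.8), 5.0, 1.0),  # small opponent: fell short
    ((-1.2, -1.1), 3.0, 0.0),
]

base_prediction = 2.5                 # from the generic strength rating
upcoming_opponent = (1.1, 1.1)        # another tall, good rebounding team
print(base_prediction + matchup_adjustment(past_games, upcoming_opponent))
```

The three nearest past games are the ones against tall, good rebounding opponents, so the generic prediction gets nudged upward by the average amount the team beat expectations in those games.
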
Commentary:  This paper is well-written and presents an interesting and novel idea.  The notion of adjusting a general prediction to account for a particular matchup is at least intuitively appealing, and their approach is straightforward and easily defensible.  There are a couple of interesting issues to think about in their scheme.

First of all, how should you find past games for calculating the matchup adjustment?  Since you're trying to improve a generic strength measurement, I'd argue that ideally you'd like to find past games using some factors that aren't already reflected in the strength metric.  (Otherwise you're likely to just reinforce the information already in the strength metric.)  In this case, the authors find similar past games using a nearest-neighbor distance metric based upon twenty-one of Pomeroy's team statistics.  Some of these statistics do seem orthogonal to the strength metric (e.g., Effective Height, Adjusted Tempo) but others seem as if they would be highly correlated with the strength metric (e.g., FG percentage).  I would be interested to see some feature selection work on these statistics to see what statistics perform best on finding past games.

Second of all, testing this scheme is problematic.  The authors note that the scheme can really only be applied to the Tournament (or at least late in the season) when teams have played enough games that there's a reasonable chance to find similar past matchups.  In this case the authors have tested the scheme using Tournament games but only (if my reading is correct) looking in detail at the 2014 results.  That shows some positive benefits of the scheme, but 65 games is just too small a sample size to draw any conclusions.

Overall, I'm a little dubious that individual matchup effects exist, and that you can detect them and exploit them.  For one thing, if this were true I'd expect to see some obvious evidence of that in conference play, where teams play home-and-home.  For example, you might expect that if Team A has a matchup advantage over Team B that it would outperform expectations in both the home and away games against Team B.  I haven't seen any evidence for that sort of pattern.  I've also looked at individual team adjustments a number of times.  For example, you might think that teams have individual home court advantages -- i.e., that Team A has a really good home crowd and does relatively better at home than other teams.  But I've never been able to find individual team adjustments with predictive value.  Sometimes teams do appear to have an unusually good home court advantage -- I recall a season when Denver was greatly outperforming expectations at home for the first part of the season.  But it (almost?) always turns out to be random noise in the data -- Denver's great home performance in the first part of the season evaporated in the second half of the season.

So this paper would have benefited from some more rigorous attempts to verify the existence and value of matchup effects, but it nonetheless presents an interesting idea and approach.

Lo-Hua Yuan, Anthony Liu, Alec Yeh, Aaron Kaufman, Andrew Reece, Peter Bull, Alex Franks, Sherrie Wang, Dmitri Illushin and Luke Bornn, "A mixture-of-modelers approach to forecasting NCAA tournament outcomes."

This paper discusses a number of predictive models created at Harvard for the 2014 Kaggle competition.  The final models included three logistic regressions, a stochastic gradient descent model, and a neural network.  Inputs to the models were team-level statistics from Pomeroy, Sonny Moore, Massey, ESPN and RPI.  The models were also used to build ensemble predictors.

Commentary:  This paper presents a very ordinary, not very interesting approach.  (I suspect that the Kaggle competition was used as an exercise in an undergraduate statistics course and this paper is a write-up of that experience.)  The approach uses standard models (logistic regression, SGD, neural networks) on standard inputs.  The model performances are also unusually bad.  None of the models performed as well as the baseline "predict every game at 50%" model.  Even a very naive model should easily outperform the baseline 0.5 predictor.  That none of these models did suggests very strongly that there is a fundamental underlying problem in this work.

The paper also spends an inordinate amount of time on "data decontamination" -- by which the authors mean you can't use data which includes the Tournament to predict the Tournament.  I realize that many Kaggle participants trying to use canned, off-the-shelf statistics like Pomeroy fell into this trap, but it's frankly a very basic mistake that doesn't warrant a journal publication.  The paper also makes the mistake of trying to train and test using only Tournament data.  The authors acknowledge that there isn't enough data in Tournament games for this approach to work, but persist in using it anyway.

Francisco J. R. Ruiz and Fernando Perez-Cruz, "A generative model for predicting outcomes in college basketball."

This paper extends a model previously used for soccer to NCAA basketball.  Teams have attack and defense coefficients, and the expected score for a team in a game is the attack coefficient of the team multiplied by the defense coefficient of the opponent team.  This basic model is extended first by representing each team as a vector of attack and defense coefficients, and secondly representing conferences as vectors of attack and defense coefficients as well.  The resulting model finished 39th in the 2014 Kaggle competition.  The authors also assess the statistical significance of the results of the 2014 Kaggle competition and conclude that 198 out of the 248 participants are statistically indistinguishable.  This agrees with the similar analysis in the Lopez paper.
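
As I read it, the core of the model looks something like the sketch below. The coefficients are made up, and I'm assuming the vector version combines components by summing the element-wise products; the conference terms are left out.

```python
def expected_score(attack, opp_defense):
    """Expected points: attack coefficients times the opponent's defense
    coefficients, summed over components (one component is the basic model)."""
    return sum(a * d for a, d in zip(attack, opp_defense))

# Made-up coefficients for two teams; each vector has two components.
team_a = {"attack": [6.0, 4.0], "defense": [5.0, 7.0]}
team_b = {"attack": [5.0, 5.0], "defense": [6.0, 8.0]}

a_points = expected_score(team_a["attack"], team_b["defense"])   # 6*6 + 4*8
b_points = expected_score(team_b["attack"], team_a["defense"])   # 5*5 + 5*7
print(a_points, b_points)
```

With a single component per team this collapses to the basic "attack times defense" model described above.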

Commentary: The approach used in this paper is similar to the one used by Danny Tarlow in the 2009 March Madness contest, although with a more mathematically sophisticated basis.  (Whether that results in better performance is unclear.)  The authors give an intuitively appealing rationale for using vectors of coefficients (instead of a single coefficient) to represent teams: "Each coefficient may represent a particular tactic or strategy, so that teams can be good at defending some tactics but worse at defending others (the same applies for attacking)."  It would have been helpful to have a direct comparison between a model with one coefficient and multiple coefficients to see if this in fact has any value.  Similarly, the idea of explicitly representing conferences has some appeal (although it's hard to imagine what reality that captures) but without some test of the value of that idea it remains simply an interesting notion.  Although the basic ideas of this paper are interesting, the lack of any validation is a major weakness.