Wednesday, December 12, 2012

Another Tool: Emacs

The tool chain I use for these experiments in predicting NCAA basketball games consists primarily of four tools:  Web Harvest, Emacs, SBCL (an implementation of Common Lisp), and RapidMiner.   I have written previously on this blog about RapidMiner, and today I'm going to touch on Emacs.

Most people use editors in a fairly static way.  They figure out how to do the editing tasks they need as some combination of commands and keystrokes, and that becomes part of their editing knowledge.  For example, suppose you're the sort of typist who regularly transposes letters, writing "the" as "teh" and "and" as "adn".  You'll fairly quickly figure out how to fix that problem -- backspace, backspace and retype, or maybe mark with a mouse, delete and retype -- and that becomes part of your editing knowledge.

A well-designed editor ensures that most of what you need to do is efficient.  The most common tasks have easy keystrokes and so on.  Of course, the editor is designed for some general, idealized user, who may be very like you in some ways and very different from you in other ways.  So it's likely that much of what you do in your favorite editor is efficient, but some of it is very inefficient and repetitive.

Emacs takes a different philosophy -- probably because it grew out of a community of programmers.

I recently read a blog posting saying that the essence of programming is to avoid repetition.  Programmers go to great lengths creating subroutines, libraries and sometimes whole new programming languages just to avoid mindless repetition of some task.  The ideal programmer spends all his time creating unique solutions to unique problems -- everything else is automated.  Emacs captures this same philosophy in an editor.  Emacs is designed so that it can be customized/programmed by the user in powerful and flexible ways, so that the user spends all his time being "maximally" productive.  (Expert Emacs users do so much customization of the editor that a big topic is how to best manage the customizations!)

To return to the above example, if you're a typist who often transposes letters and Emacs is your editor, your response is to customize/program Emacs to take care of your transposing problem more efficiently.  For example, you'd likely define a key to reverse the transposition of the previous two characters, so that you could just hit that key and fix the problem whenever it occurred.

(As it happens, Emacs already has this capability -- it's ctl-t -- but you see the point.)

This philosophy changes the way you use your editor -- it becomes a kind of Swiss Army knife tool for solving all text-related tasks (and often other types of tasks as well).  For example, in the basketball predictor I scrape scores and other statistics from the Web and use them in both Common Lisp and in RapidMiner.  For use in Common Lisp, it is convenient to have the data in a list format like this:
("Miss. Valley St" 40 18 70 4 29 9 14 16 42 70 "2012-12-10")
For use in RapidMiner (and Excel) it is more convenient to have the data in a comma-separated values format like this:
"Miss. Valley St", 40, 18, 70, 4, 29, 9, 14, 16, 42, 70, "2012-12-10"
It's not difficult to translate from one to the other, but it is boring and repetitious.  The natural response in Emacs is to automate the task.
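(As an aside, the same one-line conversion is easy to script outside of Emacs too.  Here's a minimal Python sketch -- the function name and the tokenizing regexp are my own illustration, not part of the tool chain:)

```python
import re

def lisp_line_to_csv(line):
    # Pull out quoted strings and bare numbers, in order, then
    # rejoin them with commas.  Quoted strings stay quoted.
    tokens = re.findall(r'"[^"]*"|[-\d.]+', line)
    return ", ".join(tokens)

print(lisp_line_to_csv('("Miss. Valley St" 40 18 70 4 29 9 14 16 42 70 "2012-12-10")'))
# "Miss. Valley St", 40, 18, 70, 4, 29, 9, 14, 16, 42, 70, "2012-12-10"
```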

Emacs has a variety of ways to do this.  (If you know Emacs, this won't surprise you!)  One of the simplest is a "keyboard macro".  You tell Emacs you want to define a keyboard macro, and then you start editing.  When you tell it you're done, it captures all the editing you did in-between and allows you to repeat it with a single keystroke.  In this case, I would start a keyboard macro, go through all the editing necessary to convert one line of my data file from one format to the other, and then end the macro.  Then I could go to the next line and tell Emacs to "execute the keyboard macro" and -- voila! -- that line would get the same editing.  It takes some practice and thought to create an editing sequence that will do the right thing when it is repeated on the next line, but this turns out to be a very powerful and handy feature.

One of the drawbacks of the keyboard macro is that it disappears when you end your editing session.  So it's not useful per se for a task that you're going to want to repeat another day on a different file.  Fortunately, Emacs provides a way to save a keyboard macro in a format that looks like this:

(fset 'fix-sched
   [escape ?x ?r ?e ?p ?l ?a ?c ?e ?- ?r ?e ?g ?e ?x ?p return ?^ return ?\( ?\" return escape ?< escape ?x ?r ?e ?p ?l ?a ?c ?e ?- ?r ?e ?g ?e ?x ?p return ?  ?* ?, return ?\" ?  ?\" return escape ?< escape ?r ?\" ?+ return return escape ?< escape ?r ?\" ?- return return escape ?< escape ?x ?r ?e ?p ?l ?a ?c ?e ?- ?r ?e ?g ?e ?x ?p return ?$ return ?\) return])
That's exactly what it looks like -- a literal transcription of the keystrokes of the macro.  In this form, it can be saved in your Emacs configuration file so that the next time you start up Emacs it will be available for reuse.

At a more complex level, you can program Emacs using a form of Lisp.  You can use this to create arbitrary functionality.  For example, if Emacs didn't provide a way to save a keyboard macro, you could program that yourself.  This allows you to build functionality that isn't easy to capture in a keyboard macro.  For example, here's an Emacs function I wrote for fixing a certain type of score file:
(defun fix-scores ()
  "Fix the scores from Marsee."
  (interactive "*")
  ;; Reformat the date, then rewrite each "Team1 score1 Team2 score2"
  ;; line into a Lisp list with the date prepended.
  (let ((dt (format-time-string current-date-format-marsee)))
    (replace-regexp
     "^\\([A-Za-z \\&]+[A-Za-z]\\)\\s +\\([0-9]+\\)\\s +\\([A-Za-z \\&]+[A-Za-z]\\)\\s +\\([0-9]+\\).*$"
     (concat "(\"" dt "\" \"\\1\" \\2 \"\\3\" \\4)"))))
Without going into the gory details, you can see that part of this function reformats the date from the score file into a more desirable format, using an Emacs function called "format-time-string".   Emacs Lisp is infinitely powerful, so if you're a good programmer you can extend the Emacs functionality in unlimited ways.

I have more to say on Emacs, but this posting has gotten fairly long so I'll leave further thoughts to another day.

Thursday, December 6, 2012

Some Recent Papers

Reviews of some recent papers on sports prediction.
Forecasting in the NBA and Other Team Sports: Network Effects in Action
Universidade Federal de Minas Gerais
Christos Faloutsos, Carnegie Mellon University
This paper looks at predicting the overall season performance of NBA teams (won-loss record) based upon features having to do with the team's year-to-year composition, such as "team volatility", "team inexperience," and so on.  (The authors call these features "network effects" because they model the NBA as a network of nodes representing players & coaches, with network links representing business relationships like "played for" or "played with".)  The model does surprisingly well at predicting season performance when compared against a variety of other models.

From the viewpoint of predicting NCAA basketball games, this work has limited applicability.  First of all, these authors are predicting the outcome of the entire season, not individual games.  Second, the nature of the NBA -- with the most important players having 10+ year careers and often changing teams -- makes the year-to-year movement of players more relevant than in the NCAA game.  On the other hand, any predictive value this information has seems likely to be orthogonal to the information from past game performances, which would be valuable.

A network-based ranking system for US college football
Juyong Park and M. E. J. Newman
Department of Physics and Center for the Study of Complex Systems,
University of Michigan, Ann Arbor, MI 48109
This paper ranks college football teams by calculating a score based upon "total win score" and "total loss score".  The total win score is the sum of the team's total wins plus the total win score of all the opponents it beat (discounted by a constant factor).  Total loss score is calculated in a similar way, and the final score is total win score minus total loss score.
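As a rough sketch of the recursion (my own illustration, not the paper's exact formulation), the total win score can be computed by iterating to a fixed point:

```python
def total_win_scores(beat, gamma=0.5, iters=50):
    # beat[t] lists the teams t defeated, one entry per win.
    # W(t) = wins(t) + gamma * sum of W over the teams t beat.
    W = {t: 0.0 for t in beat}
    for _ in range(iters):
        W = {t: len(opps) + gamma * sum(W[o] for o in opps)
             for t, opps in beat.items()}
    return W

# A beat B, B beat C: A gets extra credit for beating a team with a win.
print(total_win_scores({"A": ["B"], "B": ["C"], "C": []}))
# {'A': 1.5, 'B': 1.0, 'C': 0.0}
```

The total loss score would be computed the same way over the teams each team lost to, with the final rating being the difference of the two.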

This approach is similar to systems like infinitely deep RPI, or Govan ratings, although the former uses win percentage rather than wins and losses, and the latter uses points scored/allowed.  This approach seems to do fairly well at ranking (the authors didn't use it for prediction) and may be worth trying for college basketball.
Are Sports Betting Markets Prediction Markets?
Evidence from a New Test
Kyle J. Kain and Trevon D. Logan
This paper looks at the predictive value of point spreads and over/under lines from bookmakers on NFL, NBA, NCAA college football, and NCAA college basketball games from 2004-2010.  Without delving into the details, the bottom line from the paper is:
Our joint tests revealed that while the betting line is an accurate predictor of the margin of victory, the over/under is a poor predictor of the sum of scores in a contest.
I suspect this is because over/under is much more difficult to predict.  But this suggests that if you're out to beat the bookmakers, you might want to focus your efforts on predicting over/under rather than margin of victory.

Using ELO ratings for match result prediction in association football
Lars Magnus Hvattum, Halvard Arntzen
This paper applies the ELO rating to association football and compares it to various other predictors.  Vanilla ELO uses just the match outcome, but the authors modified the algorithm to use the score differential as well.  Performance was on par with other statistical predictors, but did not beat the oddsmakers.

Wednesday, November 21, 2012

Another Approach to Early Season Performance

Continuing on with my efforts to better model early season performance, it occurred to me that it might be good to model a team as the average of several previous years' teams.  So we'd predict that Duke 2012-2013 would perform like an average of the 2009-2010, 2010-2011, and 2011-2012 teams.

This is a fairly straightforward experiment in my setup -- I just read in all three previous seasons as if they were one long preseason, and then predict the early season games.  Of course, with a twelve thousand game "preseason" this takes a while -- particularly when you keep making mistakes at the end of the processing chain and have to start over again :-).

At any rate, the conclusion is that this approach doesn't work very well.  The MOV error over the first thousand games was 12.60 -- worse than just priming with the previous season's data.

Tuesday, November 13, 2012

More on Early Season Performance

Prior to my recent detour, I was looking at predicting early season performance.  To recap, experiments showed that predicting early season games using the previous season's data works fairly well for the first 800 or so games of the season.  However, "fairly well" in this case means an MOV error of around 12, which is better than predicting with no data, but not close to the error of around 11 we get with our best model for the rest of the season.  The issue I want to look at now is whether we can improve that performance.

A reasonable hypothesis is that teams might "regress to the mean" from season to season.  That is, the good teams probably won't be as good the next season, and the bad teams probably won't be as bad.  This will be wrong for some teams -- there will be above-average teams that get even better, and below-average teams that get even worse -- but overall it might be a reasonable approach.

It isn't immediately clear, though, how to regress the prediction data for teams back to the mean.  For something like the RPI, we could calculate the average RPI for the previous season and push team RPIs back towards that number.  But for more complicated measures that may not be easy.  And even for the RPI, it isn't clear that this simplistic approach would be correct.  Because RPI depends upon the strength of your opponents, it might be that a team with an above-average RPI that played a lot of below-average RPI teams would actually increase its RPI, because we would be pushing the RPIs of its opponents up towards the mean.

A more promising (perhaps) approach is to regress the underlying game data rather than trying to regress the derived values like RPI.  So we can use the previous season's data, but in each game we'll first reduce the score of the winning team and raise the score of the losing team.  This will reduce the value of wins and the reduce the cost of losses, which should have the effect of pulling all teams back to the mean.
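A sketch of that adjustment (the exact mechanics are my guess -- I'm treating the 1% as a multiplicative shave on each score):

```python
def shrink_game(winner_pts, loser_pts, pct=0.01):
    # Shave pct off the winner's score and add pct to the loser's,
    # pulling the margin of victory toward zero.
    return winner_pts * (1 - pct), loser_pts * (1 + pct)

print(shrink_game(70, 60))  # a 70-60 game becomes roughly 69.3-60.6,
                            # so the margin drops from 10 to about 8.7
```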

The table below shows the performance when scores were modified by 1%:

  Predictor                           % Correct    MOV Error
  Early Season w/ History               75.5%        12.18
  Early Season w/ Modified History      71.7%        13.49

Clearly not an improvement, and also a much bigger effect than I had expected.  After all, 1% changes most scores by less than 1 point.  (Yes, my predictor is perfectly happy with an 81.7 to 42.3 game score :-)  So why does the predicted score change by enough to add 1+ points of error?

Looking at the model produced by the linear regression, this out-sized response seems to be caused by a few inputs with large coefficients.  For example, the home team's average MOV has a coefficient of about 3000 in the model.  So changes like this scoring tweak that affect MOV can have an outsized impact on the model's outputs.
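The arithmetic of that sensitivity is simple (the ~3000 coefficient is from the model above; the size of the input shift is hypothetical):

```python
coef = 3000.0             # approximate coefficient on home average MOV
delta_input = 0.0005      # hypothetical shift in that input after the tweak
print(coef * delta_input) # a 1.5 point swing in the predicted margin
```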

With that understood, we can try dialing the tweak back by an order of magnitude and modify scores by 0.1%:

  Predictor                                   % Correct    MOV Error
  Early Season w/ History                       75.5%        12.18
  Early Season w/ Modified History (0.1%)       74.8%        12.15

This does slightly improve our MOV error.  Some experimenting suggests that the 0.1% is about the best we can do with this approach.  The gains over just using the straight previous season history are minimal.

Some other possibilities suggest themselves, and I intend to look at them as time permits.

Thursday, November 8, 2012

How to Pick a Tournament Bracket, Part 2

In the previous post, I looked at a strategy for picking a Tournament bracket.  The basic idea is that to win a sizable Tournament challenge, you can't just pick the most likely outcome of each game.  You're going to have to pick at least some of the inevitable upsets correctly.  A reasonable way to do that is to decide how many points from upsets you think you'll need, and then pick some combination of upsets to reach that number.  It turns out the best way to do that is to pick late-round upsets between closely-matched teams.

However, there are some concerns with that approach.  One is that if you pick "likely" upsets (such as a #2 over a #1), it's reasonable to assume that many of your competitors might pick the same upset.  So although the upset might be both likely and high-scoring, it might not do much to separate you from your competitors.  That's an interesting problem, but one we'll leave for another day.

Another concern is that the strategy is "all or nothing."  We are assuming that we'll need (say) G = 16 points to win the Tournament challenge and make picks accordingly.  But in truth our chance of winning the Tournament challenge is more of an S-curve:

We have some guess at G that will give us a reasonable chance to win the challenge, but we might end up needing more or we might be able to win with less.  With G = 16 the strategy I've outlined so far leads us to pick a single 16 point upset in a semi-final game.  This is fine if the upset occurs.  But if it doesn't, losing 16 points moves us a sizable distance to the left on the S-curve and greatly reduces our chances of winning.  Something of this sort happened to my entry (the Pain Machine) in the last Machine Madness contest -- the PM predicted a Kansas-Kentucky upset that would have left it at G = 34.  But since Kentucky won, the PM ended up at G = 2 and lost the contest to a predictor at G = 7.  We'd really like a strategy that optimizes our chance to win under all possible scenarios.

Mathematically, this is the sum over all possible outcomes of the likelihood of the outcome times the likelihood of winning under that outcome.  If we pick a single 16 point upset, then there are two possible outcomes: the upset happens or it doesn't.  If L(n) is the likelihood of winning the challenge with n points, then expected value of that strategy is:
EV = L(0) * (1 - p(u,v)) + L(16) * p(u,v)

But if instead we picked two 8 point upsets, then there are four possible outcomes: neither of the upsets occur, the first upset occurs but the second doesn't, the first doesn't but the second does, or they both occur.  The expected value of this strategy is more complicated: 
EV = L(0) * (1 - p(u1,v1)) * (1 - p(u2,v2)) +
     L(8) * p(u1,v1) * (1 - p(u2,v2)) +
     L(8) * (1 - p(u1,v1)) * p(u2,v2) +
     L(16) * p(u1,v1) * p(u2,v2)
Depending upon the probability of the various outcomes and the likelihood of winning, the expected value of this strategy might be higher than picking a single 16 point upset, even though the chances of scoring the full 16 points are reduced.
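To make the comparison concrete, here's a small sketch with a hypothetical S-curve L(n) and made-up upset probabilities:

```python
def ev_single(L, p16):
    # One 16-point upset pick: it either hits or it doesn't.
    return L(0) * (1 - p16) + L(16) * p16

def ev_split(L, p1, p2):
    # Two independent 8-point upset picks: four possible outcomes.
    return (L(0)  * (1 - p1) * (1 - p2) +
            L(8)  * p1 * (1 - p2) +
            L(8)  * (1 - p1) * p2 +
            L(16) * p1 * p2)

# Hypothetical chance of winning the challenge with n upset points.
L = {0: 0.05, 8: 0.40, 16: 0.90}.get

print(ev_single(L, 0.30))       # about 0.305
print(ev_split(L, 0.45, 0.45))  # about 0.395 -- splitting wins here
```

With these (invented) numbers the two-upset strategy has the higher expected value, even though its chance of collecting the full 16 points (0.45 * 0.45 ~ 0.20) is lower than the single pick's 0.30.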

Up until now, I've been implicitly assuming that the possible outcomes of an upset pick are either zero or n points.  But that's not really true.  The cost of an incorrect pick can be greater than just losing the points for that game. 

For example, last year the Pain Machine correctly predicted that #15 Lehigh would beat #2 Duke.  Given the rarity of 15-2 upsets, that was an amazing prediction.  But even if it was a very likely upset, it would have been a bad pick, because there was a potentially high cost if the upset didn't happen.  To see why, here is the bracket:

If the prediction is correct, the Pain Machine picks up 1 point for the correct first round prediction.  But if the prediction is incorrect, the Pain Machine is very likely to lose 2 points when Duke wins in the second round, 4 more points when they win in the third round, and so on.  (As a #2 seed, we expect Duke to win until the round of eight.)

We can generalize this idea as a value formula for a win by U over V:

        V(u,v) = (p(u,v) * round_i) - (p(v,i+1) * round_(i+1)) - (p(v,i+1) * p(v,i+2) * round_(i+2)) - ...

Here, p(u,v) represents the probability that U defeats V, round_i represents the scoring value for the i-th round of the tournament, and p(v,i+1) represents the probability of V defeating their likely opponent in round i+1 if they had not been upset by U.  To return to the Lehigh-Duke example, the value is the probability of Lehigh beating Duke times 1 (the value of that round) minus the probability of Duke beating Notre Dame (their expected opponent) times 2 (the value of that round), and so on.

To maximize V(u,v) we must maximize p(u,v) and minimize p(v,i+1).  And since round_(i+1) = 2*round_i, it is twice as important to minimize p(v,i+1).  To translate this into plain English, we want to pick upsets where the team being upset has very little chance to win its next round game.  That's why the Lehigh upset was a poor pick -- because as a #2 seed Duke had a very good chance to win its second round game.
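The value formula is easy to compute.  In this sketch (all the probabilities are hypothetical), the Lehigh-Duke pick scores badly and a firewalled pick scores well:

```python
def upset_value(p_uv, p_v_next, round_pts):
    # V(u,v): points won if the upset hits, minus the points lost in
    # each later round V was expected to win had it not been upset.
    # p_v_next[k] is p(v, i+1+k); round_pts[k] is the value of round i+k.
    value = p_uv * round_pts[0]
    survive = 1.0
    for k, p in enumerate(p_v_next):
        survive *= p
        value -= survive * round_pts[k + 1]
    return value

# Lehigh over Duke: even a generous upset chance is swamped by Duke's
# expected wins in rounds 2 and 3.
print(upset_value(0.15, [0.85, 0.70], [1, 2, 4]))   # about -3.93
# Xavier over Notre Dame: Duke "firewalls" the pick in round 2.
print(upset_value(0.45, [0.15], [1, 2]))            # about 0.15
```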

Instead, this formula will value upset picks like #10 Xavier over #7 Notre Dame.  (Which also happened!) To see why, look again at the bracket:
Whichever team wins the first round game -- Notre Dame or Xavier -- is likely to lose the second round game to the stronger Duke team.  Thus the downside of the upset pick is minimized -- if Notre Dame wins and then loses to Duke as expected, you'll only have lost one point for the incorrect upset pick.

This insight is nothing new.  Canny pickers already look for upsets that are "firewalled" off in the next round by a strong opponent.  However, the value formula above gives us an objective measure for comparing possible upset picks.  I suspect that most people incorrectly assess the p(u,v) vs. p(v,i+1) tradeoff.  Because scoring doubles each round, the firewall in the next round matters more than the upset chance itself -- which most people probably find counter-intuitive.

Unfortunately, the strategy of "firewalling" upset picks runs counter to the strategy of picking high-scoring late round upsets, because (assuming most games are not upsets) the mismatches which make good firewalls primarily occur in the early rounds.  If the tournament runs mostly true to the seedings, the late round games are usually between closely-matched teams and do not make good firewalls.  An interesting exception is the Championship game itself.  If you pick an upset in the Championship game incorrectly, you're guaranteed not to lose any additional points. 

To summarize these thoughts about picking a tournament bracket:
  1. A bracket consisting of chalk picks and true mis-seedings is not likely to win a sizable Tournament challenge.
  2. Picking late-round upsets between highly-seeded teams has the advantages of (1) scoring a lot of points, and (2) being relatively likely to occur.
  3. To maximize the overall chance of winning the challenge, it may be better to spread your upset picks rather than bet "all or nothing."
  4. Upset picks which are firewalled in the next round reduce the downside risk of an incorrect pick.

Wednesday, November 7, 2012

How To Pick A Tournament Bracket, Part 1

Pre-season is probably not the best time to be pondering the Tournament, but I've been recently thinking a bit more about the challenge of predicting the Tournament with the goal of winning something like the ESPN Tournament Challenge or the Machine Madness contest.  These sorts of contests are a dilemma to a machine predictor, because most predictors try to determine who is most likely to win a particular matchup.  But of course, that's exactly how the Tournament is seeded.  So the machine predictors end up predicting almost entirely "chalk" outcomes.

The only time the machines don't predict a win for the higher seed is when they believe the teams have been mis-seeded -- that is, when the Committee has made a mistake in their assessment of the relative strengths of the teams.  In last year's Machine Madness contest, Texas over Cincy and Purdue over St. Mary's were consensus upset picks by five of the six predictors -- strong evidence (to my mind, anyway) that those teams were mis-seeded.  But, for all the grumbling by fans, the Committee does a pretty good job at seeding the Tournament, and you can't expect to find many true mis-seedings. 

Neither chalk picks nor mis-seedings are likely to win a Tournament challenge against a sizable field.  That's because (1) a lot of your competitors will have made the same picks, (2) there will be a significant number of true upsets where a weaker team beats a stronger team (historically, 22% in the first round, and 15% for the Tournament overall), and (3) someone out there will have picked those upsets.  So to win a Tournament challenge, the machine is going to have to pick some actual upsets -- and then hope that it gets lucky and those upsets are the ones that happen.

Knowing the historical frequency of upsets, my strategy last year was to force my predictor to pick 6 upsets in the first round and 5 more in the rest of the tournament.  But is that the right way to pick upsets?  How can we pick upsets to maximize (in some sense) our chance to win the Tournament challenge?

The first problem in answering this question is knowing how many points will be sufficient to win the challenge, because that will drive the selection of upsets.  Obviously, it's impossible to know this number a priori.  However, we could look at previous Tournament challenges and see how many points the competitors in the top (say) 1% had scored off correctly predicting upsets.  That would provide a reasonable goal G for our upset calculations.

Sadly, ESPN, Yahoo, etc., seem to remove the Tournament challenge information from the Internets fairly quickly, so I can't actually research this.  (If someone has some info on this, please let me know!)  However, we do have the results of the last two Machine Madness contests.  Last year, the winning entry scored 127 points and the "chalk" (baseline) entry scored 120 points, for G = 7.  The year before, the winning entry scored 69 points and the chalk entry scored 57 points, for G = 12.  (There's undoubtedly a correlation between the size of the field and G.   G = 12 might be sufficient most years to win the Machine Madness contest, but probably wouldn't be enough to win the ESPN Tournament Challenge.)

If we adopt the notation that V(u,v) is the value of a victory of Team U over Team V, then we will want to pick upsets such that:
G < V(u1,v1) + V(u2,v2) + V(u3,v3) ...
Because of the way the tournament is structured, the value of V(u,v) is determined by the seeding of the two teams.  The following table has seedings down both axes and shows how many points an upset is worth:

For example, a #8 seed beating a #1 seed is worth 2 points, because that matchup will necessarily occur in the second round.
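For a single 16-team region, the round in which two seeds must meet (and hence the upset's point value, assuming scoring doubles each round and chalk results up to the meeting) can be computed from the standard bracket layout:

```python
# Standard first-round ordering of seeds in a 16-team region.
BRACKET_ORDER = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]

def upset_points(seed_a, seed_b):
    # Walk up the bracket until the two slots fall into the same game;
    # an upset in round r is worth 2^(r-1) points.
    a, b = BRACKET_ORDER.index(seed_a), BRACKET_ORDER.index(seed_b)
    rnd = 1
    while a // 2 != b // 2:
        a, b, rnd = a // 2, b // 2, rnd + 1
    return 2 ** (rnd - 1)

print(upset_points(8, 1))  # 2 -- an 8-over-1 upset happens in round 2
print(upset_points(2, 1))  # 8 -- a 2-over-1 upset is the regional final
```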

If we adopt the notation that p(u1,v1) is the probability of u1 defeating v1, then the probability of scoring all G points is:
p(u1,v1) * p(u2,v2) * p(u3,v3)  ...
(because we must get all of our upset picks correct to score G points).

Now imagine that we are predicting the tournament and we know that most games have a 0% chance of an upset.  However, four of the third round games are very likely upsets -- 49%.  And one of the semi-final games has a slight chance of an upset -- about 6%.  If G = 16, which upsets should we pick?

The (possibly surprising) answer is that we should pick the very unlikely semi-final upset!  To get 16 points we have to pick either the semi-final game, or all four of the third round games, and:

       .06 > .49*.49*.49*.49

The joint probability equation combined with the typical Tournament challenge scoring means that it will almost always be better to pick unlikely late round upsets that score highly than multiple likely early round upsets that score poorly.  Knowing this, it's easy to see that my strategy in previous years to force a certain number of upsets into my bracket was very non-optimal.
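The comparison above is just a joint-probability product:

```python
p_semi = 0.06      # one 16-point semi-final upset
p_third = 0.49     # each of four 4-point third-round upsets

# All four third-round picks must hit to collect the same 16 points.
print(p_third ** 4)            # about 0.058
print(p_semi > p_third ** 4)   # True -- the single long shot is likelier
```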

So, given a goal G and upset probabilities p(u,v), we have an approach for selecting upsets from our bracket.  We've seen how to calculate V(u,v) and how to estimate G.  How can we estimate the upset probabilities?

Many predictors will produce something that can be used to estimate upset probabilities.  For example, in past years my predictor has used the predicted MOV to estimate upset probabilities -- the slimmer the predicted margin of victory, the more likely an upset.  But lacking any information of that sort, we could estimate the upset probabilities based upon historical performance of seeds within the tournament:

This table shows the upset percentage for each seeding matchup for the last ten years.  (I have left out matchups that have occurred 4 times or fewer.)  Each upset percentage is shaded to indicate the value of the matchup in typical Tournament challenge scoring.  For example, matchups between #1 seeds and #2 seeds have been won by the #2 seeds 52% of the time, and they are worth 8 points.  With a few oddball exceptions (such as the 2-10 matchups), this table shows that you should prefer to pick upsets of the #1 or #2 seeds by #2 or #3 seeds.  These matchups are worth the most points and -- because the teams are closely seeded -- are nearly tossups.

So if G = 16, filling out your bracket with all chalk picks and two #2 over #1 upsets would give you the best chance to win the Tournament challenge.

More thoughts to follow in Part 2 at some later date.

Friday, November 2, 2012

Papers Archive Updated

The archive of academic papers on rating sports teams or predicting game outcomes has been updated to include the papers reviewed here as well as about a half-dozen other new papers.  A listing of the papers and a link to the archive can be found on the Papers link on the right side of this website.

I'm always interested in the latest research in this area, so please let me know if you're publishing a relevant paper or if you have a pointer to a paper I've missed.  Thanks!

Wednesday, October 31, 2012

Recent Papers

I took some time out recently to read through some of the basketball prediction papers from this year's MIT Sloan Sports Analytic Conference.  Here are some thoughts...
Insights from the LRMC Method for NCAA Tournament Prediction
Mark Brown, Paul Kvam, George Nemhauser, Joel Sokol
MIT Sloan Sports Analytics Conference 2012

The latest paper from the LRMC researchers compares the performance of LRMC to over 100 other ranking systems as reported by Massey here.  The measure of performance used is correct prediction of the NCAA tournament games.  LRMC out-performs all of the other rankings, getting 75.5% correct over 9 years.  The next best predictor did 73.5%.  (I don't optimize my predictor on this metric, but it also gets about 73.5% correct.)

The LRMC work is always interesting and well done.  A couple of notes that pop to mind:

(1) The advantage LRMC has over the other models is not huge.  LRMC gets 75.5% correct; the 20th ranked model gets about 72% correct -- a difference of about 3 games per tournament.  That's certainly meaningful, but in a test set of only 600 games it may not be statistically significant.  One very good year (or one very bad year) could move a rating substantially.  It would be interesting to see the year-to-year performance of the ratings, but the authors don't provide that information.

(2) The authors assume there is no home court advantage (HCA) in the NCAA tournament and simply predict that the higher-rated team will win.  In my testing, including an HCA for the higher-seeded team improves prediction performance.  For example, this paper reports the performance of RPI as about 70% in predicting tournament games.  In my testing, RPI with HCA predicted about 73% correctly.  So the results may be skewed depending upon how much effect HCA has on each prediction model.  (The authors don't use HCA for LRMC, so that model might do better as well.)

(3) In this paper, the authors test against all the matchups that actually occurred in the tournament -- that is, they do not "fill out a bracket" and commit to game predictions at the beginning of the tournament.  In 2011, LRMC was included in the March Madness Algorithm Challenge and finished quite poorly -- outscored by all but three of the other entrants.   (A similar result can be seen here.)  Taking a look at the LRMC bracket for 2012 (here), LRMC got 22 correct picks out of the initial 36 games -- and got only one of the three play-in games correct, missed all of the upsets, and predicted two upsets that did not occur.  Eight of the entries in the algorithm challenge picked more first-round games correctly.  In fact, LRMC's only correct predictions in the entire tournament were higher seeds over lower seeds.  And once again it would have lost the algorithms challenge.

(4) My own attempts to implement LRMC and use it to predict MOV (found here) have performed more poorly (around 72%) than the authors report in this paper.  It may be that my implementation of LRMC was faulty, or that LRMC happened to perform slightly worse on my test data than on the tournament games used in this paper.

Moving on from the performance of LRMC, there are a couple of other interesting results in this paper.  One is that home court advantage does not vary substantially from team to team.  This confirms my own experiments.  (I don't think I've reported on those experiments -- perhaps I'll write them up.)  A second is that the natural variance in games is around 11 points, which matches closely what I've found.  The last is that the authors found that the cliche "good teams win close games" doesn't seem to have any validity.

Can Statistical Models Out-predict Human Judgment?: Comparing Statistical Models to the NCAA Selection Committee
Luke Stanke
MIT Sloan Sports Analytics Conference 2012
As with the LRMC paper, this paper looks at predicting tournament game outcomes. In this case, the author compares the NCAA committee seedings and RPI ratings to four different Bradley-Terry models.  For more on Bradley-Terry models, see here.
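For readers unfamiliar with Bradley-Terry: the basic model says team i beats team j with probability r_i / (r_i + r_j), and the ratings can be fit with a simple iterative update.  Here is a minimal Python sketch -- not Stanke's implementation, just an illustration of the idea:

```python
from collections import defaultdict

def bradley_terry(games, iters=100):
    """Fit Bradley-Terry ratings from (winner, loser) results using the
    standard iterative update r_i <- wins_i / sum_j [ n_ij / (r_i + r_j) ],
    where n_ij is the number of games between teams i and j."""
    teams, wins = set(), defaultdict(int)
    pair_games = defaultdict(int)  # frozenset({i, j}) -> games between i and j
    for winner, loser in games:
        teams.update((winner, loser))
        wins[winner] += 1
        pair_games[frozenset((winner, loser))] += 1
    ratings = {t: 1.0 for t in teams}
    for _ in range(iters):
        updated = {}
        for i in teams:
            denom = 0.0
            for pair, n in pair_games.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (ratings[i] + ratings[j])
            updated[i] = wins[i] / denom if denom else ratings[i]
        total = sum(updated.values())  # normalize so the scale doesn't drift
        ratings = {t: len(teams) * r / total for t, r in updated.items()}
    return ratings

# The fitted ratings give win probabilities: P(i beats j) = r_i / (r_i + r_j).
```

(A real implementation would smooth zero-win teams to keep the update stable, but this is enough to show the shape of the model.)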

Stanke reports the results of testing these models against (approximately) the same games used in the LRMC paper:

The highest performing models are the Bradley-Terry models using only win/loss data. These two models correctly predicted approximately 89% of games in the NCAA tournament games from the past eight seasons. The next group of models is the Bradley-Terry models using points as a method for ranking teams. These models predicted over 82% of games correctly. The third group is the alternative models: the Committee Model, the RPI Model, and the Winning Percentage Model. These models range from 69.1% of games correctly picked to 72.9% of games correctly picked.
This is certainly an interesting result -- particularly in light of the claims of the LRMC paper.  According to the LRMC authors, LRMC's 75.5% success rate out-performed over 100 other rankings from Massey's page, and the Vegas line's success rate of ~77% is an upper-bound to performance. 

So what explains this disparity?  I didn't know -- so I sent off an email to the author.  Luke Stanke replied to say that the result was caused by a coding error, and that actual performance was around 72%.   (I know all about coding errors... :-)  So his results here are in line with the expected performance for Bradley-Terry type rating systems.  His conclusion remains unchanged -- that computer rating systems are better than the committee at selecting and seeding the tournament, and that Bradley-Terry would be better than RPI.  I won't disagree with either conclusion! :-)

Using Cumulative Win Probabilities to Predict NCAA Basketball Performance
Mark Bashuk
MIT Sloan Sports Analytics Conference 2012
Bashuk lists his affiliation as "RaceTrac Petroleum," so, like me, he appears to be an interested amateur in game prediction.  In this paper he describes a system that uses play-by-play data to create "Cumulative Win Probabilities" (CWP) for each team and, eventually, a rating.  He uses this rating to predict game outcomes, and for the 2011-2012 season correctly predicts 72.6%.  In comparison, Pomeroy predicts 77.7% correctly and the Vegas Opening Line 75.2%.

It is unclear to me after reading the paper exactly how CWP and the ratings are calculated.  However, unlike most authors, Bashuk has made his code available on the Web.  (URLs are provided in Appendix 1 of the paper.)  This is very welcome to anyone trying to reproduce his results.  Unfortunately for me, Bashuk's code is in SQL, which I don't understand well.  So poring through it and understanding his process may take some time.

Thursday, October 25, 2012

A Detour into RapidMiner

A chunk of visitors to this blog find it looking for RapidMiner, so I thought I'd take a detour to explain the RapidMiner process I'm using to explore early season performance.  This RapidMiner process uses training data to build a model, applies the model to separate test data, and then measures performance.  This is something of a sequel to the post I did for Danny Tarlow over at his blog.  Hopefully it will be useful to some folks as an example of how to put together a more complex RapidMiner process, as well as how to apply a model to test data, which wasn't covered in the previous post.

(Reminder: RapidMiner is a free data-mining tool that you can download here.)

The (unreadable) graphic above illustrates the entire process.  There are four parts to this process.  In Process 1, the training data is read in and processed.  In Process 2, the test data is read in and processed.  In Process 3, the training data is used to build a linear regression and then the model from that regression is applied to the test data.  In Process 4, the results are processed and performance measures calculated.  I'll now go into each process in detail.

The graphic above shows Process 1 in more detail. It's a straightforward linear flow starting at the upper left and ending at the lower right.  The steps are:
  1. Read CSV -- This operator reads in the training data, which is simply a large text file in comma-separated value (CSV) format, with one line for every game (record) in our training data set.
  2. Generate ID -- This operator adds a unique ID attribute to every record in our training data set.  (We'll see later why it is useful to have a unique ID on every record in the data set.)
  3. Rename by Replacing -- This operator is used to rename attributes in the data set.  In this case, I use it to replace every occurrence of a dash (-) with an underscore ( _ ).  Dashes in attribute names are problematic when you do arithmetic on the attributes, because they get mistaken for minus signs.
  4. Generate Attributes -- This operator generates new attributes based on the existing attributes.  In this case, I calculate a new attribute called "mov" (Margin of Victory) by subtracting the visiting team's score from the home team's score.
  5. Set Role -- Most attributes are "regular" but some have special roles.  For example, the ID attribute generated in step 2 has the "id" role.  Here I use the Set Role operator to set the role of the "mov" attribute to "label."  This role identifies the attribute that we are trying to predict.
  6. Read Constructions -- You can use the Generate Attributes operator to generate new attributes, but that's not convenient if you want to generate a lot of new attributes, or if you want to generate new attributes based on some external inputs.  In my case, I have generated and tested many derived statistics, and entering them manually into "Generate Attributes" was not feasible.  The Read Constructions operator reads formulas to generate new attributes from a file and creates them in the data set.  Using this, I was able to have a Lisp program create a (long) list of derived statistics to test, write them to a file, and then have the RapidMiner process construct them automatically.
  7. Replace Missing Values -- This is the first of several data cleanup operators.  There shouldn't be any missing values in my data sets, but if there are, this operator replaces the missing value with the average over the rest of the data.
  8. Replace Infinite Values -- Some of the constructions in Step 6 can result in "infinite" values if (for example) they cause a divide by zero.  A pair of these operators replaces positive infinite values with 250 and negative infinite values with -250.
  9. Select Attributes -- The last operator in this process removes some attributes from our data.  In particular, we don't want to leave the scores in the data -- because the predictive model will (rightfully) use those to predict the MOV.  (The MOV itself is not a problem, because it has the "label" role.)  We also remove a couple of other attributes (like the team names) that would cause other problems.
So at the end of this process, we have read in the training data, processed it to contain all the attributes we want and none that we don't want, and cleaned up any inconsistencies in the data.
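For anyone who wants to follow along without RapidMiner, here's a rough Python sketch of the same Process 1 steps.  This is just an illustration, not my actual process -- the field names ("hscore", "vscore", the "off-eff" statistic) are invented for the example:

```python
import math

def preprocess(records, infinity_cap=250.0):
    """Mirror the Process 1 steps on a list of game dicts (one dict per game)."""
    # Generate ID: a unique id per record, needed for the Join step later
    for i, rec in enumerate(records):
        rec["id"] = i
    # Rename by Replacing: dashes in attribute names become underscores
    records = [{k.replace("-", "_"): v for k, v in rec.items()} for rec in records]
    # Generate Attributes: mov = home score minus visiting score
    for rec in records:
        rec["mov"] = rec["hscore"] - rec["vscore"]
    # Replace Missing Values / Replace Infinite Values
    numeric = [k for k in records[0] if isinstance(records[0][k], (int, float))]
    for k in numeric:
        finite = [r[k] for r in records if r[k] is not None and math.isfinite(r[k])]
        mean = sum(finite) / len(finite)
        for r in records:
            if r[k] is None or math.isnan(r[k]):
                r[k] = mean                               # missing -> column average
            elif math.isinf(r[k]):
                r[k] = math.copysign(infinity_cap, r[k])  # infinite -> +/-250
    # Select Attributes: drop the raw scores so the model can't peek at them
    return [{k: v for k, v in rec.items() if k not in ("hscore", "vscore")}
            for rec in records]
```

(Each comment above corresponds to one operator in the RapidMiner process graph.)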

Process 2 is exactly the same as Process 1, except it is applied to the test data.  It's important to ensure that both the training data and the test data are treated identically.  If they aren't, you'll get misleading results or cause an error later in the process.  (I should point out that RapidMiner can bundle up a process into a sub-process and reuse it in multiple places, and that's probably what I should do here.)

The graphic above shows Process 3.  I've left in the "Select Attributes" from the end of Process 1 and Process 2 for context.  Here are the steps in Process 3:
  1. Linear Regression -- RapidMiner offers a wide variety of classification models that can be used for prediction.  In this case we're using a linear regression.  The input to this operator is the training data, and the output is a model.  This operator trains itself to predict the "label" attribute (MOV in our case) from the regular attributes.  The model it produces is a linear equation based upon the regular attributes.  In my process here, I'm training the model every time I run the process.  It's also possible to train the model once, save it, and re-use it every time you want to test or predict.  In my case, I tweak the data and/or process almost continuously, so it's easiest just to re-train every time.  There are about 15K records in the training data set, and the Linear Regression takes a couple of minutes on my laptop.  Other classification operators are much slower, and re-training each time is not feasible.
  2. Apply Model -- This operator applies the model from step 1 to the testing data from Process 2 and "labels" it -- that is, it adds a "prediction(mov)" attribute that has a predicted Margin of Victory for the game.
  3. Join -- This operator "joins" two data sets.  To do this, it finds records in the two data sets that have the same ID and then merges the attributes into a single record.  (Now we see why we need a unique ID!)  The two data sets being merged here are (1) the labeled data from the model, and (2) the original data from the Select Attributes operator.  Recall that the Select Attributes operator is used to remove unwanted attributes from the data, including the team names and scores.  So the labeled data coming out of the model does not have that information.  However, to evaluate our predictive performance we need the scores (so we can compare the actual outcome to the predicted outcome) and it would be nice to have team names and dates on the data as well.  So this Join operator puts those attributes back into our data.  In general, this is a useful technique for temporarily removing attributes from a data set.
At this point, we have a data set which consists of our test data with an added attribute "prediction(mov)" containing the predicted margin of victory for each game.  Next we want to see how well our model performed.
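The Join-on-ID technique is worth a sketch of its own, since it's so generally useful.  In Python terms (again with invented field names), it's just a dictionary lookup on the unique ID:

```python
def join_on_id(labeled, original):
    """Inner-join two record lists on their shared unique "id" attribute,
    restoring columns (team names, scores) stripped before modeling."""
    by_id = {rec["id"]: rec for rec in original}
    return [{**by_id[rec["id"]], **rec} for rec in labeled]
```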

The graphic above shows Process 4.  I've left in the "Join" from Process 3 to make it clear where it connects.  Here are the steps to Process 4:
  1. Rename -- The first step is to rename the "prediction(mov)" attribute to "pred".  The parentheses in this name can confuse some later processing, so it's best just to remove them.
  2. Generate Attributes -- Next we generate a new attribute called "correct".  This attribute is 1 if we've correctly predicted the winner of the game, and 0 if not.  RapidMiner provides a powerful syntax for defining new attributes.  In this case, "correct" is defined as "if(sgn(mov)==sgn(pred),1,0)" -- if the sign of our predicted MOV is the same as the sign of the actual MOV, then we correctly predicted the winner.
  3. Write Excel -- At this point, I save the results to an Excel spreadsheet for later reference and processing (e.g., to produce the graphs seen here).
  4. Multiply -- I like to look at two different measures of performance, so I create two copies of the test data.  This isn't strictly necessary in this case (I could chain the two Performance operators) but this is another example of a useful general technique.
  5. Performance -- RapidMiner provides a powerful operator for measuring performance that can assess many different measures of error and correlation.  In the top use of Performance in this process, I use the built-in "root mean squared error" measure and apply it to the predicted MOV to calculate the RMSE error.
  6. Aggregate / Performance -- The second measure of performance I like to look at is how often I predicted the correct winner.  (I might prefer a model that predicts the correct winner more often even if it increases the RMSE of the predicted MOV.)  I want to know this number over the entire data set, so the first step is to Aggregate the "correct" attribute.  This produces a new attribute "sum(correct)" which is the number of correct predictions over the whole data set (and has the same value for every record in the data set).  This is then reported by the Performance operator as a performance measure.  The Performance operator isn't strictly necessary in this situation -- I could just report out the "sum(correct)" value -- but in general marking this as a measure of performance allows me to (for example) use the value to drive an optimization process (e.g., selecting a subset of attributes that maximizes the number of correct predictions).
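The two performance measures themselves are simple to state outside of RapidMiner.  A Python sketch, mirroring the "if(sgn(mov)==sgn(pred),1,0)" attribute and the RMSE operator:

```python
import math

def sign(x):
    return (x > 0) - (x < 0)

def performance(games):
    """games: list of (actual_mov, predicted_mov) pairs.  Returns the RMSE of
    the predicted MOV and the count of correctly predicted winners."""
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in games) / len(games))
    correct = sum(1 for a, p in games if sign(a) == sign(p))
    return rmse, correct
```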
And that's "all" there is to it.  One of the advantages of RapidMiner is that the graphical interface for building processes lets you quickly lay out a process, as well as easily modify it (switching the Linear Regression to an SVM, for example).

Wednesday, October 24, 2012

A Closer Look at Early Season Prediction Performance

In the previous post, I looked at predicting early season games using my standard predictive model and found that performance was (understandably) much worse for the early season games where teams had no history of performance than in late season games, where we had the whole season's history to help guide the prediction.  I also looked at using the previous season's games to "prime the pump" and found that improved performance considerably.  In this post, I'll take a closer look at those two cases.

The graph above plots the prediction error for a moving twenty game window throughout the first 1000 games of the season.  (Note #1: The twenty game window is arbitrary -- but the data looks the same for other window sizes.  Note #2: This drops the first game for every team.  The model predicts a visiting team win by 224 points for those games, which greatly distorts the data.)  The green line is a linear regression to the data.  The prediction error starts out high (15+) and drops steadily throughout the 1000 games until at the end, it is close to the performance of the model for the rest of the season.

(There are some interesting aspects to this graph.  Much of the error seems to be driven by a few games.  For example, the peak at around 225 games is driven largely by two matchups: Georgetown vs. NC Greensboro and Colorado State vs. SMU.  In both cases, the predictor has an unrealistic estimate of the strength of one or both of the teams.  So it might be that we could greatly improve prediction by identifying those sorts of games and applying some correction.  A possible topic for another day.)
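For the record, here's how the moving-window error is computed -- a Python sketch, with the same (arbitrary) twenty-game window as the plot:

```python
import math

def rolling_rmse(errors, window=20):
    """RMSE of the prediction errors over a sliding window of games."""
    return [math.sqrt(sum(e * e for e in errors[i:i + window]) / window)
            for i in range(len(errors) - window + 1)]
```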

A logarithmic regression suggests that much of the error is eliminated during the first 500 games:

If nothing else, this plot suggests that even with no other measures, our predictions should be pretty good after about the 500th game.  Now let's take a look at a similar plot for predictions where the teams have been primed with the earlier season's games:

Huh!  The use of the previous season's games pins the predictive performance to about 12 RMSE.  It's easy to understand why.  The previous season's performance has decent predictive power -- certainly better than no data at all -- but it swamps the current season's performance, preventing the predictor from improving.  Even by the end of the 1000 game period, most teams have only played 5 or 6 games.  The previous season's 30+ games simply outweigh this season's games too much to let the performance improve.

We can plot the two trendlines to see where it stops paying off to use the primed data predictions:

The cutoff is around 800 games (if we include the first game for every team).  We can combine these two into a predictor that gradually switches over from one predictor to the other over the first 800 games.  That predicts games with about the same error rate as using the previous season's data -- the last 200 games are predicted better, but not enough to substantially move the average.
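Here's a sketch of that switch-over predictor in Python.  The linear ramp is my assumption about the simplest way to blend the two predictions; the actual weighting could certainly take other forms:

```python
def blended_prediction(primed_pred, current_pred, games_played, switchover=800):
    """Ramp linearly from the primed (previous-season) prediction to the
    current-season prediction over the first `switchover` games of the season."""
    w = min(games_played / switchover, 1.0)
    return (1 - w) * primed_pred + w * current_pred
```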

More to come.

(Incidentally, this is the 100th blog posting!)

Thursday, October 18, 2012

Early Season Predictions, Part 2

As mentioned previously, I'm using this time before the college basketball season gets going to think about how to predict early season games.  In the early season, we're missing two elements needed for good predictions:  (1) a meaningful statistical description of each team, and (2) a model that uses those statistics to predict game outcomes.  By the end of the season we have both things -- a good statistical characterization of each team as well as a model that has been trained on the season's outcomes.  So how do we replace those two elements in the early season?

Replacing the model turns out to be fairly easy, because the factors that determine whether teams win or lose don't change drastically from season to season.  When you try to predict the tournament games at the end of the season, a model trained on the previous season's games does nearly as well as a model trained on the current season's games.  Of course, if the current year happens to be the year when the NCAA introduces the 3 point shot, all bets are off.  Still, in my testing the best performing models are the ones trained on several previous years of data.  So in the early season we can expect the model from the previous season to perform well.

(You might argue that early season predictions could be more accurate with a model specifically trained for early season games.  There's some merit to this argument and I may look at this in the future.)

Replacing the team data is not so easy.  The problem here is that teams have played so few games (none at all for the first game of the season) that we don't have an accurate characterization of their strengths and weaknesses.  Even worse, many of the comparative statistics (like RPI) rely on teams having the same opponents to determine the relative strength of teams.  In the early season, the teams don't "connect up" and in some cases, play few or no strong opponents.  So how bad is it?  I tested it on games from the 2011-2012 season:

  Predictor                  % Correct    MOV Error
  Late Season Prediction     72.3%        11.10
  Early Season Prediction    71.3%        15.06

So, pretty bad.  It adds 4 points of error to our predictions.  Since we've been groveling to pick up a tenth of a point here and there, that's a lot!

The obvious proxy for the team data is to use the team data from the previous season.   Clearly this has problems -- in college basketball team performance is highly variable season to season -- but it's at least worth examining to see whether it does improve performance.  In this experiment, I used the entire previous season's data to "prime the pump" for the next season.  In effect, I treated the early season games as if they were being played by the previous year's team at the end of the previous season.  Here are the results:

  Predictor                           % Correct    MOV Error
  Early Season                        71.3%        15.06
  Early Season (w/ previous season)   75.5%        12.18

A fairly significant improvement.  Is there anything we can do to improve the previous season's data as a proxy for this season?  We'll investigate some possibilities next time.

Thursday, October 11, 2012

2012-2013 Schedule Now Available

A quick FYI for the other college basketball predictors out there:  Yahoo Sports has now posted the schedule of upcoming games for next year.  As reported last time, the scores from past seasons remain broken, but at least the upcoming games are now available.

I've updated the Data page to include links to the 2012-2013 schedule as well as the current conference affiliations.

Tuesday, September 18, 2012

Awakening From the Long Summer's Sleep

College basketball fans hibernate in the summer.

I'm slowly awakening from my March Madness-induced stupor and starting to prepare for the new season.

One of the first tasks is to look at conference realignments.  My predictors don't actually use conferences for anything -- I keep thinking that conference games will have more predictive power than non-conference games, or vice versa, but to date neither has proven to be true.  Nonetheless, I keep track of the conference affiliations of teams, so every Fall I have to update that data for the various conference movements.

I took my summary of the changes from "Blogging the Bracket" here.  If there's any interest in the compiled data, please let me know.  I've noticed that there's been little interest in the data files I provided last year, so I won't bother unless someone expresses some interest.

The next task is to scrape the schedule of games for the season.  In past seasons, I've scraped the schedule from Yahoo Sports.  Unfortunately, it appears that they have "updated" their interface and broken everything.  No scheduled games appear at all, and the majority of the tournament games from last year are missing as well.


Hopefully this is just a temporary situation while Yahoo Sports gets their bugs fixed and the data loaded.  Alternate sources of this data are not easy to find.  ESPN and CBS are still showing last year's games.  The NCAA website started carrying game results (and box scores) last season, but doesn't seem to have the upcoming games.

In the meantime, I've been thinking about how to predict early season games.  These games are difficult to predict because we do not have any history of past performance for this year's teams.  So we're forced to base our predictions on other data -- or to not predict early season games (which is what I've done in past seasons).  Some alternate data is only available for some of the teams (e.g., the AP preseason rankings) or is entirely subjective, which makes it less useful from my viewpoint.

One source of objective data for all the teams is their previous season's performance.  One approach to predicting the early season games is to assume that teams will be just as strong this year as they were last year.  Another approach might be to assume that teams will regress towards the mean -- the best teams from last year will get a little weaker and the worst teams will get a little stronger.  We could also look at team data such as the number of graduating seniors and use that information to modify the previous year's performance -- e.g., a team that lost most of its starting minutes would get weaker.  An intriguing idea is to see if we can predict the change in performance for a team from season to season (based upon what factors?) and then use that to modify the previous year's performance.
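The regression-towards-the-mean idea is easy to sketch.  Note that the shrinkage factor here is a placeholder to be tuned against real data, not a tested value:

```python
def shrink_toward_mean(prev_rating, league_mean, shrinkage=0.25):
    """Move a team's previous-season rating partway toward the league mean:
    strong teams get a little weaker, weak teams a little stronger.
    The shrinkage factor is an untuned placeholder."""
    return prev_rating + shrinkage * (league_mean - prev_rating)
```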

As time permits, I will set up to test some of these ideas and report my findings.

Thursday, March 22, 2012

Yet Another Look at Upsets

What does it mean to call a tournament game an upset?

At the simplest level, it means a lower-seeded team beating a higher-seeded team.  This can happen for two reasons.  First, the committee may have "blown" the seedings -- as they arguably did with Texas / Cincinnati and Purdue / St. Mary's this year, two games that most of the machine predictors thought would be upsets.  Second, an upset can happen when the weaker team plays well and/or the better team plays poorly.  College basketball teams don't play at their mean performance every game.  Some games are better and some are worse, and this can lead to an unexpected result.  This understanding suggests that upsets may be more likely when two inconsistent ("volatile") teams meet.

Imagine two hypothetical teams that played the same schedule.  Team A averaged 84 points per game and scored between 81 and 88 points every game.  Team B also averaged 84 points per game, but scored between 28 and 96 points.  Now both these teams play Team C, which averaged 70 points per game against the same competition.  Which team is Team C more likely to beat?  It seems reasonable to guess Team B.

So how can we identify these "volatile" teams?  The obvious method is to measure something like the standard deviation of a team's performance over the course of the season.  But we have to be careful in how we do this.  For example, measuring the standard deviation of points scored might be very misleading because of pace issues.

Fortunately for me, I already have a good measure of team performance that includes standard deviation: TrueSkill.  This probably isn't a perfect proxy for measuring a team's consistency, but it's certainly good enough for a quick investigation into the merits of predicting upsets by looking at consistency.  (It's easier to think of this measure as volatility rather than consistency, so that the higher values mean more volatility.)

I took all of this year's first round games and ranked them according to the combined volatility of the two teams involved and then identified the most volatile game at each seed differential to see how well this predicted upsets:

Seeding   Most Volatile Game by Seed Differential   Upset?
8-9       Kansas St. - Southern Miss                N
7-10      St. Mary's - Purdue                       Y
6-11      Murray St. - CSU                          N
5-12      Vanderbilt - Harvard                      N
4-13      Wisconsin - Montana                       N
3-14      Marquette - Iona                          N
2-15      Missouri - Norfolk St.                    Y
1-16      Syracuse - NC Asheville                   N

This seems mildly promising.  It identifies two upsets correctly, including the Missouri-Norfolk St. upset.  This is particularly interesting because that upset was not on anyone's radar.   Most of the other games are at least "reasonable" choices for upsets in their seedings.  (It also identifies CSU over Murray St, which may explain this pick by AJ's Madness in the Machine Madness contest.)

One problem with this approach is that seeding is a rather broad measure of team strength.  For example, Duke was by far the weakest of the #2 seeds.  It might be productive to use a more accurate measure of the strength difference between the teams.  We can use the mean TrueSkill measure for each team to do that, ranking games according to the sum of the standard deviations divided by the difference of the means.  That results in this table:

Seeding   Most Volatile Game by Strength Differential   Upset?
8-9       Creighton - Alabama                           N*
7-10      St. Mary's - Purdue                           Y
6-11      SDSU - NC State                               Y
5-12      Temple - USF                                  Y
4-13      Michigan - Ohio                               Y
3-14      Georgetown - Belmont                          N
2-15      Duke - Lehigh                                 Y
1-16      North Carolina - Lamar                        N
* One point win for Creighton
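For concreteness, the ranking metric used here can be sketched in Python (the guard against near-equal means is my addition):

```python
def upset_metric(mu_a, sigma_a, mu_b, sigma_b):
    """Upset potential of a game: the sum of the two teams' TrueSkill standard
    deviations divided by the difference of their means.  Higher values mean a
    closer, more volatile matchup."""
    diff = abs(mu_a - mu_b)
    return (sigma_a + sigma_b) / diff if diff > 1e-9 else float("inf")
```

Ranking games by this value in descending order puts the most upset-prone matchups first.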

This works remarkably well for this year's first round -- especially considering that there were no upsets in the 3-14 or 1-16 matchups.  Of course, identifying the most likely upset at a particular seeding isn't quite the same as identifying the most likely upsets across the whole bracket, so let's look at the top 8 upsets predicted by this metric across the entire first round:

Seeding   Most Volatile Games Overall   Upset?
5-12      Temple - USF                  Y
6-11      SDSU - NC State               Y
7-10      Notre Dame - Xavier           Y
7-10      St. Mary's - Purdue           Y
8-9       Creighton - Alabama           N*
7-10      Florida - Virginia            N
6-11      Cincinnati - Texas            N
8-9       Memphis - St. Louis
* One point win for Creighton

Again, this is pretty good performance -- 75% correct in the first four picks and 50% correct in the first eight.

To a certain extent, a good predictor is going to capture some of this anyway (the Pain Machine identified the three correct upsets in the first four picks), but looking at the volatility of team performance may be good additional information in predicting tournament upsets.

Wednesday, March 21, 2012

Machine Madness Upsets

This posting originally appeared over at This Number Crunching Life.

In a previous posting I took a closer look at how the Pain Machine predicts upsets in the tournament and how effective it was this year.  I thought it might also be interesting to look at how the top competitors in the Machine Madness contest predicted upsets.  I put together the following table with the competitors across the top and an X in every cell where they predicted an upset.  Boxes are green for correct predictions and red for incorrect predictions.  The final rows in the table show the scores and possible scores for each competitor.

Game    Pain Machine    Predict the Madness    Sentinel    Danny's    AJ's Madness    Matrix Factorizer
Texas over Cincy X X X X X
Texas over FSU X X

WVU over Gonzaga X X

Purdue over St. Mary's X X X
NC State over SDSU X

South Florida over Temple X

New Mexico over Louisville X

Virginia over Florida


Colorado State over Murray State

Vandy over Wisconsin

Wichita State over Indiana

Murray State over Marquette

Upset Prediction Rate 43% 25% 33% 0% 25% 29%
Current Score 42 43 42 41 41 39
Possible Points 166 155 166 161 137 163

(I'm not counting #9 over #8 as an upset.  That's why Danny has only 41 points; he predicted a #9 over #8 upset that did not happen.)

So what do you think?

One thing that jumps out immediately is that the competitors predicted many more upsets this year than in past years.  Historically we'd expect around 7-8 upsets in the first two rounds.  Last year the average number of upsets was about 2 (discounting the Pain Machine and LRMC).  The Pain Machine is forced to predict this many, but this year the Matrix Factorizer also predicts 7, and Predict the Madness and AJ's Madness predict 4.  From what I can glean from the model descriptions, none of these models (other than the Pain Machine) force a certain level of upsets.

Monte's model ("Predict the Madness") seems to use only statistical inputs, without any strength or strength-of-competition measures.  This sort of model will value statistics over strength of schedule, so you might see it making upset picks that don't agree with the team strengths (as proxied by seeds).

The Sentinel uses a Monte Carlo type method to predict games, so rather than always producing the most likely result, it is only most likely to produce the most likely result.  (If that makes sense :-)  The model can be tweaked by choosing how long to run the Monte Carlo simulation.  With a setting of 50 it seems to produce about half the expected number of upsets.

Danny's Dangerous Picks are anything but -- his is by far the most conservative of the competitors.  The pick of Murray State over Marquette suggests that Danny's asymmetric loss function component might have led to his model undervaluing strength of schedule.

AJ's Madness model seems to employ a number of hand-tuned weights for different components of the prediction formula.  That may account for the predicted upsets, including the somewhat surprising CSU over Murray State prediction.

The Matrix Factorizer has two features that might lead to a high upset rate.  First, there's an asymmetric reward for getting a correct pick, which might skew towards upsets.  Secondly, Jasper optimized his model parameters based upon the results of previous tournaments, so that presumably built in a bias towards making some upset picks.

What's interesting about the actual upsets?

First, Texas over Cincy and Purdue over St. Mary's were consensus picks (excepting Danny's Conservative Picks).   This suggests that these teams really were mis-seeded.  Purdue vs. St. Mary's is the classic trap seeding problem for humans -- St. Mary's has a much better record, but faced much weaker competition.  Texas came very close to beating Cincinnati -- they shot 16% in the first half and still tied the game up late -- which would have made the predictors 2-0 on consensus picks.

Second, the predictors agreed on few of the other picks.  Three predictors liked WVU over Gonzaga, and the Pain Machine and the Matrix Factorizer agreed on two other games.  Murray State over Marquette is an interesting pick -- another classic trap pick for a predictor that undervalues strength of schedule -- and both Danny's predictor and the Matrix Factorizer "fell" for this pick.

So how did the predictors do?

The Pain Machine was by far the best, getting 43% of its upset predictions correct.  Sentinel was next at 33%.  Perhaps not coincidentally, these two predictors have the most possible points remaining.

In terms of scoring, the Baseline is ahead of all the predictors, so none came out ahead (so far) due to their predictions.  The PM and Sentinel do have a slight edge in possible points remaining over the Baseline.

So who will win?

The contest winner will probably come down to predicting the final game correctly.  There's a more interesting spread of champion predictions than I expected -- particularly given the statistical dominance of Kentucky.

If Kentucky wins, the likely winner will be the Baseline or Danny.  If Kansas wins, the Pain Machine will likely win unless Wisconsin makes it to the Final Four, in which case AJ should win.  If Michigan State wins, then the Sentinel will likely win.  And finally, if Ohio State wins, then Predict the Madness should win.

Upset Review

For the past three years that the Pain Machine has participated in the Machine Madness contest, I've maintained (without any real justification) that the proper strategy is to pick the correct upsets -- as opposed to simply picking the most likely outcome, which will be the higher seed in every case where the committee hasn't completely blown the seeding.  In light of that, I wanted to review the PM's upset-picking strategy and see how it has worked out this year.

The PM predicts the Margin of Victory (MOV) for each tournament game.  With two exceptions this year, the predicted winner was the higher-seeded team.  Historically, the upset rate in the first round has been around 22%, and the upset rate for the whole tournament around 15%.  (An upset occurs when a team seeded at least 2 lower than its opponent wins the game; a #9 beating a #8 is not considered an upset.)  To match these rates, I force the PM's tournament picks to include 6 upsets in the first round and 5 more in the rest of the tournament.
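The upset definition above can be written as a small predicate (a sketch; the function name is mine, and seeds follow the usual bracket convention where a lower number is a better seed):

```python
def is_upset(winner_seed: int, loser_seed: int) -> bool:
    """An upset: the winner is seeded at least 2 lower (numerically
    higher) than the team it beat.  A #9 beating a #8 does not count."""
    return winner_seed - loser_seed >= 2

# (11) Texas beating (6) Cincinnati would be an upset; (9) over (8) is not.
print(is_upset(11, 6))  # True
print(is_upset(9, 8))   # False
```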

The picking strategy is fairly straightforward.  First, any games where the PM actually predicts an upset are marked as upsets.  Then, to reach the quota of 6, the PM marks as upsets the remaining first-round games with the lowest predicted MOVs, and (after recalculating the rest of the bracket based upon those upsets) fills out the quota of 5 in the later rounds by the same criterion.
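That selection rule can be sketched in a few lines.  This is my paraphrase, not the PM's actual code: game tuples and function names are hypothetical, and the later-round pass (which requires recalculating the bracket after each upset) is omitted.

```python
# Each game: (favorite_seed, underdog_seed, predicted_mov), where
# predicted_mov is the favorite's predicted margin of victory.
# A negative MOV means the model itself expects the underdog to win.

def pick_upsets(games, quota):
    """Choose which games to mark as upsets.

    1. Any game the model already calls for the underdog (MOV < 0)
       is automatically an upset pick.
    2. Remaining slots up to `quota` go to the games with the lowest
       predicted MOVs -- the closest expected contests.
    """
    forced = [g for g in games if g[2] < 0]
    rest = sorted((g for g in games if g[2] >= 0), key=lambda g: g[2])
    return forced + rest[:max(0, quota - len(forced))]

# The 2012 first-round candidates from the lists below, plus one game
# above the cut line for illustration:
first_round = [
    (6, 11, -0.6),   # Cincinnati vs. Texas: model likes the underdog
    (5, 12, 1.4),    # Temple vs. Cal/USF
    (6, 11, 1.9),    # SDSU vs. NC State
    (7, 10, 3.3),    # St. Mary's vs. Purdue
    (7, 10, 3.3),    # Gonzaga vs. WVU
    (8, 9, 3.6),     # Iowa St. vs. UConn
    (4, 13, 7.9),    # Michigan vs. Ohio -- above the cut, not picked
]
print(pick_upsets(first_round, 6))
```

With a quota of 6, the Texas game is forced in by its negative MOV and the next five closest games fill the remaining slots, leaving Michigan/Ohio unpicked.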

This year, that resulted in the following upset picks (predicted MOV shown in parentheses) for the first round:

(11) Texas over (6) Cincinnati (-0.6)
(12) Cal/USF over (5) Temple (1.4)
(11) NC State over (6) SDSU (1.9)
(10) Purdue over (7) St. Mary's (3.3)
(10) WVU over (7) Gonzaga (3.3)
(9) UConn over (8) Iowa St. (3.6)

The PM picked 3 of these 6 upsets correctly: USF, NC State and Purdue.  Texas shot just 16% in the first half and still managed to tie the game in the second half but couldn't finish the rally.  The other two games were not very close.  Still, getting 50% correct on upsets is probably pretty good performance.

The PM has the following upsets picked in later rounds:

(2) OSU over (1) Syracuse (-0.8)
(2) Kansas over (1) Kentucky (0.6)
(11) Texas over (3) FSU (2)
(5) New Mexico over (4) Louisville (2.7)
(6) Baylor over (2) Duke (3)

The FSU and Duke upsets cannot happen.  The New Mexico upset did not happen.  The other two games have not yet occurred.

We can also look at, say, the most likely upset at each seed position.  These were:

(16) UNC-Asheville vs. (1) Syracuse (16.1)
(15) Lehigh vs. (2) Duke (12.8)
(14) Belmont vs. (3) Georgetown (6.3)
(13) Ohio vs. (4) Michigan (7.9)
(12) Cal/USF over (5) Temple (1.4)
(11) Texas over (6) Cincinnati (-0.6)
(10) Purdue over (7) St. Mary's (3.3)
(9) UConn over (8) Iowa State (3.6)

Again, the PM got 50% correct.

Of course, the PM also missed a number of upsets:

(12) VCU over (5) Wichita St. (9.6)
(10) Xavier over (7) Notre Dame (7.7)
(15) Norfolk St. over (2) Missouri (23.2)
(11) NC State over (3) Georgetown (5.4)

The Norfolk State win really stands out here as the outlier -- it was at least twice as unlikely as the Duke-Lehigh upset.  I don't have the statistic handy, but 23-point upsets have to be rarer than 1 in 1,000 historically.  (The beating Norfolk St. took in the next round is indicative of how anomalous the first-round upset was.)  VCU was a darling upset pick for many, in part due to their Cinderella status last year.  This year's VCU team was considerably weaker, and the win over Wichita State was another very unlikely result.  The Georgetown upset was the least surprising; the 5-point differential is well within the ~10-point error margin of the PM's predictions.
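As a rough sanity check on these likelihoods: if the PM's prediction errors were approximately normal with the ~10-point error margin as a standard deviation, the chance of an upset in a game with predicted MOV m would be about Φ(−m/σ).  This is an illustrative assumption of mine, not the PM's actual error model:

```python
from math import erf, sqrt

def upset_prob(predicted_mov, sigma=10.0):
    """P(actual margin < 0) under a normal error model with
    standard deviation sigma (an illustrative assumption)."""
    return 0.5 * (1 + erf((0 - predicted_mov) / (sigma * sqrt(2))))

# Georgetown over NC State, predicted MOV 5.4:
print(round(upset_prob(5.4), 2))   # ~0.29 -- hardly surprising
# Missouri over Norfolk St., predicted MOV 23.2:
print(round(upset_prob(23.2), 3))  # ~0.01 under this crude model
```

Note that the normal model gives the Norfolk State upset about a 1% chance, while the historical rate for 23-point upsets looks more like 1 in 1,000 -- a hint that real MOV errors are thinner-tailed than this crude normal sketch assumes.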

Overall, I give the PM a very positive grade for its upset picks.  It's clearly able to identify games where upsets are likely.  I may have to work on how it selects upsets, though.  There isn't a strong correlation between the magnitude of the predicted MOV and the likelihood of an upset when the MOV is under about 6 points, so it may not make sense to pick the games with the lowest MOVs.  It may make more sense to pick upsets based upon other factors.