Tuesday, January 29, 2013

Data Cleansing

Generally I gather my data from ESPN and it's mostly consistent.  They do a pretty good job keeping the data clean.  But I've also gathered data from Yahoo and other sources (such as betting lines).  And any attempt to merge the data from different sources is an adventure in reconciliation.

There's no standard for reporting neutral site games, so sometimes one team is the home team, and sometimes the other team is the home team.

Team names are reported any which way.   Teams like "California State Fullerton Titans" are a nightmare -- CSU Fullerton, CSU-Fullerton, CSU Fullerton Titans, Cal State Fullerton, Cal St Fullerton, Cal St. Fullerton, Cal. State Fullerton, etc., etc.  Surprisingly, the only outright confusion is SDSU, which usually means South Dakota State University, but sometimes means San Diego State University.
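One way to tame the name problem is an alias table keyed on a normalized spelling, with the ambiguous abbreviations resolved per source.  This is only a minimal sketch: the canonical names, the `feed_a`/`feed_b` source labels, and the assignment of SDSU to each feed are all hypothetical placeholders, not my actual tables.

```python
import re

# Hypothetical alias table: normalized spelling -> one canonical name.
ALIASES = {
    "csu fullerton": "Cal State Fullerton",
    "csu fullerton titans": "Cal State Fullerton",
    "cal state fullerton": "Cal State Fullerton",
    "cal st fullerton": "Cal State Fullerton",
    "california state fullerton titans": "Cal State Fullerton",
}

# "SDSU" means different schools in different feeds, so it is resolved
# per (source, spelling) pair.  The feed names here are made up.
AMBIGUOUS = {
    ("feed_a", "sdsu"): "South Dakota State",
    ("feed_b", "sdsu"): "San Diego State",
}


def _key(raw):
    """Lowercase, drop periods, turn hyphens into spaces, collapse whitespace."""
    s = raw.lower().replace("-", " ").replace(".", "")
    return re.sub(r"\s+", " ", s).strip()


def canonical_team(raw, source):
    """Map a raw team name from a given source to its canonical name."""
    key = _key(raw)
    if (source, key) in AMBIGUOUS:
        return AMBIGUOUS[(source, key)]
    if key in ALIASES:
        return ALIASES[key]
    # Fail loudly so a new spelling gets added to the table by hand.
    raise KeyError(f"unknown team name {raw!r} from {source!r}")
```

Raising on an unknown spelling is deliberate: a silent pass-through would quietly create a "new" team the first time a feed invents another variant.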

Game times are typically reported in Eastern Time.  Which is fine except for Hawaii, whose scores are sometimes reported on the day the game started, and sometimes on the day it ended.  The Alaskan Shootout is a problem, too.  According to ESPN, Oral Roberts played both Loyola-Marymount and Charlotte on 11/22 this year.  Talk about a tough schedule -- fire the AD!

A couple of games were cancelled this year (notably the 11/9 weather games), but a game between BYU and Utah State was postponed when a Utah State player dropped dead on the court during practice.  Luckily the athletic trainer had a defibrillator and restarted the player's heart, saving his life.

The non-D1 teams that D1 schools find to play are endlessly fascinating.  Last week Houston Baptist played Ecclesia College, which is so small it barely has a Wikipedia page.  I can't even tell how many students are enrolled there from its website.  The most popular non-D1 opponents this year are San Diego Christian (8 games against D1 opponents, lost every one) and Rochester (6 games, won against Eastern Illinois).   MIT seems to be Harvard's opening opponent every year (?).  LeTourneau University was founded by a guy who made his fortune inventing earthmoving equipment.

Monday, January 28, 2013

Halftime Scores

Suppose that Maryland plays Duke in Cameron.  Maryland is ahead by 8 points at halftime, but ends up losing by 2 points.  The next week, Wake Forest plays Duke in Cameron.  They're down 1 point at halftime, and also end up losing by 2 points.

On the basis of those games, which is the better team, Maryland or Wake Forest?

There are reasonable arguments for several viewpoints.  Certainly you might argue that there's no reason to think either is better -- they both ended up at the same place and the halftime score is immaterial.  You could also argue that Maryland is better -- they outplayed Duke, at least for a half, which is more than Wake Forest managed.  Or you could argue that Wake Forest is better -- they were within a small margin of error of beating Duke in both halves, which is more than Maryland can say.

Or consider the case where Maryland wins the first half by 8 and loses the second by 10, while Wake Forest loses the first half by 10 and wins the second half by 8.  Is that evidence that either team is better?

In reading papers on rating systems over the past few years, I've noticed that many authors devise a rating system to reflect their personal belief on questions like this one.  I wouldn't be surprised at all to read a paper that said (in effect) "Based upon the halftime score, Maryland is clearly the better team, and here's a rating system that reflects that."  So we have rating systems that discount blowouts, and rating systems that emphasize non-conference road wins, and so on.

As long-time readers of this blog know, my own outlook is different.  What I believe is important or unimportant isn't, well, important.  What counts is whether something improves predictive performance.  So when I started collecting scoring by halves, my purpose was to see how that could be best used to improve prediction.

The first thing I did was to create some baseline statistics based on the scoring by halves, such as a team's average score in the first half, in the second half, average score of opponents in the first half, average MOV by half, and so on.  I didn't expect these statistics to have much predictive value.  For one thing, it seems clear that the strength of the opposing team is an important factor in understanding a team's performance, and none of these baseline statistics reflect that strength of schedule.

Still, I believe in testing over assumptions, and testing revealed at least one statistic that did have some predictive value: the ratio of a team's scoring in the first half to its scoring in the second half.  As I hinted here, there's some correlation between team strength and the ratio of scoring by halves.  Good teams generally have high ratios -- that is, they do more of their scoring in the first half than the average team does.  That's a pretty intriguing result.  Some work by Monte McNair shows that teams generally improve their offensive efficiency as the game progresses, so it may be that good teams play more efficiently from the start of the game.  There are probably more interesting results to be had by analyzing and understanding this one.
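For concreteness, here is a sketch of how that ratio can be computed for one team.  The function name and the sample scores are made up for illustration, and I'm assuming the ratio is taken over season totals rather than averaged game by game.

```python
def half_scoring_ratio(games):
    """Ratio of first-half points to second-half points over a set of games.

    games: list of (first_half_points, second_half_points) for one team.
    A value above 1.0 means the team scores more before halftime.
    """
    first = sum(g[0] for g in games)
    second = sum(g[1] for g in games)
    if second == 0:
        raise ValueError("no second-half points recorded")
    return first / second


# Hypothetical team: 35 + 28 first-half points, 30 + 40 second-half points.
sample = [(35, 30), (28, 40)]
ratio = half_scoring_ratio(sample)  # 63 / 70
```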

After testing the baseline statistics, I turned my attention to using the scoring by halves with strength measures like RPI, Trueskill and so on.  These measures try to assign a team a single numeric strength value based upon game outcomes.  I wanted to try to extend the measures to include the scoring by halves information and see whether that improved the predictive value of the measures.

There are several ways to go about this, but one straightforward approach is to treat each half like another separate game.  So in the case of Maryland above, we'd calculate our measure as if Maryland had played Duke three times -- winning once by 8, losing once by 10, and losing once by 2.  We can also try variants, such as using only the first halves of games.  So we can calculate all the variants and test to see which one has the best predictive value.
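The "each half is another game" idea can be sketched as a small expansion step that runs before the rating calculation.  The names here are hypothetical, and I'm assuming the second-half margin is simply the final margin minus the halftime margin (so any overtime gets lumped into the second half).

```python
from typing import List, NamedTuple, Sequence


class Result(NamedTuple):
    team: str
    opponent: str
    margin: int  # points, from `team`'s point of view


def expand_by_halves(
    team: str,
    opponent: str,
    halftime_margin: int,
    final_margin: int,
    include: Sequence[str] = ("first", "second", "game"),
) -> List[Result]:
    """Turn one game into up to three pseudo-games for a rating system."""
    parts = {
        "first": halftime_margin,
        "second": final_margin - halftime_margin,
        "game": final_margin,
    }
    return [Result(team, opponent, parts[name]) for name in include]


# Maryland leads Duke by 8 at the half and loses by 2.
all_three = expand_by_halves("Maryland", "Duke", 8, -2)
first_only = expand_by_halves("Maryland", "Duke", 8, -2, include=("first",))
```

The `include` argument is what makes the variants cheap to test: first half only, second half only, all three, or the two halves without the game score.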

I've initially applied this approach to Trueskill.  To begin with, I measured the performance of the baseline Trueskill-MOV metric on my current test set.  This is currently the best single predictive measure in the Prediction Machine's metrics.

Predictor                 % Correct    MOV Error
Baseline Trueskill-MOV    72.7%        11.59

The first tests were to calculate the metric based on just the halves individually, and then using all three results.

Predictor                 % Correct    MOV Error
Baseline Trueskill-MOV    73.6%        11.59
First Half Only           72.1%        11.97
Second Half Only          70.9%        12.69
All Three                 73.4%        11.55

There are a couple of things to note here.  As you might guess, neither half by itself is better than using the game score.  More surprising is that performance in the first half is much better for prediction than performance in the second half.  (To go back to our second example above, this is reason to believe that Maryland is the better team than Wake Forest.)  And using all three together is marginally better (at least in MOV Error) than using just the final score.

So far this treated each half like a separate game.  But one could argue that a Margin of Victory of 4 in a half is the equivalent of an MOV of 8 in a whole game.  We can test this by applying various modifiers to the scores and how they are bonused in the algorithm.  The best results I could find were these:

Predictor                 % Correct    MOV Error
Baseline Trueskill-MOV    73.6%        11.59
Best First Half Only      73.2%        11.78
Best Second Half Only     72.2%        12.25
All Three                 73.4%        11.51

With tweaking, all of the variants could be improved somewhat.  Using all three gave about a 1/10 of a point improvement in MOV Error over the baseline.

Another possibility is to use the two half scores and ignore the game score.  With some tweaking to count the first half about twice as much as the second half, this turns out to be very effective:

Predictor                 % Correct    MOV Error
Baseline Trueskill-MOV    73.6%        11.59
Only Half Scores          74.3%        11.46

I find this a pretty surprising result.  Getting a better strength metric by ignoring the game outcomes is non-intuitive (to say the least) and goes against the typical sports punditry about how winning is the only thing that matters.

This metric is the single best metric in the PM's arsenal, and was used to generate the PM's Top Twenty.

Saturday, January 26, 2013

PM Top Twenty (1/26)

Here's the Prediction Machine's current top twenty:

 5   Ohio St.        111
19   Oklahoma St.    89.2
20   Kansas St.      88.3

This is based upon the PM's single best prediction metric.  Florida has a big lead on the field, and there's an even more significant break between Michigan (#4) and tOSU (#5). 

I'm going to try to post this every Saturday.

Thursday, January 24, 2013

Oddball Statistic

I've recently been modifying my data collection to include the score by halves, and in looking at the data, I discovered this oddball statistic: Only 11 NCAA teams average more points in the first half than in the second half:

Loyola (MD)  (#2 MAAC)
Charleston Sou. (#1 Big South, South)
New Mexico St. (#3 WAC)
Indiana (#2 Big Ten)
La Salle (#5 Atlantic Ten)
Oregon (#1 Pac-12)
Oregon St. (#12 Pac-12)
Holy Cross (#3 Patriot League)
Kansas (#1 Big 12)
Lehigh (#1 Patriot League)
Saint Louis (#10 Atlantic Ten)

With the notable exceptions of Oregon State and St. Louis, these are all good to excellent teams.  What's the connection?  Why do almost all teams score more in the second half than in the first half, and why would the few that don't be generally better than average?

Tuesday, January 22, 2013

Team to Team Variation in Predictions

The current version of the Prediction Machine averages about 11 points of error in predicting games, across all teams and all seasons.  I've speculated that there are some subsets of games where the error is significantly less -- for example, it might be the case that we can predict much more accurately when a good rebounding team plays a poor rebounding team.  However, my efforts to identify those subsets have been largely futile and there's some circumstantial evidence to suggest that no subsets exist -- primarily that a Support Vector Machine does no better than a Linear Regression at prediction.  (We would expect an SVM to do better in a data set with significant subsets.)

Last week I thought it would be interesting to look at what teams the PM has done the best at predicting this year and which ones the worst.  (For some reason, it's never occurred to me previously to look at this.)  So I gathered up all the predictions and results for this season and segmented them out by teams.  (Note: I'm only looking at the games after the first 1000 games of the season and not the last 100 in this sample.)

The overall best team is Idaho, which the PM has predicted with about 4.6 points of error.  The PM has gotten 5 of Idaho's games within 2 points.  It missed one game by 10 points, but that was by far the worst.

The overall worst team is Mississippi St with 18 points of error.  The PM missed games by 35, 29, 17, and 15 points.  So the overall range of predictions runs from less than 1/2 the average error to almost 2x the average.

I also took a look at the error for home games and away games separately.

Just looking at home games, the most predictable is TX Pan American (2.3 points error) and the worst Youngstown State (28.7 points error).  For away games only, the most predictable is Portland St (0.86 points error!) and the worst is Maryland (20.5 points error).   For Portland State's four away games, the PM was off by 0.7, 1.3, 0.4 and 0.7 points (!).  Maryland is a bit deceptive -- they only have two away games in the sample, and one was the Northwestern game which they were expected to lose by 9 and won by 20.
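The per-team segmentation is simple enough to sketch.  This assumes "error" means the absolute difference between predicted and actual margin, averaged per team, with every margin expressed from the home team's side; the function name and tuple layout are hypothetical.

```python
from collections import defaultdict


def error_by_team(games, side="both"):
    """Average absolute prediction error per team.

    games: iterable of (home, away, predicted_margin, actual_margin),
           with both margins from the home team's point of view.
    side:  "home", "away", or "both" -- which appearances to count.
    """
    errors = defaultdict(list)
    for home, away, pred, actual in games:
        err = abs(pred - actual)
        if side in ("home", "both"):
            errors[home].append(err)
        if side in ("away", "both"):
            errors[away].append(err)
    return {team: sum(e) / len(e) for team, e in errors.items()}


# Maryland at Northwestern: expected to lose by 9, won by 20.
games = [("Northwestern", "Maryland", 9, -20)]
away_errors = error_by_team(games, side="away")
```

With only a couple of away games in the sample, one miss of that size dominates a team's average, which is why the Maryland number is deceptive.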

Some of this is no doubt just random variation.  Just by chance the PM will get some teams' games close and some teams' far off.  That effect should diminish the more games we sample, so I took a look at the entire 2012 season.

The overall best team to predict in 2012 was Dartmouth, with 6.3 points of error on average.  The worst was New Orleans, with 19.3 points of average error.  Once again we see a range of roughly 1/2 to 2x the average error.  The best home team to predict was Indiana St, at 4.7 points of error, and the worst Longwood at 17.15 points of error.  The best and worst away teams were Gardner-Webb (4.14) and New Orleans (22).   So again we see the overall range of predictions runs about 1/2 the average error to about 2x the average error.

Another test we can do is to look at how many teams that were predictable in the first half of the season are also predictable in the second half of the season.  If the effect is random, we'd expect to see a random level of overlap.  For the 2012 season, if we look at the most predictable half of the teams in both the first part of the season and the second part of the season, there's almost exactly 50% overlap -- a strong indication that the effect is just random variation.
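The overlap test can be sketched as follows.  If predictability in the two parts of the season were unrelated, a team in the most predictable half early on would have about a 50% chance of landing there again, so an overlap near 0.5 looks like chance.  The function names and the toy error values are made up for illustration.

```python
def most_predictable_half(errors):
    """Return the half of the teams with the lowest average error."""
    ranked = sorted(sorted(errors), key=errors.get)  # name order breaks ties
    return set(ranked[: len(ranked) // 2])


def overlap_fraction(first_part, second_part):
    """Share of early-season 'predictable' teams that stay predictable.

    Both arguments map team name -> average error for that part of the
    season.  Only teams present in both parts are considered.
    """
    common = first_part.keys() & second_part.keys()
    a = most_predictable_half({t: first_part[t] for t in common})
    b = most_predictable_half({t: second_part[t] for t in common})
    if not a:
        raise ValueError("need at least two teams in common")
    return len(a & b) / len(a)


# Toy data: A and B are predictable early, A and C late.
early = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
late = {"A": 1.0, "B": 9.0, "C": 2.0, "D": 8.0}
fraction = overlap_fraction(early, late)
```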

The conclusion is that the error range on the PM's predictions for particular teams runs from about 1/2 the overall average to about 2 times the average, but that this variation is probably random.

Sunday, January 20, 2013

Here are some results from Saturday's predictions:

Away             Home               Pred    Line     Actual    Pred vs. Line
#6 Syracuse      #1 Louisville      6.8     6.5      -2        Lose
#4 Kansas        Texas              -10     -10      5         Push
#7 Arizona       ASU                -6      -7       17        Win
#8 Gonzaga       #13 Butler         -4      -3       2         Lose
#17 Missouri     #10 Florida        9       13       31        Lose
#11 OSU          #18 MSU            -1      2.5      3         Lose
#12 Creighton    WSU                1       2.5      3         Lose
#15 SDSU         Wyoming            5.5     -1.5     13        Win
Oklahoma         #16 KSU            6       6        9         Push
#21 Oregon       #24 UCLA           2
#22 VCU          Duquesne           -13     -14.5    -27       Lose
#25 Marquette    Cincy              4                3 (OT)
E Illinois       Austin Peay        0       5        -10       Win
E Kentucky       Jacksonville St    0
Miami (OH)       E Michigan         0       1.5      7         Lose

A couple of notes.

The PM's prediction is often very close to the line for the game.  There are few games where it differs by more than a few points.  In this set of games, the Florida game, the Wyoming game and the Austin Peay game stand out with more than 4 points difference with the line.  This suggests that Vegas uses very similar technology to help set the lines.

It's interesting to note that the three games the PM thought would be close weren't particularly.

The PM didn't do particularly well on this set of games, although it did win 2 out of 3 where it differed significantly from the line.  In the 3 games where it disagreed with the line about the winner, the PM also won 2 out of 3.  Neither result is significant, but the PM does slightly outperform the line in general.  The PM "bets" on games that meet certain criteria, including a significant point difference from the line.  Of these games, two met all the criteria:  the Austin Peay game and the Wyoming game, both of which the PM won.  Eight games total met all the criteria on Saturday, and the PM "won" five of those games.  For the year, the PM is 55% against the line for games that meet all the criteria.
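The Win/Lose/Push column can be reproduced with a few lines.  This is a sketch that takes the three numbers exactly as posted in the table above and treats them as being on the same scale; calling it a push when the actual margin lands exactly on the line is my assumption, since no game in this set did that.

```python
def grade(pred, line, actual):
    """Grade the PM's prediction against the line for one game."""
    if pred == line or actual == line:
        return "Push"  # no disagreement to bet on, or the result hit the line
    pm_above_line = pred > line        # which side of the line the PM took
    result_above_line = actual > line  # which side the game landed on
    return "Win" if pm_above_line == result_above_line else "Lose"


# (game, pred, line, actual) copied from Saturday's table.
rows = [
    ("Syracuse @ Louisville", 6.8, 6.5, -2),
    ("SDSU @ Wyoming", 5.5, -1.5, 13),
    ("Oklahoma @ KSU", 6, 6, 9),
]
for game, pred, line, actual in rows:
    print(game, grade(pred, line, actual))
```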

Saturday, January 19, 2013

Here are some predictions for today's games from the Prediction Machine (PM):

#6 Syracuse @ #1 Louisville:  Louisville by 6.

Syracuse is not good enough to beat Louisville at home.  The return game on 3/2 could be more interesting.

#4 Kansas @ Texas:  Kansas by 10.5

Texas is just not a good team this year.

#7 Arizona @ ASU:  Arizona by 6.

ASU has lost to all the decent teams they've played, but might be good enough to surprise someone one of these days.  Maybe Arizona?  Maybe UCLA?

#8 Gonzaga @ #13 Butler: Gonzaga by 4

Should be a good game.

#17 Missouri @ #10 Florida: Florida by 9

Florida is probably under-ranked right now.

#11 OSU @ #18 MSU:  OSU by 1

Basically a toss-up.  The Big Ten looks tight this year, so could end up being an important game.

#12 Creighton @ Wichita State:  Wichita State by 1

Creighton seems to be a dark-horse darling for a lot of analysts, but the PM isn't impressed (yet).

#15 SDSU @ Wyoming: Wyoming by 5.5

Wyoming screwed the PM the other night by losing by 13 on the road to Fresno State, but the PM is doubling down on a big home win.  We'll see.

Oklahoma @ #16 KSU: KSU by 6.
Rutgers @ #20 NDU: NDU by 9.5

Should be routine wins for both home teams.

#21 Oregon @ #24 UCLA: UCLA by 2

Basically a toss-up.

#22 VCU @ Duquesne: VCU by 13

VCU is another analyst darling and will not be challenged by Duquesne.

#25 Marquette @ Cincy: Cincy by 4

Unless a fight breaks out.

Other Interesting Games:

   Maryland @ UNC: UNC by 7.  (I would have thought this would be closer.)
   Kentucky @ Auburn: UK by 9.  (Surprised it's not more.)


  Nicholls @ Stephen F. Austin:  Austin by 39
  Portland @ St. Mary's: Portland by 13


  Eastern Illinois @ Austin Peay:  Austin by 0.06
  Eastern Kentucky @ Jacksonville St.: Jacksonville by 0.08
  Miami (OH) @ Eastern Michigan:  Eastern Michigan by 0.09

For some reason the Eastern schools are playing tight today.

Made some interesting progress this week on some aspects of the predictor, and I hope to report out on them soon.

Monday, January 14, 2013

Prediction Season

Please pardon the unintentionally long hiatus.  The holiday season, a trip, and some program problems have interfered with my regularly scheduled prophecy.

It takes about 1000 games before my predictor settles down to reliable predictions.  This usually occurs around the start of January.  Two problems interfered with that this year.  First, I started scraping game results from ESPN this year (Yahoo has gotten somewhat unreliable) and only discovered around the beginning of January that I wasn't getting all the games.  It took a few days to fix that problem.  Then I ran some tests and found my predictions were way off.  Tracking down the cause took another week or so.  The exact problem was complex, but it had to do with determining the length of games.  (For pace-adjusted statistics you need to know the number of minutes played.)  Once that was fixed the predictor started behaving as expected.
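To show why game length matters, here is the basic relationship only, not the actual bug or my actual code: a college game is 40 minutes of regulation plus 5 minutes per overtime, and a raw total has to be scaled back to 40 minutes before it can be compared across games.  The function names are hypothetical.

```python
REGULATION_MINUTES = 40  # two 20-minute halves
OVERTIME_MINUTES = 5     # each overtime period


def game_minutes(overtimes=0):
    """Total minutes played in a game with the given number of overtimes."""
    if overtimes < 0:
        raise ValueError("overtimes must be non-negative")
    return REGULATION_MINUTES + OVERTIME_MINUTES * overtimes


def per_40(value, overtimes=0):
    """Scale a raw game total (points, possessions, ...) to 40 minutes."""
    return value * REGULATION_MINUTES / game_minutes(overtimes)


# 90 points in a one-overtime game is an 80-point pace over 40 minutes.
scaled = per_40(90, overtimes=1)
```

Get the overtime count wrong and every overtime game looks like an offensive explosion, which is the kind of thing that throws the predictions way off.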

So here are some predictions for tonight's games, starting with the ranked teams:

  #1 Louisville (15-1) @ Connecticut (12-3): Louisville by 8
  Baylor (11-4) @ #4 Kansas (14-1) : Kansas by 8

Connecticut is probably under the radar right now -- they've got good wins over MSU and Notre Dame, and their loss to NC State looks better after NC State's win over Duke.

Most competitive games of the night:

   Elon (8-7) @ Western Carolina (7-9): Elon by .2
   Savannah St. (7-9) @ Morgan State (4-8):  Morgan State by .5

Blowouts of the night:

  Grambling St. (0-14) @ Texas Southern (4-13):  Texas Southern by 25
  Charleston (11-5) @ Citadel (3-11): Charleston by 13

It's amazing that a 4-13 team can be a 25 point favorite in a game, but Grambling is really bad and has a good chance of going winless for the season.

As usual, these predictions are provided for your reading pleasure only.  A full disclaimer can be found in the sidebar.