Monday, December 30, 2013

Top Twenty, Predictions (12/30)

Prediction Machine's Top Twenty

1 Oklahoma St. 35.8
2 Louisville 34.0
3 Arizona 33.1
4 Arkansas 32.7
5 Ohio St. 32.7
6 Iowa St. 32.6
7 Iowa 32.4
8 Kentucky 32.2
9 Villanova 31.9
10 Arizona St. 31.9
11 Duke 31.7
12 Creighton 31.6
13 Michigan 31.4
14 Syracuse 31.1
15 Kansas 31.0
16 Pittsburgh 30.8
17 Colorado 30.8
18 Oregon 30.8
19 Cincinnati 30.7
20 Wisconsin 30.6

The big mover this week was Syracuse, jumping up 6 spots to #14 on the strength of a solid win against Villanova -- who dropped 5 spots.  Arkansas and tOSU both moved up 3 spots.  I'm not sure why in either case, although tOSU did have a win over ND last week.  Oregon dropped 3 spots after squeaking out an OT win over BYU at home.


Not a lot of interesting games this week, although there are 3 Top 25 vs. Top 25 matchups next Sunday for some reason.

#3 tOSU vs. Purdue:  tOSU by 12

Home-court advantage won't be enough for the Boilermakers

Indiana vs. Illinois:  Indiana by 1

In the running for squeaker of the week.  If Indiana wins at Illinois and then upsets MSU later in the week, they'll probably be ranked next week.

St. Mary's vs. #24 Gonzaga: Gonzaga by 4
Good chance for an upset here -- Gonzaga is probably over-ranked.  A win by 4 at home is basically saying you're even with the visiting team.

#9 Duke vs. Notre Dame: Duke by 3
Duke outperformed last week, but this might be a trap game for them.

#5 MSU vs. Indiana: MSU by 2
Indiana's chance to get ranked...

#22 Iowa vs. #4 Wisconsin:  Toss-up
Coin Toss of the Week.  The PM favors the Badgers by a tenth of a point, but that's down in the noise.

UNI vs. #10 Wichita State:  Wichita State by 5
UNI's probably not good enough to win this on the road.

#12 Oregon vs. #21 Colorado:  Colorado by 3
The PM has these teams as nearly identical strength, so this goes to Colorado playing at home.

#20 SDSU vs. #16 Kansas:  Kansas by 4.5
The PM's not nearly as fond of SDSU as the AP -- it has them down around #60 in the country -- so this should be a pedestrian win for Kansas.


Robert Morris vs. #7 Oklahoma State:  OkSt by 30
The PM effectively caps predictions around 30 points MOV, but this might stray into the 40 point range.

Monday, December 23, 2013

Top Twenty (12/23)

Now that the Redskins are dead and buried, I'm turning my attention more to college basketball.  Here's the season's first Top Twenty from the Prediction Machine.

Top Twenty

1 Oklahoma St. 35.8
2 Louisville 34.6
3 Arizona 32.8
4 Villanova 32.6
5 Iowa St. 32.6
6 Iowa 32.4
7 Arkansas 32.3
8 Ohio St. 32.1
9 Arizona St. 32.0
10 Kentucky 31.7
11 Creighton 31.3
12 Duke 31.3
13 Michigan 31.1
14 Kansas 31.0
15 Oregon 30.9
16 Pittsburgh 30.8
17 Colorado 30.6
18 Florida St. 30.4
19 Wisconsin 30.3
20 Syracuse 30.3

Some early-season shakeout still going on in the Top Twenty.  Oklahoma State and Louisville are head-and-shoulders above the rest of the nation.  Oklahoma State lost some ground this week to Louisville despite a solid win over Colorado.


There's a dearth of interesting games until next Saturday.

#8 Villanova at #2 Syracuse:  Nova by 4
The PM has Nova at #4 and Syracuse at #20, so don't expect Syracuse to have an easy time of it.  Home court advantage is worth a lot, though, so it won't be a big surprise if Syracuse wins this game.

#6 Louisville at #18 Kentucky: Louisville by 8.5
The PM thinks better of both teams than the AP, but Louisville (and OKSt) is head-and-shoulders above the rest of the Top Twenty, so the PM thinks they'll win comfortably at Lexington.

Providence @ #23 UMASS!!!:  UMass by 1
Good chance for an "upset" in this game.

#25 Missouri @ N.C. State:  Mizzou by 1
Another good upset possibility.


Blowout of the Week


Wisconsin over Prairie View A&M by 29


Coin Toss of the Week

Boston U @ St. Joseph's

Thursday, October 24, 2013

Local Regression

The Prediction Machine uses a linear regression to form its predictions.  A linear regression works by calculating a straight line equation (hence "linear") that best fits the observed historical data.  That looks something like this picture:
Given a new X, we use the blue line to predict a value for Y.  The Prediction Machine isn't two-dimensional like this illustration -- it has dozens of inputs rather than just one -- but this gives you the general idea of how it works.
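The idea can be sketched in a few lines of Python.  The feature values and margins below are made-up illustrative numbers, not the PM's actual inputs or coefficients:

```python
import numpy as np

# Each row is a game described by a few team-strength features
# (illustrative values only -- the real PM has dozens of inputs).
X = np.array([
    [0.52, 0.31, 0.48],
    [0.47, 0.35, 0.51],
    [0.60, 0.28, 0.44],
    [0.41, 0.40, 0.55],
])
y = np.array([8.0, -3.0, 12.0, -6.0])   # margin of victory (home - away)

# Fit the best straight-line (hyperplane) relationship, with an intercept.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Given a new game's features, predict its margin of victory.
new_game = np.array([0.55, 0.30, 0.46, 1.0])
pred = new_game @ coef
print(pred)
```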

It turns out that a linear regression works pretty well for the Prediction Machine.  But I've wondered whether there aren't "special cases" hidden in the data where the equation that best fits all the data doesn't work well for the special cases.  For example, you might think that teams that are very good at getting offensive rebounds could be predicted more accurately with a slightly different equation.  If you could pick out those cases and use a different linear regression, overall accuracy would improve.

There are a number of different approaches to doing this.  One is to use a more complex regression, so that the "blue line" can bend more flexibly in different regions of the prediction space.  For example, you can use a polynomial regression:

But a polynomial regression still bends "smoothly" and is limited in how many times it can bend.
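To illustrate the difference, here's a hypothetical comparison of a straight-line fit against a cubic fit on deliberately curvy data (nothing to do with basketball; a degree-3 polynomial can bend at most twice):

```python
import numpy as np

x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * x          # a curvy relationship a line can't follow

line = np.polyfit(x, y, 1)        # straight line: degree 1
cubic = np.polyfit(x, y, 3)       # cubic: bends, but only smoothly

# The cubic tracks the curvature better, so its training error is lower.
err_line = np.mean((np.polyval(line, x) - y) ** 2)
err_cubic = np.mean((np.polyval(cubic, x) - y) ** 2)
print(err_line, err_cubic)
```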

Another approach is to predict a game's outcome based upon its nearest neighbors, as I talked about here.  The shortcoming with this is that the prediction is based upon the average of all the nearby neighbors -- which might not be the right estimate.  A more sophisticated model (such as a linear regression) might work better.

Local regression is a modeling technique that combines nearest neighbors with regression.  It finds the nearest neighbors to the example you want to predict, creates a linear regression using just those neighbors, and then predicts the example with that local model.  If your data really has "neighborhoods" that act differently, this should do a better job of prediction.
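A minimal sketch of that procedure, assuming plain Euclidean distance and made-up data (the PM's actual features and RapidMiner's implementation differ in the details):

```python
import numpy as np

def local_regression(X, y, query, k=25):
    # 1. Find the k training examples nearest to the query point.
    dists = np.linalg.norm(X - query, axis=1)
    idx = np.argsort(dists)[:k]
    # 2. Fit an ordinary linear regression on just that neighborhood.
    A = np.column_stack([X[idx], np.ones(k)])
    coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
    # 3. Predict the query with the local model.
    return np.append(query, 1.0) @ coef

# Synthetic data with a known linear relationship plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

pred = local_regression(X, y, np.array([0.2, -0.3, 0.1]))
print(pred)   # true value is 2*0.2 - 1*(-0.3) + 0.5*0.1 = 0.75
```

If the data had real neighborhoods, the locally fitted coefficients would differ from region to region; on globally linear data like this, they match the global fit, which is essentially what the experiment below found for the basketball data.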

Local regression was recently added to RapidMiner so I took the opportunity to apply it to the Prediction Machine to see if it would improve performance.

The results were disappointing or enlightening, depending upon your perspective.  For small numbers of neighbors, local regression performed much worse than the plain linear regression.  It wasn't until the number of neighbors exceeded 2000 that its performance started to approach that of the linear regression.

This confirms earlier experiments suggesting that there aren't localized "neighborhoods" within the NCAA basketball data where we can improve performance by treating them differently.  The factors that predict performance seem to apply equally across the whole spectrum of college teams.

Wednesday, October 2, 2013

2013-2014 Schedule Available

The Prophet hasn't completely awoken from his off-season hibernation, but is rousing long enough to mention that ESPN recently posted the first schedule of games for the season.  They don't seem to have locations yet for the neutral-site games, and there are probably missing games.  I've posted the scraped file to the Data page.

Looking back, I see the Prediction Machine got the participants in the final games of both the NIT and the NCAA tournaments correct, and predicted all of the Final Four games correctly.

Thursday, April 4, 2013

NIT Final

Not to brag or anything, but about a month ago, the Prediction Machine crowned Iowa as the best team that didn't get into the Tournament, with Baylor right behind.

Tonight, Iowa met Baylor in the final game of the NIT.

(Virginia, also mentioned in that posting, lost to Iowa in the quarterfinals of the NIT.)

Wednesday, April 3, 2013

Final Four Predictions

The Prediction Machine doesn’t have a good record in the Tournament this year, but I console myself that no one else does, either.  (And I'm glad I didn't publish Elite Eight predictions -- they would have been mostly wrong!)  After the early round games, and with the exit of most of the high seeds, the PM thinks Louisville is the cream of the remaining crop:

Home            Away            Predicted MOV
(1) Louisville  (9) Wichita St. 12.4
(4) Michigan    (4) Syracuse    5.4
(1) Louisville  (4) Michigan    10.2
(1) Louisville  (4) Syracuse    10.9

The Prediction Machine likes Michigan over Syracuse, but that game represents a bit of a predictive dilemma because it involves two #4 seeds facing each other.  Normally the better seeded team is the Home team, and benefits from the Tournament version of Home Court Advantage.  (*)  It isn’t clear how to resolve this when two identical seeds face each other – this happens so rarely in the Tournament that there isn’t clear precedent.  In this case, the PM rates the teams nearly identical, and predicts the win for whichever team is the “home” team.   Since the NCAA has chosen Michigan as the home team, I’ll go with that.  However, I’ve shown both possible matchups in the final game just in case.

(*) Why is there a home court advantage in the Tournament?  My theory: The Home Court Advantage derives largely from referee bias.  During the regular season the referee bias is that “teams play better at home” and so they give the home team the benefit of calls, etc.  During the tournament the bias is “the better seeded team is better” and so that team gets the benefit of calls.

Saturday, March 30, 2013

Sweet Sixteen Update, Part 2

The second half of the Sweet Sixteen games has been played, so here's the update on the Prediction Machine's thoughts (Vegas line, PM prediction, and the difference):

Home          Away                Line  PM    Diff
Miami (FL)    Marquette           6     3.7   -2.3
Louisville    Oregon              10    11.6  1.6
Ohio State    Arizona             3.5   5.4   1.9
Indiana       Syracuse            5.5   7.7   2.2
Duke          Michigan State      2     4.3   2.3
Kansas        Michigan            2     4.9   2.9
Wichita State La Salle            4     7.6   3.6
Florida       Florida Gulf Coast  12.5  20.2  7.7

The PM went 1-2 on betting predictions.  The Louisville-Oregon prediction was too close for a bet, but the PM would have been on the wrong side there as well.

Michigan-Kansas was the surprise result to me – I really expected Kansas to win that game fairly handily.  They might have still squeaked out the win if not for some bone-headed late-game plays.

I’m traveling again today, so I don’t know if I’ll manage to get out Elite Eight predictions.

Friday, March 29, 2013

Sweet Sixteen Update (Part 1)

The first half of the Sweet Sixteen games has been played, so here's the update on the Prediction Machine's thoughts (Vegas line, PM prediction, and the difference):

Home          Away                Line  PM    Diff
Miami (FL)    Marquette           6     3.7   -2.3
Louisville    Oregon              10    11.6  1.6
Ohio State    Arizona             3.5   5.4   1.9
Indiana       Syracuse            5.5   7.7   2.2
Duke          Michigan State      2     4.3   2.3
Kansas        Michigan            2     4.9   2.9
Wichita State La Salle            4     7.6   3.6
Florida       Florida Gulf Coast  12.5  20.2  7.7
The PM went 2-1 on betting predictions.  In its Tournament bracket, it correctly picked Marquette over Miami.

Unfortunately, in the most significant game of the night for the PM, it was very wrong on the Indiana game.  Indiana struggled all night against the Syracuse zone, and (particularly in the first half) Syracuse seemed to make every shot attempt.  Having watched Syracuse's first two round games at San Jose, I was expecting the zone to cause Indiana trouble, but I didn't expect Syracuse to look so good on the offensive end.

At any rate, unless something very unusual happens, that result eliminates the PM from contention in the Machine Madness Contest.  This will be the first year where the PM didn't go into the final game with a chance to win the contest.
The Arizona-Ohio State prediction was too close to recommend a bet, but the PM ended up on the wrong side of the line there as well.   Ross’s defensive blunder on the next-to-last play of the game probably cost quite a few gamblers a payout.

Wednesday, March 27, 2013

The Prediction Machine’s Bracket

The Prediction Machine is primarily focused on picking the margin of victory for regular season games, but I also use it to create a bracket for the Machine Madness Contest.  The contest has been going on for a few years, and my approach to picking a bracket has evolved.

Initially, the Prediction Machine picked the most likely winner of each game – whichever team it deemed stronger.  But there’s a serious drawback to this approach.  The Committee is already pretty good at determining the relative strength of the teams, so by and large the Prediction Machine’s picks agreed with the seedings.  It only differed where the Committee had “mis-seeded” teams.  That seems to happen every year, but there are usually only one or two mis-seeds.  So you end up with a bracket that may be the most likely outcome, but which is also going to be very similar to many other brackets.  (In fact, we see that very thing in this year’s Machine Madness competition: “Danny’s Dangerous Picks” and “Predict the Madness” are identical after the second round.)  This makes it very hard to finish high in a pool with a lot of entrants.

In the next iteration, I forced the Prediction Machine to pick about 15% of the games as upsets.  I chose that number because historically, that’s about how many upsets there are each Tournament.  The Prediction Machine did this by ranking the upsets and selecting the top 6 upsets in the first round and 5 more in the rest of the tournament.  The idea was to get away from the consensus picks of the other competitors while picking the most likely upsets.  But this is too risky a strategy.  Depending upon the size of the pool, you probably don’t need to get 11 upsets correct to do very well.  For example, in last year’s Machine Madness pool, it would have been sufficient to get 8 points from upsets – which could be just one correct upset pick in the round of 8.

This year, the Prediction Machine used an algorithm which took a target number of upset points and tried to select the most likely set of upsets to meet that total.  Initially I planned to use a target number of 8 points – based on last year’s results – but in the end decided to set the target higher, with the goal of ending up in the top 5% of the ESPN contest if the upsets occurred as predicted.  I placed that goal at (a somewhat arbitrary) 50 points.  I then used the Prediction Machine to predict all the chalk matchups in the tournament.  This identified a number of games where the Prediction Machine thought the lower-seeded team would win:

Home Away
Georgetown Florida
UCLA Minnesota
Kansas St. Wisconsin
Colorado St. Missouri
Memphis St. Mary's
New Mexico Arizona

This adds up to 11 points of mis-seeds.  That’s a surprising number and may reflect an unusual basketball season.  When I plugged these upsets in and ran the tournament again, I discovered that the Prediction Machine also favored #3 Florida over #1 Kansas (an 8 point game), so I added that in for 19 total points of mis-seeds.

The PM then identified the most likely upsets in the remaining games.  These were the top results:

Home Away
Gonzaga Ohio St. 17.5
Notre Dame Iowa St. 17.4
Miami (FL) Marquette 17.3
Louisville Indiana 14.1

The PM then added upsets in order of likelihood until it reached 50 (or in this case, 64).  (The next upset on the list was Oklahoma over San Diego State.)
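That selection loop is just a greedy algorithm.  A sketch, using the upset scores from the table above but with illustrative bracket point values (the actual contest scoring differs by round):

```python
def pick_upsets(candidates, target_points):
    """Greedily add upsets in order of likelihood until the
    accumulated bracket points reach the target.

    candidates: list of (name, upset_score, bracket_points)."""
    picks, total = [], 0
    for name, score, points in sorted(candidates, key=lambda c: -c[1]):
        if total >= target_points:
            break
        picks.append(name)
        total += points
    return picks, total

# Upset scores from the table above; point values are hypothetical.
candidates = [
    ("Ohio St. over Gonzaga", 17.5, 16),
    ("Iowa St. over Notre Dame", 17.4, 16),
    ("Marquette over Miami (FL)", 17.3, 16),
    ("Indiana over Louisville", 14.1, 16),
]
picks, total = pick_upsets(candidates, 50)
print(picks, total)
```

Note that greedy selection can overshoot the target, which is exactly the problem the refinements below are meant to address.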

There are a couple of refinements to this approach that I haven’t had time to incorporate.  A simple refinement would be to drop 14 points of upsets to get back to 50 points.  A more complex refinement would be to try different combinations of upsets to get the most likely combination that reaches the target points.  Either refinement in this year would have ended up keeping just the Louisville-Indiana upset in the final game.

It’s just as well that I didn’t have time to implement either refinement.  This year’s Machine Madness field turned out much larger than expected (27 competitors!) and even if Indiana wins everything, I won’t win the competition unless Marquette beats Miami – one of the upsets that would be dropped to get back to 50.

Looking at the Prediction Machine’s performance, in the first rounds it went 2-2 on mis-seeds/upsets, and in the second round 1-1.  50% correct on picking upsets is probably a pretty good performance.  In the ESPN competition, the Prediction Machine’s bracket is at the 94.4th percentile out of about 8 million entries, with 7 of its Round of Eight picks still alive.

Tuesday, March 26, 2013

Sweet Sixteen Predictions

Here are the Prediction Machine's thoughts on the Sweet Sixteen games (Vegas line, PM prediction, and the difference):

Home          Away                Line  PM    Diff
Miami (FL)    Marquette           6     3.7   -2.3
Louisville    Oregon              10    11.6  1.6
Ohio State    Arizona             3.5   5.4   1.9
Indiana       Syracuse            5.5   7.7   2.2
Duke          Michigan State      2     4.3   2.3
Kansas        Michigan            2     4.9   2.9
Wichita State La Salle            4     7.6   3.6
Florida       Florida Gulf Coast  12.5  20.2  7.7

The PM likes mostly home teams, although it thinks Marquette +6 is a good bet.  The PM has Marquette picked as a likely upset in its bracket.  It needs Marquette to win this game and Indiana to win out in order to finish first in the Machine Madness Contest.  (I’ll have a blog post shortly about how the PM picked its bracket.)

At the other end of the spectrum, the PM likes Florida to crush FGCU, even with FGCU's recent victories taken into account.  I'm dubious.  I'm also dubious of the Indiana prediction, given how impressive Syracuse was in San Jose.  And the PM has liked Wichita State all along, and is looking for a fairly routine victory over La Salle.

The PM doesn't usually see this many "bettable" games, where the difference between the Vegas line and the PM’s prediction is greater than 2 points.  It’s likely that – since there are many more regular season games to train upon – the PM doesn't do as good a job accounting for the Tournament conditions as Vegas.  Alternatively, it may be doing a better job, or the lines may be more influenced by betting during the Tournament when there’s more action.
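That 2-point threshold is easy to express as a filter.  Using the lines and predictions from the Sweet Sixteen table above:

```python
# Flag "bettable" games: those where the PM's predicted margin
# differs from the Vegas line by more than 2 points.
games = [
    # (home, away, vegas_line, pm_prediction)
    ("Miami (FL)", "Marquette", 6.0, 3.7),
    ("Louisville", "Oregon", 10.0, 11.6),
    ("Ohio State", "Arizona", 3.5, 5.4),
    ("Indiana", "Syracuse", 5.5, 7.7),
    ("Duke", "Michigan State", 2.0, 4.3),
    ("Kansas", "Michigan", 2.0, 4.9),
    ("Wichita State", "La Salle", 4.0, 7.6),
    ("Florida", "Florida Gulf Coast", 12.5, 20.2),
]
bettable = [(h, a) for h, a, line, pm in games if abs(pm - line) > 2.0]
print(bettable)   # 6 of the 8 games clear the threshold
```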

Monday, March 25, 2013

Upset Picks Review

I previously posted the Prediction Machine's top ten upset picks for the first round.  How did it do?  In the table below I've added a column for MOV -- a negative number indicating that the underdog (Away Team) won the game.

Home            Away         Upset  MOV
Notre Dame      Iowa St.     17.4   -18
San Diego St.   Oklahoma     7.1    15
Memphis         St. Mary's   2.5    2
Oklahoma St.    Oregon       2.4    -13
N.C. State      Temple       2.2    -4
Colorado St.    Missouri     1.8    12
Pittsburgh      Wichita St.  1.6    -18
North Carolina  Villanova    1.2    7

According to the PM, the Notre Dame-Iowa St. game had significantly higher upset chances than any other game, and in fact Iowa State won the game handily.  The PM also liked Oklahoma to upset San Diego State, but San Diego State won that game handily.

The next tier of upset possibilities (> 2) was less likely, but the PM also went 50% in this tier.  (Actually 75%.  Although I marked it as a missed upset here, after the play-in game, the PM had St. Mary's as an outright favorite in the game against Memphis, so the Upset probability was actually for Memphis to win!  The Upset metric remained almost the same, by the way.)

The next tier is below the cutoff for consideration as an upset, although both the Wichita State and Cal upsets were identified here.  (And Cincinnati, Missouri and Bucknell were all popular upset picks by pundits.)

Combined with last year's results, it appears that the PM's algorithm for detecting likely upsets works fairly well.  (Note that whether the PM should include the upset in its bracket is a different question!)

Saturday, March 23, 2013

Machine Madness: Some Final Four Analysis

Games don't start here in San Jose until 4 pm, so I've started to take a look at the Machine Madness entries.  Here are some observations about the Final Four predictions:

Consensus Champion

To the extent that there is consensus, the predictors seem to think the champion is going to come out of the Midwest bracket -- both Duke and Louisville get five "votes" as champion.  The next most popular picks are Florida (4), Indiana (3) and OSU (2).  The predictors are particularly dismissive of Kansas and Gonzaga as #1 seeds -- they both get only one vote for champion.  Least Likely champion prediction goes to LA's Machine Mad pick:  Notre Dame.  That ain't happening :-).

Final Game

Looking at who the predictors think will be in the final game, Louisville dominates with 10 votes, followed by Florida (7), Kansas (6), Indiana (6), Duke (5), Gonzaga (3) and OSU (2).  Florida gets a lot more love from the predictors than from the Committee; clearly the predictors think Florida is closer to a 1 seed than a 3.  Conversely, the predictors don't think much of #2 seed Miami, which only got one vote to get to the final game.

Final Four

As mentioned, Louisville and Duke dominate the machine picks for the Midwest Region.  In the West, Gonzaga gets 11 votes, followed by tOSU with 6 votes.  Interestingly, New Mexico got 3 votes to win the West -- that obviously won't happen, but I'm curious to see what those predictors have in common that led to that conclusion.  In the South, Florida gets 13 votes and Kansas 7.  In the East, Indiana is the overwhelming favorite with 17 votes -- interesting because only 3 of the predictors see Indiana winning everything.

Craziest Final Four

We have to combine predictions across different entries, but: Missouri, Notre Dame, UNC, Davidson.  Seems unlikely :-)