Sunday, December 6, 2015

A Few Funny Things

When I logged in to work on this post, I noticed that my blog had 100,000 page views.  Since I have an audience of like six people, you guys must be checking my pages a lot.  Good job!  Anyway, I've been spending some time lately getting my data scraping working, and that always involves a few trips through the bowels of data validation.

First stop is this game.  I was running the predictor when it warned me about an unusual event:  a conference game in early November.  Unusual, but it happens (often a Big5 game).  What was more surprising was that it was a team playing itself.  According to the predictor, UNC Greensboro had come up with the clever notion of scheduling a home game against itself.  Or maybe it was on the road. 

One of the challenges of predicting NCAA basketball is that every data source uses different names for teams.  To try to match them up I have lists of alternate team names:

St. Francis (NY)
1383
St. Francis BRK
St. Francis (N.Y.)
St. Francis-NY
 St Francis NY
St Francis(NY)
St Francis (NY)
St. Francis Brooklyn
St. Francis NY
St. Francis-NY Terriers
St Francis (BKN)
st.-francis-(NY)-terriers
St. Francis (BKN)
(That weird-looking "1383" is the name for St. Francis (NY) in the Kaggle contest.  The contest is run by data scientists, so why use a human-readable name when you can use an arbitrary and completely useless number?)
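For the curious, here is a minimal sketch of how a lookup over a list like that might work.  This is not my actual code, just an illustration: the `ALIASES` table, `normalize`, `build_lookup`, and `canonical_name` are all made-up names, and the normalization rule (lowercase, strip punctuation, collapse spaces) is an assumption about what is good enough.

```python
import re

# Hypothetical alias table: canonical name -> alternate spellings seen in
# various data sources (a subset of the list above).
ALIASES = {
    "St. Francis (NY)": [
        "1383",
        "St. Francis BRK",
        "St. Francis (N.Y.)",
        "St. Francis-NY",
        "St Francis(NY)",
        "St. Francis Brooklyn",
        "St. Francis-NY Terriers",
        "St Francis (BKN)",
    ],
}


def normalize(name):
    """Lowercase, turn runs of punctuation/whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def build_lookup(aliases):
    """Map every normalized spelling to its canonical team name."""
    lookup = {}
    for canonical, alternates in aliases.items():
        for name in [canonical, *alternates]:
            lookup[normalize(name)] = canonical
    return lookup


LOOKUP = build_lookup(ALIASES)


def canonical_name(raw):
    """Return the canonical team name, or None if the spelling is unknown."""
    return LOOKUP.get(normalize(raw))


print(canonical_name("st.-francis-(NY)-terriers"))
print(canonical_name("Greensboro College"))
```

Normalizing before the lookup means variants that differ only in case, punctuation, or spacing don't each need their own entry.  Returning `None` for an unknown spelling (instead of guessing) leaves the decision to a human.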

In this case the predictor too aggressively (although reasonably) determined that Div III Greensboro College was a nickname for UNC Greensboro.  (By the way, my list of nicknames and the Python code that goes with it is available for the asking.  But you're on your own dealing with Greensboro vs. Greensboro.)
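A cheap sanity check for this kind of mix-up is to flag any game where both sides resolve to the same team.  The sketch below is hypothetical (the `find_self_games` helper and the tiny alias table are invented for illustration, and this is not the check my predictor actually runs), but it shows the idea:

```python
def find_self_games(games, resolve):
    """Return games whose home and away names resolve to the same team.

    `games` is an iterable of (home, away) raw names; `resolve` maps a raw
    name to a canonical name, or None if the name is unknown.
    """
    suspicious = []
    for home, away in games:
        home_id, away_id = resolve(home), resolve(away)
        if home_id is not None and home_id == away_id:
            suspicious.append((home, away, home_id))
    return suspicious


# Hypothetical, over-eager alias table mirroring the Greensboro mix-up.
aliases = {
    "UNC Greensboro": "UNC Greensboro",
    "Greensboro College": "UNC Greensboro",  # wrong: a different school
    "Duke": "Duke",
}
games = [
    ("UNC Greensboro", "Greensboro College"),
    ("Duke", "UNC Greensboro"),
]

for home, away, team in find_self_games(games, aliases.get):
    print(f"{home} vs. {away}: both resolve to {team}")
```

A hit doesn't tell you which alias is wrong, only that one of them probably is.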

Next up is this game.  Looks like a perfectly reasonable WAC game.  Problem is, one of those teams was not in the WAC.  Actually, one of those teams didn't even exist.

You see, last year the University of Texas decided to merge two campuses -- the University of Texas Brownsville and the University of Texas Pan American -- to form a brand new campus, the University of Texas Rio Grande Valley.  Brownsville didn't have sports, but UT-PA was a Division I team in the WAC, so the new campus stayed in the WAC and became the "Vaqueros."

(Trivia Question:  Name the other four NCAA Division I basketball nicknames that are Spanish words.)

Well, ESPN decided the easiest way to deal with this whole business was to just go into their database and replace every instance of "University of Texas Pan American Broncs" with "UTRGV Vaqueros."  Hence the mysterious 2013 game involving a university that wouldn't exist for several more years.
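One way to catch this sort of retroactive renaming is to record when each team name came into use and flag games that predate it.  Here is a rough sketch under some loud assumptions: `anachronistic_games` and the `first_season` table are hypothetical, the opponent name is made up, and the 2015 start year is a placeholder for illustration, not a verified date.

```python
def anachronistic_games(games, first_season):
    """Return (season, team) pairs where a name is used before it existed.

    `games` is an iterable of (season, home, away); `first_season` maps a
    team name to the first season that name was in use. Names missing from
    the table are not checked.
    """
    flagged = []
    for season, home, away in games:
        for team in (home, away):
            start = first_season.get(team)
            if start is not None and season < start:
                flagged.append((season, team))
    return flagged


# Placeholder data for illustration only.
first_season = {"UTRGV Vaqueros": 2015}
games = [
    (2013, "UTRGV Vaqueros", "Some Opponent"),
    (2015, "UTRGV Vaqueros", "Some Opponent"),
]

print(anachronistic_games(games, first_season))
```

The table only needs entries for names you know are new, so it stays small.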

1 comment:

  1. It always seems like more time is spent on human readability and data conditioning than anything else!
