Tuesday, January 29, 2013

Data Cleansing

Generally I gather my data from ESPN and it's mostly consistent.  They do a pretty good job keeping the data clean.  But I've also gathered data from Yahoo and other sources (such as betting lines).  And any attempt to merge the data from different sources is an adventure in reconciliation.

There's no standard for reporting neutral site games, so sometimes one team is the home team, and sometimes the other team is the home team.
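One way to sidestep the home/away question when matching games across sources is to key each game on the date plus the unordered pair of teams. This is only a rough sketch: `game_key` is a hypothetical helper of my own naming, and it assumes team names have already been reconciled and that two teams don't meet twice on the same date.

```python
from datetime import date

def game_key(game_date: date, team_a: str, team_b: str) -> tuple:
    """Build a key that ignores which team a source lists as 'home'.

    Sorting the two names means (A, B) and (B, A) produce the same key,
    so a neutral-site game reported differently by two sources still matches.
    """
    first, second = sorted((team_a, team_b))
    return (game_date, first, second)
```

The original home/away flag from each source can still be kept as a separate column, so nothing is lost by matching on the order-independent key.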

Team names are reported any which way.   Teams like "California State Fullerton Titans" are a nightmare -- CSU Fullerton, CSU-Fullerton, CSU Fullerton Titans, Cal State Fullerton, Cal St Fullerton, Cal St. Fullerton, Cal. State Fullerton, etc., etc.  Surprisingly, the only outright confusion is SDSU, which usually means South Dakota State University, but sometimes means San Diego State University.
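A common way to tame this is an alias table keyed on a lightly normalized version of the name. The sketch below is illustrative only: the table contents, the choice of "Cal State Fullerton" as the canonical spelling, and the function names are my own assumptions, and a real table would have to be built up by hand as new spellings appear.

```python
import re

# Hypothetical alias table: normalized spelling -> canonical name.
ALIASES = {
    "california state fullerton titans": "Cal State Fullerton",
    "csu fullerton": "Cal State Fullerton",
    "csu fullerton titans": "Cal State Fullerton",
    "cal state fullerton": "Cal State Fullerton",
    "cal st fullerton": "Cal State Fullerton",
}

# Abbreviations that cannot be resolved without knowing the source.
AMBIGUOUS = {"sdsu"}

def normalize(raw: str) -> str:
    """Lowercase, turn periods and hyphens into spaces, collapse whitespace."""
    s = re.sub(r"[.\-]", " ", raw.lower())
    return re.sub(r"\s+", " ", s).strip()

def canonical_team(raw: str) -> str:
    """Map a reported team name to its canonical name.

    Raises ValueError for known-ambiguous abbreviations and KeyError for
    names not yet in the table, so both get noticed rather than guessed at.
    """
    key = normalize(raw)
    if key in AMBIGUOUS:
        raise ValueError(f"ambiguous team name: {raw!r}")
    return ALIASES[key]
```

Normalizing first keeps the table small: "CSU-Fullerton", "Cal St. Fullerton" and "Cal. State Fullerton" all collapse onto entries that are already there. Failing loudly on SDSU is deliberate, since which school it means depends on the source.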

Game times are typically reported in Eastern Time.  Which is fine except for Hawaii, whose scores are sometimes reported on the day the game started, and sometimes on the day it ended.  The Great Alaska Shootout is a problem, too.  According to ESPN, Oral Roberts played both Loyola Marymount and Charlotte on 11/22 this year.  Talk about a tough schedule -- fire the AD!
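When a source does give a tip-off time, one way to get a consistent game date is to convert the Eastern timestamp to the venue's own time zone and take the date there. A minimal sketch, assuming Python 3.9+ (for the standard `zoneinfo` module) and that the venue's time zone is known; `local_game_date` is a hypothetical helper name.

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")

def local_game_date(reported_et: datetime, venue_tz: str) -> date:
    """Treat a naive datetime as Eastern Time and return the date at the venue.

    venue_tz is an IANA zone name such as "Pacific/Honolulu" or
    "America/Anchorage".
    """
    aware = reported_et.replace(tzinfo=EASTERN)
    return aware.astimezone(ZoneInfo(venue_tz)).date()
```

For example, a made-up Hawaii tip-off listed as 12:30 AM Eastern on 11/23 is 7:30 PM on 11/22 in Honolulu, so `local_game_date(datetime(2012, 11, 23, 0, 30), "Pacific/Honolulu")` gives 11/22. This only helps when the source reports a time at all; a bare date still has to be reconciled by hand.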

A couple of games were cancelled this year (notably the 11/9 weather games), and a game between BYU and Utah State was postponed when a Utah State player dropped dead on the court during practice.  Luckily the athletic trainer had a defibrillator and restarted the player's heart, saving his life.

The non-D1 teams that schools find to play are endlessly fascinating.  Last week Houston Baptist played Ecclesia College, which is so small it barely has a Wikipedia page.  I can't even tell from its website how many students are enrolled there.  The most popular non-D1 opponents this year are San Diego Christian (8 games against D1 opponents, lost every one) and Rochester (6 games, won against Eastern Illinois).   MIT seems to be Harvard's opening opponent every year (?).  LeTourneau University was founded by a guy who made his fortune inventing earthmoving equipment.


  1. Try Wolfram Alpha for college info. As of 2009 Ecclesia College had 279 students. You can also try http://college-sports.findthedata.org/, which had the information as well. What about going with the official NCAA stat pages for scores and schedules?

  2. Thanks for the info, CS! Ecclesia College scored 40 points on Houston Baptist -- pretty good for a college with <300 students!

