Saturday, November 14, 2015

Really, ESPN?

With the first day's games done I fired up my ESPN score scraper to gather up the data and get the season started.

It crashed.

It seems like ESPN chose the first day of the season to roll out a new format (and URL scheme) for their basketball scoreboard page.

To be fair, I wrote my current (Python Scrapy-based) scraper with the help of Brandon Harris (*) and he warned me when I started down this road that ESPN was busy mucking up all their scoreboard pages.  "Oh no," said I, "it looks fine, I'm sure they won't change it at the last second."  So I have no one to blame but myself.

(*) And by help I mean he basically gave me working code.

ESPN went all Web 3.0 in their page redesign, which means that rather than send a web page, they send a bunch of JavaScript and raw data and make your web browser build the page. (Which probably saves them millions of dollars a year in server costs, so who can blame them?) This breaks the whole scraper paradigm, which is to XPath through the HTML to find the bits you need. There's no HTML left to parse. The good news in this case is that ESPN was nice enough to include the entire URL I need in the data portion of the new format, so it is very easy to do a regular expression search and pull out the good parts. Otherwise you get into kludgy solutions like using a headless web browser to execute the JavaScript and build the actual HTML page, or hunting down the mobile version of the page and hoping that's more parseable.
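A minimal sketch of that regular-expression approach, assuming the URLs sit inside the page's data blob as ordinary quoted strings. The sample page text, the URL shape, and the game IDs below are invented for illustration and are not taken from ESPN's actual format.

```python
import re

# Hypothetical stand-in for the new scoreboard page: a script tag carrying raw data.
# The URL shape and game IDs are made up for this example.
SAMPLE_PAGE = '''
<script>window.espn.scoreboardData = {"events": [
  {"links": [{"href": "http://espn.go.com/mens-college-basketball/boxscore?gameId=400000001"}]},
  {"links": [{"href": "http://espn.go.com/mens-college-basketball/boxscore?gameId=400000002"}]},
  {"links": [{"href": "http://espn.go.com/mens-college-basketball/boxscore?gameId=400000001"}]}
]};</script>
'''

# Assumed pattern: a full boxscore URL ending in a numeric game ID.
BOXSCORE_URL = re.compile(
    r'http://espn\.go\.com/mens-college-basketball/boxscore\?gameId=\d+'
)


def find_boxscore_urls(page_text):
    """Return each distinct boxscore URL in the page, in order of first appearance."""
    seen = []
    for url in BOXSCORE_URL.findall(page_text):
        if url not in seen:
            seen.append(url)
    return seen


urls = find_boxscore_urls(SAMPLE_PAGE)
for url in urls:
    print(url)
```

The de-duplication step is there because the same link can easily show up more than once in a data blob like this.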

I don't do much with the model until a few weeks of games have been played, so I have some time to fix my code. And I suppose that if you want to scrape data from the Web, you'd better be prepared to deal with change.

7 comments:

  1. It's almost better this way. The JSON string is really nice!

    1. You have a point. I haven't had a chance to look into parsing the JSON string, but it would have some nice benefits. For the moment at least the boxscore pages (where I get most of my data) remain "Web 1.0" so I'm not going to try to parse the JSON. But if the boxscore pages switch over that would be great.

    2. Ah, I hadn't looked at the box score page yet, but that's next on my list of things to do after all the real to-dos are done! The nice part about the JSON is that it doesn't require any real work. All the context info (home or away, etc.) isn't positional, it's keyworded, which gets over my biggest dislike of Web 1.0 tables!

    3. The box score page is still HTML. If Scrapy has a good way to grab just that particular piece of JSON out of the page, that would be very nice. As you say, much nicer to parse and comprehend than trying to craft XPath!

  2. Glad you were able to recover! I hadn't noticed the JSON string either, I guess I was too busy cursing out ESPN.. :P

  3. the schedule page http://espn.go.com/mens-college-basketball/schedule/_/date/20141116 is still pretty easy to parse

    where do you find the json data on the scores page?

    1. Huh, that's an interesting page. I never would have thought to look at a schedule page for a past date.

      If you go to the Scoreboard front page, you'll find the JSON data stuffed into window.espn.scoreboardData.

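For anyone who wants to go the JSON route, here is a rough sketch of pulling the object assigned to window.espn.scoreboardData out of the raw page text with nothing but the standard library. The sample page and the field names inside it ("events", "name", "status") are assumptions made up for the example, not ESPN's actual schema.

```python
import json
import re

# Hypothetical page text; the keys inside the object are invented for illustration.
SAMPLE_PAGE = (
    '<script>window.espn.scoreboardData = '
    '{"events": [{"name": "Team A at Team B", "status": "Final"}]};'
    'window.espn.somethingElse = 1;</script>'
)

# Match the assignment, including any whitespace around the "=" sign,
# so the match ends exactly where the JSON value begins.
ASSIGNMENT = re.compile(r'window\.espn\.scoreboardData\s*=\s*')


def extract_scoreboard_data(page_text):
    """Return the decoded object assigned to window.espn.scoreboardData, or None."""
    match = ASSIGNMENT.search(page_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ";" and other script code).
    data, _end = json.JSONDecoder().raw_decode(page_text, match.end())
    return data


data = extract_scoreboard_data(SAMPLE_PAGE)
for event in data["events"]:
    print(event["name"], event["status"])
```

Using raw_decode avoids having to write a regular expression that guesses where the JSON object ends, which gets fragile once the data contains nested braces.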