Friday, February 15, 2013

Web-Harvest Tutorial (Part 3)

I left off last time with a script that would read a date from a file and then fetch the ESPN Scoreboard web page for that date:
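
For reference, here is a sketch of that starting point, taken from the opening of the completed script at the end of this post. Note that in that version the date in the URL is written out directly rather than taken from the datestring variable.

```xml
<?xml version="1.0" encoding="UTF-8"?>

<config>
  <!-- Read the date from a file -->
  <var-def name="datestring">
    <file action="read" path="date.txt"></file>
  </var-def>

  <!-- Fetch the scoreboard page and convert the HTML to XML -->
  <var-def name="webpage">
    <html-to-xml>
      <http url="http://scores.espn.go.com/ncb/scoreboard?date=20121124"/>
    </html-to-xml>
  </var-def>
</config>
```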


Today I'll show how to pull information out of the webpage and save it -- specifically, we'll pull the teams and scores out of the page and save them off for later use.

Find the Information

The first step is to figure out where the information we want is in the web page.  This is easier said than done on modern web pages, which tend to be impenetrable morasses of JavaScript, HTML and CSS.  One way to get started is to use the "View Source" option in your web browser (or save the web page onto your computer and view it with your favorite text editor) and then search for text you can see on the web page.  For example, if we go to the ESPN Scoreboard page for 11/24/2012, we can see that the first listed game is Duke versus Louisville.  If we do "View Source" and search for "Duke", we find this as the first reference:
<a title="Duke" href="http://espn.go.com/mens-college-basketball/team/_/id/150/duke-blue-devils">Duke</a>


And this, in fact, is the HTML code that creates the "Duke" text in the Duke vs. Louisville scoreboard.  With some more digging, reformatting and so on, we can eventually see that the information about Duke is in an HTML structure that looks like this:
<div class="team visitor">
  <div class="team-capsule">
    <span id="323290097-aTeamName">
      <a title="Duke">Duke</a>
    </span>
  </div>
  <ul id="323290097-aScores" class="score" style="display:block">
    <li class="final" id="323290097-awayHeaderScore">76</li>
  </ul>
</div>
The information about Louisville is in a structure that is identical except that it starts with "team home" instead of "team visitor."

So now that we know where the information is, we need to pluck it out and put it to use.

The Power of XPath

Web-Harvest uses XPath extensively to dig information out of webpages.  XPath is a notation for specifying where to find something in an XML file.  It's a "path" from the top level of the XML down to some particular piece (or pieces) of the XML.  It's very useful and very powerful but, like regular expressions, it can be confusing and difficult to use.  If you don't know anything about XPath, you might want to go off and read a tutorial about it to familiarize yourself with how it works.  It's also very useful to have an XPath tester for working out the correct paths for the information you're trying to get.

In fact, Web-Harvest itself provides a very handy XPath tester.  To see its use, run the above script to fetch the ESPN page, and then use the left-hand pane to see the value of the "webpage" variable (also as shown above).  Now click on the magnifier icon to the right of the "[Value]" box and you'll get a pop-up window showing the text of the webpage:


Notice the "View as:" option in the top left of the pop-up.  Click here and select XML.  This will show the webpage in XML format:


This view has a couple of handy features.  First, you can use the "Pretty-Print" button at the top to reorganize and clean up the XML for easier viewing.  Second, you'll see a box at the bottom labeled "XPath expression."  If you type an XPath into this box, Web-Harvest will run that XPath against the displayed XML and show the result.  For example, try typing the XPath //div[@class="team visitor"] into the box.  This expression finds all div elements in the page that have the class "team visitor":


This matches a total of 15 div elements on this page, the first of which is the Duke entry we found above.

When an XPath returns a list of items, we can pick items out of the list in various ways, including using an index. To pick out the first element of this list, we use (//div[@class="team visitor"])[1]. That gives us the entire block of HTML for Duke that I showed earlier. If you look up there, you'll see the team name is within an <a title="Duke"> tag. We can pull that out by extending our XPath to say (//div[@class="team visitor"])[1]//a[@title] which essentially says "Give me all the <a> elements with a title attribute that are within the first div element with a class of team visitor". Try that out:


We've now narrowed the XPath down to just the <a> element containing the team name. We can extract the actual name by appending text() to the end of our XPath, giving (//div[@class="team visitor"])[1]//a[@title]/text(). This returns whatever text it finds inside the element selected by the XPath:



Here's how we'd use that same XPath within Web-Harvest to pull out the name and save it in a variable:
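
A minimal sketch of that step, following the var-def/xpath pattern used in the completed script at the end of this post. It assumes the "webpage" variable has already been defined earlier in the same config; the variable name "visitor" is borrowed from the completed script.

```xml
<!-- Run the XPath against the fetched page and store the result in "visitor" -->
<var-def name="visitor">
  <xpath expression="(//div[@class='team visitor'])[1]//a[@title]/text()">
    <var name="webpage"/>
  </xpath>
</var-def>
```

The XPath is written with single quotes inside the expression attribute so that it can sit inside the double-quoted XML attribute value.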



You can experiment with creating the XPaths to pull out the home team's name and the final scores of the game.

Looping


The XPath example above works on the first element in the list of visitor team names, but what we really want to do is capture the team names and scores for all the games on the page. To do that, we will loop over each of the game sections in turn. Web-Harvest provides a processor for this called <loop>, which works about as you would imagine. It takes a list of elements and loops over them one at a time, and returns a list of the results. Here's the skeleton for looping over each of the games in turn:
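
A sketch of such a skeleton, assuming the "webpage" variable from the earlier script. The list XPath and the item name are the ones used in the completed script at the end of this post; the body here does nothing except return the current item.

```xml
<loop item="currGame">
  <!-- The items to loop over: one div per game on the page -->
  <list>
    <xpath expression="(//div[contains(@class,'final-state')])">
      <var name="webpage"/>
    </xpath>
  </list>
  <!-- Executed once per item; for now it just returns the current game -->
  <body>
    <var name="currGame"/>
  </body>
</loop>
```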



The <loop> processor has two parts. The first part is a <list> of items to loop over. The second part is a <body> that will be executed for each element of the list. Each time the <body> is executed, a variable called currGame (which is specified as "item" in the <loop> tag) will be set to the current element of the list. In this case, each <body> execution just returns the current item, so the result of the loop is just the list.

Notice that the <list> of items is given by the XPath "(//div[contains(@class,'final-state')])". That XPath returns a list of div elements. There's one div element for each game on the page, and the div has the team names and scores inside of it. (The visiting team name we pulled out earlier is inside this div.)

So now, each time through the loop we need to pull out the team names and scores for currGame. currGame contains a chunk of XML, so we can once again use XPath to do this. Then we'll store each item in its own variable:
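
A sketch of the loop at this stage. It matches the loop in the completed script at the end of this post, minus the file output, and assumes the "webpage" variable is defined earlier in the same config.

```xml
<loop item="currGame">
  <list>
    <xpath expression="(//div[contains(@class,'final-state')])">
      <var name="webpage"/>
    </xpath>
  </list>
  <body>
    <!-- Each XPath below runs against the current game, not the whole page -->
    <var-def name="visitor">
      <xpath expression="(//div[@class='team visitor'])[1]//a[@title]/text()">
        <var name="currGame"/>
      </xpath>
    </var-def>
    <var-def name="visitorScore">
      <xpath expression="(//li[@class='final'])[2]//text()">
        <var name="currGame"/>
      </xpath>
    </var-def>
    <var-def name="home">
      <xpath expression="(//div[@class='team home'])[1]//a[@title]/text()">
        <var name="currGame"/>
      </xpath>
    </var-def>
    <var-def name="homeScore">
      <xpath expression="(//li[@class='final'])[3]//text()">
        <var name="currGame"/>
      </xpath>
    </var-def>
  </body>
</loop>
```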



Each var-def in the body of the loop uses an XPath expression to pull out a particular piece of the data. You might want to experiment with the XPaths to see how each of them finds the right piece of information.

If you run this and look at the value of the loop after it is complete, you'll see this:



The value of the loop is a list of all the values of the body as it is executed, and the value of each body is just the list of the values of the processors in the body (four var-def processors in this case). It all gets mashed together and you end up with a long list of team names and scores.

 

Format and Output


To make this more useful, let's clean up the format of the game data and write it out to a file. We can format using the <template> processor that we saw last time, and to output we use the same <file> processor we used to read a file. Every time through the loop we'll add a line to the file for the game we just processed:
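
A sketch of the output step, taken from the completed script at the end of this post. It is meant to go at the end of the loop <body>, after the four var-def processors, so that the visitor, visitorScore, home and homeScore variables are already set for the current game.

```xml
<!-- Append one formatted line for the current game to scores.txt -->
<file action="append" type="text" path="scores.txt">
  <template>
    ${visitor} ${visitorScore} ${home} ${homeScore} ${sys.cr}${sys.lf}
  </template>
</file>
```

Using action="append" rather than overwriting means each pass through the loop adds to the file instead of replacing the previous game's line.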



The ${sys.cr} and ${sys.lf} are built-in Web-Harvest values (constants on its sys object) that put a carriage-return/line-feed at the end of every line. The output file looks like this:




Conclusion


This tutorial should give a general idea of how Web-Harvest works and some of the basic tools it offers for scraping information out of web pages. More help can be found online at the Web-Harvest documentation as well as the Web-Harvest forums.

Here is the completed Web-Harvest script, for cut & paste purposes:

<?xml version="1.0" encoding="UTF-8"?>

<config>
  <!-- Read the date from a file. Note that the URL below hard-codes 20121124 rather than using this variable. -->
  <var-def name="datestring">
    <file action="read" path="date.txt"></file>
  </var-def>

  <!-- Fetch the scoreboard page and convert the HTML to XML so XPath can be run against it. -->
  <var-def name="webpage">
    <html-to-xml>
      <http url="http://scores.espn.go.com/ncb/scoreboard?date=20121124"/>
    </html-to-xml>
  </var-def>

  <!-- Loop over the div for each game, pulling out the teams and scores. -->
  <loop item="currGame">
    <list>
      <xpath expression="(//div[contains(@class,'final-state')])">
        <var name="webpage"/>
      </xpath>
    </list>
    <body>
      <var-def name="visitor">
        <xpath expression="(//div[@class='team visitor'])[1]//a[@title]/text()">
          <var name="currGame"/>
        </xpath>
      </var-def>
      <var-def name="visitorScore">
        <xpath expression="(//li[@class='final'])[2]//text()">
          <var name="currGame"/>
        </xpath>
      </var-def>
      <var-def name="home">
        <xpath expression="(//div[@class='team home'])[1]//a[@title]/text()">
          <var name="currGame"/>
        </xpath>
      </var-def>
      <var-def name="homeScore">
        <xpath expression="(//li[@class='final'])[3]//text()">
          <var name="currGame"/>
        </xpath>
      </var-def>

      <!-- Append one line per game to the output file. -->
      <file action="append" type="text" path="scores.txt">
        <template>
          ${visitor} ${visitorScore} ${home} ${homeScore} ${sys.cr}${sys.lf}
        </template>
      </file>
    </body>
  </loop>
</config>

4 comments:

  1. Scott
    Thanks. I might finally be able to understand the Web Harvest manual now.

  2. Thanks a lot! A good start for me to learn it:)

  3. Web Scraping Services or website scraping service is like a boon to grow business and reach your business to new heights and success. Website scraping services is nothing but a process of extracting data from website for your business need.

  4. Thanks a lot! This is a very tutorial to start learning WebHarvest. All of a sudden the Web Harvest manual is now more sense.
