"Web scraping" is the process of crawling over a web site, downloading web pages intended for human consumption, extracting information, and saving it in a machine-readable format. With the advent of the Web 2.0 and services-based architectures, web scraping has largely fallen into disuse, but it is still required/handy in situations such as this.
There are a number of web scraping tools available, in various states of functionality and repair. Many are frameworks or libraries intended to be embedded in languages like Python. Others are commercial. For my purposes, I wanted a stand-alone, open-source tool with fairly powerful features and a graphical interface. I ended up settling on Web-Harvest. Web-Harvest is written in Java, so it runs on nearly any platform, and it can also be embedded in Java programs. Support is sketchy -- the last release is from 2010, and the documentation is terse -- but there is a moderately active user community.
In the rest of this blog post (or series) I'll describe how to use Web-Harvest as a stand-alone tool to scrape a website. (I won't cover how to use Web-Harvest from inside a Java program.)
Obtaining and Installing Web-Harvest

Installing Web-Harvest is trivial. Download the latest "Single self-executable JAR file" from the website here. Put that JAR file somewhere on your computer and then double-click on it. Presuming you have Java correctly installed, after a few moments the Web-Harvest GUI will pop up:
Notice that you can download and open some examples. Under the Help menu (or with F1) you'll find the Web-Harvest manual. You can also read this online here.
A Useful Note: Version 2.0 of Web-Harvest has a memory leak bug. This can cause the tool to use up all available memory and hang when downloading and processing a large number of web pages. (Say, a whole season's worth of basketball games :-) You can somewhat minimize this problem by starting Java with a larger memory allocation, using the "-Xms" and "-Xmx" options. How to do this will vary slightly depending upon your operating system and where things are installed. On my Windows machine, I use a command line that looks something like this:
C:\WINDOWS\system32\javaw.exe -Xms1024m -Xmx1024m -jar "webharvest_all_2.jar"

On Windows you can create a shortcut and set the "Target" to be the proper command line. (On Linux or macOS, the equivalent is java -Xms1024m -Xmx1024m -jar webharvest_all_2.jar, assuming java is on your path.) However, even with this workaround, Web-Harvest will eventually hang. The only choice then is to quit and restart.
Initial Set-Up

After you've downloaded and installed Web-Harvest, there are one or two things you should set before continuing. Open the Web-Harvest GUI as above, and on the Execution menu, select Preferences. This should open a form like this:
First of all, use this form to set an "Output Path". This is the folder (directory) where Web-Harvest will look for input files and write output files. (You can use absolute path names as well, but if you don't, this is where Web-Harvest will try to find things.) There's no way to change this setting from within a Web-Harvest script, so if different scripts need different folders, you'll have to remember to change it here before running each script.
Second, if you need to use a proxy, this is where you can fill in that information.
A Basic Configuration File

The script that tells Web-Harvest what to do is called a "configuration file" and is written in XML. Using XML might seem like an odd choice, and if you haven't used it before you may find it confusing. But it's very similar to HTML, and it's very easy for a machine to read. Since much of what we'll be doing in web scraping is manipulating HTML, there's a certain symmetry in writing our manipulation scripts in something very like HTML.
To get started on your first script (configuration file), click on the blank page icon in the upper left of the Web-Harvest GUI. You should see this:
The right side of this display is your configuration file. Right now it has just a skeleton. You'll fill in your actual web scraping instructions between the <config> and </config> tags.
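The generated skeleton is just an empty config element, something like this (the exact XML header may vary by version):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>

</config>
```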
Now type the following inside the config tags:
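A configuration matching the processors described next -- one that fetches the Google home page, converts it to XML, and stores the result in a variable named google -- looks something like this (the indentation is optional):

```xml
<var-def name="google">
    <html-to-xml>
        <http url="http://www.google.com"/>
    </html-to-xml>
</var-def>
```

Once you've typed this in, run the script with the green "play" button (or the Execution menu).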
There are three commands (what Web-Harvest calls "processors") in this configuration file: var-def, html-to-xml, and http. Reading these from the inside outwards, this is what they do:
- The innermost processor, http, fetches the web page given in the url attribute -- in this case, the Google home page.
- The next processor, html-to-xml, takes the web page, cleans it up a bit and converts it to XML.
- The last processor, var-def, defines a new Web-Harvest variable named google and gives it the value of the XML returned by the html-to-xml processor.
The top left pane now shows you a tree of all the processors in your script. The bottom pane shows a trace of the execution. Clicking on any of the processors in the top left pane will show you the results of that processor. For example, click on "http" and you'll see this:
The middle left pane now shows you all the results of running that processor. For the http processor, this includes information such as the status code, the header values and so on. You can click the little magnifying glass on the right of each value to see the full value. Of special note is the information named "[Value]" at the bottom of the list -- this is the value that the processor returns to the next processor outwards. If you click on the magnifying glass next to this, you'll get a window showing this:
This is just the HTML for the Google home page -- it's what the http processor fetched from the Web. Try clicking on the "html-to-xml" processor and see how the same web page looks encoded as XML. (Pretty much the same in this case.)
Conclusion

So far I've shown how to get Web-Harvest installed and how to create a simple script that downloads a web page. Next time I'll go into more detail about using the various Web-Harvest features to extract and save information from a web page.
You can continue with Part 2 here.