Wednesday, February 13, 2013

Web-Harvest Tutorial (Part 2)

I'll pick up this Web-Harvest tutorial where I left off in Part 1.  In the first part, I showed how to install Web-Harvest and use it to create a simple script (configuration file) to download the Google home page.  This time I'll create a more realistic script.

Reading a File

To begin with, let's look at how we can read information from a file into Web-Harvest.  In this example, I'm going to assume we have a file in our working directory called "date.txt" and that file contains a single line with a date in the format YYYYMMDD, e.g., 20121110.  Go to your working directory and create that file.  Then open up Web-Harvest, start a new configuration file, and type in this script:
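In plain text, a script matching this description would look something like the sketch below.  It assumes the processors sit inside the config root element that Web-Harvest creates for a new configuration file, and that date.txt is in the working directory.

```xml
<!-- Sketch: place inside the <config> root element of the new configuration. -->
<var-def name="datestring">
    <!-- Read date.txt from the working directory.
         "read" is the file processor's default action, so the attribute is optional. -->
    <file action="read" path="date.txt"/>
</var-def>
```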

The file processor reads the contents of the "date.txt" file and provides that as a result to the outer processor.  In this case, that's the var-def processor that is creating the "datestring" variable.  The result is that the datestring variable will be created and its value will be the contents of the date.txt file.  To see this, hit the green "Run" arrow, and then examine the datestring variable:

And "Voila!" the datestring variable has a value of "20121110" -- the date that we typed into the date.txt file.

Regular Expressions

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.  (Jamie Zawinski)
Jamie Zawinski's famous and generally sound advice notwithstanding, regular expressions are a significant element in the Web-Harvest toolbox.  This makes sense -- much of what we do in screen scraping is manipulating text, and regular expressions are very good at that task.  It's beyond my interest (and probably, ability) to teach you regular expressions.  You'll have to find other resources for that.  But I recommend using an online regular expression tester to help you debug your regular expressions.  (Remember that Web-Harvest is implemented in Java, so it uses the Java regular expression syntax.)

For a simple example, we'll use a regular expression to pull the year out of the date we've read from the date.txt file.   The year is the first four digits of the date, and digits in regular expressions are represented as \d.  Here's the script to use a regular expression to pull the year out of the datestring variable and store it in a new variable called year:
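As a plain-text sketch, such a script could look like the following.  It assumes the datestring variable is defined as in the previous section; the layout and indentation are a guess, but the processor names are the ones discussed in the text.

```xml
<!-- Sketch: place inside the <config> root element. -->
<var-def name="datestring">
    <file path="date.txt"/>
</var-def>

<var-def name="year">
    <regexp>
        <!-- One capturing group: the first four digits. -->
        <regexp-pattern>^(\d\d\d\d)</regexp-pattern>
        <!-- The string to match against: the contents of date.txt. -->
        <regexp-source>
            <var name="datestring"/>
        </regexp-source>
        <!-- What the regexp processor hands back: the first matched group. -->
        <regexp-result>
            <template>${_1}</template>
        </regexp-result>
    </regexp>
</var-def>
```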

This example introduces a couple of new processors.  The first is the regexp processor, which has three parts: the regexp-pattern, the regexp-source, and the regexp-result.  The regexp-pattern portion holds the regular expression we're trying to match.  In this case, it is the expression "^(\d\d\d\d)" which means "a group of four digits at the very beginning of the text".  The regexp-source provides the string against which we'll try to match the pattern.  In this case, it is the value of the datestring variable, which is the contents of the date.txt file from the previous step of the configuration file.  Finally, the regexp-result portion determines what the result of the regular expression will be -- that is, what value it will feed back up to the next processor.

As you can see, inside of regexp-result we have another processor -- template.  Template basically returns whatever is inside of it.  So if you wrote <template>Test</template> the result would simply be the string "Test".  However -- and this is the useful part -- anything enclosed inside ${ } will be evaluated in Javascript and the result of the Javascript will be injected into the template.  So if you wrote <template>Today is the ${sys.datetime("dd")}th</template> you'd get back "Today is the 13th" (or whatever the current day is).

Web-Harvest defines a number of useful variables inside Javascript.  One of these is _1, which is the value of the first matched group in a regular expression.  Because the _1 in our template is enclosed in ${ } it is evaluated in Javascript and is replaced with the first matched group in the regular expression.  So in this case, our template returns "2012".

Finally, the regexp processor returns the value of the regexp-result part, and the year variable gets set to "2012".  (As you can see in the above screenshot.)

Here's a slightly more complicated example that uses a regular expression to reformat the date in US format.  See if you can figure it out:
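One way to write it, as a sketch: capture the year, month, and day as three separate groups and rearrange them in the template.  The variable name usdate is made up for this illustration, and _2 and _3 are assumed to work like _1 for the second and third groups.

```xml
<!-- Sketch: place inside the <config> root element, after datestring is defined. -->
<var-def name="usdate">
    <regexp>
        <!-- Three groups: year (4 digits), month (2 digits), day (2 digits). -->
        <regexp-pattern>^(\d\d\d\d)(\d\d)(\d\d)</regexp-pattern>
        <regexp-source>
            <var name="datestring"/>
        </regexp-source>
        <!-- Reorder the groups as month/day/year. -->
        <regexp-result>
            <template>${_2}/${_3}/${_1}</template>
        </regexp-result>
    </regexp>
</var-def>
```

For a datestring of "20121110" the groups are "2012", "11", and "10", so the template works out to "11/10/2012".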

Fetching a Webpage

Now we'll return to fetching webpages.  Look at the following script:
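A plain-text sketch of a script of this shape follows.  The URL here is a placeholder on example.com, not the address used in the tutorial, and the variable name page is made up for the illustration; the point is only the "${datestring}" at the end of the url attribute.

```xml
<!-- Sketch: place inside the <config> root element. -->
<var-def name="datestring">
    <file path="date.txt"/>
</var-def>

<var-def name="page">
    <!-- Placeholder URL (hypothetical); ${datestring} is filled in before the request is made. -->
    <http url="http://www.example.com/scores?date=${datestring}"/>
</var-def>
```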

As before, we read in the datestring from a file.  And as in Part 1, we use the http processor to fetch a web page.  But notice the url:
The end of the URL is "${datestring}".  In processor attributes, just as in the template processor, anything enclosed in ${ } is evaluated in Javascript.  In this case, "${datestring}" is replaced with the value of the datestring variable -- which is the "20121110" we read from the date.txt file.  So the resulting URL ends in "20121110".  This leads (as you might have guessed) to the college basketball results from 11/10/2012.


We now have some basic tools for fetching and manipulating data.  Next time we'll get to the real work of pulling information out of a web page.

(You can find Part 3 here.)


  1. Thank you so much!

    My date.txt file is not recording any data from running the scripts in the tutorial. When that didn't work, I tried making another date.txt file with an explicit path to the desktop, c:\users\MYUSERNAME\desktop\date.txt. The script still didn't write to the txt file.

    What do you suggest?


    1. "date.txt" is actually an input file, not an output file. It should have one line in it that holds a date in YYYYMMDD format, e.g., 20121110.

      Sorry if that wasn't clear. The output comes in Part 3.

  2. It would be very helpful if you could put the conf XML scripts in the lessons as plain text or as a download... copying them from images is very time-consuming when things become more complex.

  3. great tutorial! It's really hard to find a web-harvest tutorial so detailed as this one. Thanks!