Reading a File
To begin with, let's look at how we can read information from a file into Web-Harvest. In this example, I'm going to assume we have a file in our working directory called "date.txt" and that the file contains a single line with a date in the format YYYYMMDD, e.g., 20121110. Go to your working directory and create that file. Then open up Web-Harvest, start a new configuration file, and type in this script:
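A minimal version of such a script might look like the sketch below. It assumes Web-Harvest 2.0 syntax, a plain config root element, and that date.txt sits in the configured working directory; adjust the path if your setup differs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <!-- Read date.txt (resolved against the working directory) and
         store its contents in a variable named "datestring". -->
    <var-def name="datestring">
        <file action="read" path="date.txt"/>
    </var-def>
</config>
```

The file processor is nested inside var-def, so whatever the file processor returns becomes the value of the variable.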
The file processor reads the contents of the "date.txt" file and provides that as a result to the outer processor. In this case, that's the var-def processor that is creating the "datestring" variable. The result is that the datestring variable will be created and its value will be the contents of the date.txt file. To see this, hit the green "Run" arrow, and then examine the datestring variable:
And voilà: the datestring variable has a value of "20121110" -- the date that we typed into the date.txt file.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski)
Jamie Zawinski's famous and generally sound advice notwithstanding, regular expressions are a significant element in the Web-Harvest toolbox. This makes sense -- much of what we do in screen scraping is manipulating text, and regular expressions are very good at that task. Teaching regular expressions is beyond my interest (and probably my ability), so you'll have to find other resources for that. I do recommend using a regular expression tester to help you debug your expressions. (Remember that Web-Harvest is implemented in Java, so it uses the Java regular expression syntax.)
For a simple example, we'll use a regular expression to pull the year out of the date we've read from the date.txt file. The year is the first four digits of the date, and digits in regular expressions are represented as \d. Here's the script to use a regular expression to pull the year out of the datestring variable and store it in a new variable called year:
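One way to write such a script is sketched below. It assumes Web-Harvest 2.0 syntax and that the captured group is referenced in the result as ${_1} inside a template processor; the datestring definition is repeated so the configuration runs on its own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <!-- Same as before: load the date string from the file. -->
    <var-def name="datestring">
        <file action="read" path="date.txt"/>
    </var-def>

    <!-- Capture the first four digits and store them as "year". -->
    <var-def name="year">
        <regexp>
            <regexp-pattern>^(\d\d\d\d)</regexp-pattern>
            <regexp-source>
                <var name="datestring"/>
            </regexp-source>
            <regexp-result>
                <!-- _1 refers to the first parenthesized group -->
                <template>${_1}</template>
            </regexp-result>
        </regexp>
    </var-def>
</config>
```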
This example introduces a couple of new processors. The first is the regexp processor, which has three parts: the regexp-pattern, the regexp-source, and the regexp-result. The regexp-pattern portion holds the regular expression we're trying to match. In this case, it is the expression "^(\d\d\d\d)", which means "four digits at the very beginning of the text, captured as a group". The parentheses are what make the four digits a group that we can refer to later. The regexp-source provides the string against which we'll try to match the pattern. In this case, it is the value of the datestring variable, which is the contents of the date.txt file from the previous step of the configuration file. Finally, the regexp-result portion determines what the result of the regular expression will be -- that is, what value it will feed back up to the next processor.
The regexp processor then returns the value of the regexp-result part, and the year variable gets set to "2012". (As you can see in the above screenshot.)
Here's a slightly more complicated example that uses a regular expression to rewrite the date in US format, with the month first. See if you can figure it out:
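A sketch of one possible version follows. The variable name usdate and the MM/DD/YYYY layout with slashes are assumptions made for illustration. The pattern captures year, month, and day as three separate groups, and the template puts them back together in a different order.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <var-def name="datestring">
        <file action="read" path="date.txt"/>
    </var-def>

    <!-- "usdate" is a hypothetical variable name. -->
    <var-def name="usdate">
        <regexp>
            <!-- Group 1 = year, group 2 = month, group 3 = day -->
            <regexp-pattern>^(\d\d\d\d)(\d\d)(\d\d)</regexp-pattern>
            <regexp-source>
                <var name="datestring"/>
            </regexp-source>
            <regexp-result>
                <!-- Reorder the groups: month/day/year -->
                <template>${_2}/${_3}/${_1}</template>
            </regexp-result>
        </regexp>
    </var-def>
</config>
```

With 20121110 in date.txt, the three groups are 2012, 11, and 10, so the template should yield 11/10/2012.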
Fetching a Webpage
Now we'll return to fetching webpages. Look at the following script:
As before, we read in the datestring from a file. And as in Part 1, we use the http processor to fetch a web page. But notice the url:
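The address you use will depend on the site you are scraping. As a sketch of the technique only, the configuration below assumes a hypothetical site at www.example.com that accepts the date as a query parameter, and a hypothetical variable name page. The point is that a ${...} expression inside the url attribute is replaced with the variable's value before the request is made.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <var-def name="datestring">
        <file action="read" path="date.txt"/>
    </var-def>

    <!-- Hypothetical URL: the date read from the file is spliced
         into the address via ${datestring}. -->
    <var-def name="page">
        <http url="http://www.example.com/archive?date=${datestring}"/>
    </var-def>
</config>
```

Note that everything in date.txt ends up in the URL, so this sketch assumes the file holds only the eight digits, with no trailing spaces or line break.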
Conclusion
We now have some basic tools for fetching and manipulating data. Next time we'll get to the real work of pulling information out of a web page.
(You can find Part 3 here.)