Net Prophet: February 2013

Monday, February 25, 2013

Comparative Tournament Odds

A reader ("CS", who is apparently a cat) was curious how the PM's ratings compared to the Vegas odds for winning the tournament. I put together a very quick, very hacky comparison, and here are the top 25:

Team	Odds	PM Odds	Ratio
Indiana	400	383	104%
Florida	600	328	183%
Duke	700	4314	16%
Michigan	700	1828	38%
Gonzaga	800	1976	40%
Miami FL	800	19018	4%
Kansas	1000	1976	51%
Louisville	1000	2498	40%
Michigan St	1000	6785	15%
Georgetown	1500	10104	15%
Syracuse	1500	3690	41%
St Louis	2500	11720	21%
Ohio St	2500	3413	73%
Arizona	2500	11812	21%
New Mexico	4000	83840	5%
Missouri	5000	25388	20%
Kansas St	5000	10589	47%
Oklahoma St	5000	7055	71%
Wisconsin	6000	2920	205%
North Carolina	6000	20244	30%
North Carolina St	6000	20244	30%
UCLA	6000	36645	16%
Pittsburgh	6000	7279	82%
Butler	7500	44197	17%

Likely the whole comparison doesn't mean anything, but the ratio might say something about which teams the PM thinks are undervalued or overvalued. This assumes that Indiana is more-or-less correctly valued. Not surprisingly, the PM thinks Florida is undervalued, Wisconsin even more so, and teams like tOSU, Oklahoma State and Pittsburgh as well. Very overvalued are Miami and New Mexico, and the significantly overvalued group includes Duke, MSU, Georgetown, UCLA and Butler. In that last group you can see a common thread of teams where bettors might be over-enthusiastic because of the popularity of the program, recent performance, etc.
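
The Ratio column appears to be the Vegas odds divided by the PM odds; that is an inference from the table values rather than a stated definition. Two rows worked through under that assumption:

```latex
\[
\text{Ratio} = \frac{\text{Vegas odds}}{\text{PM odds}}, \qquad
\text{Indiana: } \frac{400}{383} \approx 1.04 = 104\%, \qquad
\text{Miami FL: } \frac{800}{19018} \approx 0.04 = 4\%
\]
```

A ratio far below Indiana's means the PM's odds are much longer than the Vegas odds, which is why such teams read as overvalued by bettors.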

Saturday, February 23, 2013

Prediction Recap

#2 Miami (FL) @ Wake Forest: Miami by 7

Miami gets pummeled by Wake Forest, losing by 15. Even though Miami was overrated at #2, this is an inexplicable loss. Wake Forest is not a terrible team, but this is the sort of game you need to win to compete with the Dukes and North Carolinas. You certainly don't want to be blown out. At 13-1, Miami remains in control of the ACC race, but they'll likely lose at Duke, so dropping another game would be disastrous.

San Diego @ #3 Gonzaga: Gonzaga by 24

Gonzaga at least holds true to form, winning by 31.

Arkansas @ #5 Florida: Florida by 17

Florida wins by ... 17. A solid performance. At some point the AP voters will figure out that Florida is really strong.

#11 Georgetown @ #8 Syracuse: Syracuse by 10

Georgetown has been improving over the last half of the season, but beating Syracuse at home by 11 points is still very impressive.

TCU @ #9 Kansas: Kansas by 25

Kansas by ... 26. TCU didn't put up much of a struggle.

Seton Hall @ #10 Louisville: Louisville by 17

Louisville by ... 18.

On the undercard, both teams that vaulted into the Prediction Machine's Top Twenty this week won big games. St. Louis beat #15 Butler to sweep the season series. St. Louis lost the first half, so the PM won't be overly impressed, but the win should keep St. Louis solidly in the Top Twenty. Kansas State beat Texas -- not a tough task -- to keep on track for a season-ending showdown with Oklahoma State which may decide the Big-12 regular season champion.

Friday, February 22, 2013

Top Twenty (2/22)

Ranking	Delta	Team	Rating
1		Florida	135
2		Indiana	133
3		Michigan	114
4	+3	Gonzaga	111
5	-1	Kansas	111
6	-1	Louisville	107
7	+2	Wisconsin	107
8		Syracuse	106
9	-3	Ohio St.	105
10	+1	Duke	101
11	+3	Michigan St.	96
12		Oklahoma St.	95.1
13	-3	Pittsburgh	94.9
14	+2	Iowa	93
15	NR	Saint Louis	90.4
16	NR	Kansas St.	89.5
17	+2	Georgetown	89.4
18	-5	Minnesota	89.4
19	-2	Miami (FL)	88.5

No change in the Top Three from last week, which surprises me a bit. Indiana performed about as expected, while Florida over-performed against Auburn and under-performed against Missouri, with the result that they treaded water at the top of the rankings. (Michigan State moved up in large part by outplaying Indiana in the 2nd half of the loss.) The Billikens, who were hanging just off the bottom of the Top Twenty last week, vaulted all the way to #15 based on strong wins against Charlotte and VCU. St. Louis gets Butler tonight in an important A-10 matchup. The PM favors Butler by a mere 1.5, so it should be a good game. The biggest surprise move is KSU, vaulting upwards after wins over mediocre WVU and Baylor.

Still a big gap between Florida/Indiana and the rest of the field. They could well meet up in the Championship game if they start in opposite halves of the bracket.

AP Top Ten Matchups this Saturday:

#2 Miami (FL) @ Wake Forest: Miami by 7
San Diego @ #3 Gonzaga: Gonzaga by 24
Arkansas @ #5 Florida: Florida by 17
#11 Georgetown @ #8 Syracuse: Syracuse by 10
TCU @ #9 Kansas: Kansas by 25
Seton Hall @ #10 Louisville: Louisville by 17

It looks to be a day of blowouts, although the Miami game might be interesting. Georgetown-Syracuse is being touted as the big matchup, but it's likely not going to be that competitive. Syracuse is significantly better than Georgetown and playing at home, and the combination is likely too much for the Hoyas to overcome.

Wednesday, February 20, 2013

Predictive Ratings vs. Achievement Ratings

In this recent posting, I commented that the AP voters were continuing to under-rank Florida. The PM ranked Florida #1 (in a virtual tie with Indiana), while the AP voters had them at #7. In a comment on that posting (and in his own blog postings here and here), Monte McNair makes a distinction between predictive ratings and achievement ratings. The former are measured by how well they predict future games; the latter by how well they reflect what teams have accomplished.

The question I want to ponder for a moment is whether that's a meaningful distinction.

As a mental exercise, imagine that we decided to create a rating system for the sport of competitive lifting. Based upon how much weight competitors lifted in various competitions, we'd assign them a rating that would reflect their weightlifting ability. What would this rating represent? Most people would say that it represents (or measures) the "strength" of the competitor.

That seems like a silly exercise for weightlifting, because we already have a direct measure of the competitors' strength -- how much they lifted. It makes more sense for sports like basketball, where we understand that the final score doesn't directly measure a team's "basketball strength" but is instead a complicated function of the two teams' strengths and factors like the officiating crew, the venue, and so on. The rating is intended to tease out the hidden variable -- the team's basketball strength -- which cannot be measured directly. So a rating represents a team's basketball strength, and usually a higher number represents more strength.

Now let's return to the distinction between predictive ratings and achievement ratings.

There's an easy and intuitive understanding of how to assess a predictive rating: We test its ability to predict future games. A rating that does that better is a better predictive rating. The achievement rating looks backward rather than forward, so we should assess an achievement rating by how well it predicts past games. A rating that does that better is a better achievement rating.

Here's the rub, though: those are the same things! The more accurately a rating reflects the true "basketball strength" of a team, the better it will perform predicting all of the team's games -- whether they have already occurred or are in the future.

Monte also argues in this posting that:

When assessing how well a team has played over a season, the only factors that should come into play are: (1) how often did you win and (2) how difficult was your schedule.

I think there's a simple counter-example to this notion. Imagine that going into the last game of conference play, Indiana and Michigan have played exactly the same schedule of opponents, and they're about to play each other. They each played Butler in the third game of the season, but I won't tell you how that game came out. In all the other games, Indiana beat each opponent by at least 12 points, while Michigan never won by more than 6 points and went to OT in three of the games.

Now I ask you two questions: (1) Who was more likely to have won against Butler when they played early in the season? and (2) Who is more likely to win when they play each other tonight on a neutral floor?

My guess is that almost everyone would answer Indiana to both questions -- which means that Indiana should be rated higher than Michigan. Regardless of whether you're trying to assess what a team has already achieved or how it might perform in a future game, how a team wins (or loses) a game is very important.

Of course, you may reject the notion that ratings should reflect a team's "basketball strength". But then I challenge you to express clearly what a rating should mean. I think you'll find it very hard to find a meaningful definition that doesn't come back to being an accurate measure of a team's strength.

Sunday, February 17, 2013

Prediction Review

How did the PM do this Saturday?

#1 Indiana by 19 over Purdue

Indiana actually won by 28.

#2 Duke by 5 over Maryland

Gary Williams is gone, so don't expect one of his patented Maryland surprises.

Fear the Turtle! Gary Williams' ghost (what, he isn't dead?) must have inspired the Terps, because they handed Duke a patented unpleasant surprise at what might be the last Duke-Maryland matchup at Comcast Center.

#5 Gonzaga by 12.5 over San Francisco

Gonzaga on cruise control.

Gonzaga by 10

#6 Syracuse by 8 over Seton Hall

Syracuse by 11

#7 Florida by 16 over Auburn

Florida remains amazingly under-ranked and Auburn is turrible.

The actual margin was 31. Hopefully the AP voters will take notice.

#8 MSU by 9 over Nebraska

Nebraska also turrible.

MSU by 9.

#10 KSU by 4 over Baylor

A good chance for an upset in this game -- KSU is very over-ranked in the polls.

KSU by 20. KSU controlled the second half and throttled Baylor in what is a very nice result for them.

So not a terrible day for the PM. It missed one upset and got several games within one basket.

Saturday, February 16, 2013

The Effect of Overtime Games

Notre Dame just played an OT game against DePaul, just four days after a marathon 4 OT game against Louisville, and their 3rd OT in the last five games. This prompted the question of whether playing OT games "tires out" teams and affects their performance in subsequent games.

There are various ways you might look at this question, but an easy one is to look at scoring averages for teams after they play an OT game and see if they differ significantly from the scoring average when they haven't just played an OT game. Ignoring neutral court games, I looked at how many points home teams scored and gave up for both cases:

	No OT	After OT
Home Score	70.3	70.2
Away Score	65.3	66

As you can see, there's a mild effect, particularly on the defensive side of the ball, which adds up to about a 1-point difference in MOV. There are about 1200 games in my database that meet the "home team's last game went to OT" criterion, so this isn't a huge sample.
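
To make that concrete, here is the margin arithmetic from the table, taking MOV to be the home score minus the away score (an assumption about how the table is meant to be read):

```latex
\[
\text{MOV}_{\text{no OT}} = 70.3 - 65.3 = 5.0, \qquad
\text{MOV}_{\text{after OT}} = 70.2 - 66.0 = 4.2, \qquad
\Delta = 5.0 - 4.2 = 0.8
\]
```

Almost all of that 0.8-point gap comes from the extra 0.7 points allowed rather than the 0.1 fewer points scored.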

For what it's worth, since 2009 there have been 11 Tournament games where the home team's previous game went to OT. In those games, the home team averaged 69.5 points and gave up 65.5 points.

Friday, February 15, 2013

On Florida and Indiana

Florida and Indiana have been at the top of the PM's power rankings since the beginning of the year, and in recent weeks have opened a significant lead (about 20 points) on the rest of the field. In the polls Indiana has been in the top two or three consistently, but Florida languishes down in the lower half of the Top Ten. Meanwhile, Miami (which has an identical 19-3 record against weaker opponents) has bounced up to #3.

No matter. Polls are mostly entertainment, and the Tournament will crown a de facto champion. But Matt Woods over at TeamRankings.com has an interesting piece putting Indiana and Florida in a historical perspective:

As of February 15th, 2013, Indiana and Florida lead the nation in average scoring margin with +21.7 and +21.3, respectively. The next closest team is Pittsburgh with +16.6. To put those numbers in perspective, the last team to have a higher average scoring margin at this point of the season was 2001 Duke (+22.6).

Scoring margin alone isn't a good predictor of strength because it doesn't account for strength of schedule. But Indiana and Florida are playing very good opponents, so their scoring margin is significant -- just more evidence that these two teams really are the cream of the crop this year.

Top Twenty (2/15)

I'm on a tight schedule this evening, so this will be a rushed posting. Here's this week's Top Twenty:

Rank	Team	Score
1	Florida	134
2	Indiana	133
3	Michigan	117
4	Kansas	110
5	Louisville	107
6	Ohio St.	107
7	Gonzaga	105
8	Syracuse	102
9	Wisconsin	100
10	Pittsburgh	100
11	Duke	99.4
12	Oklahoma St.	96.3
13	Minnesota	96.1
14	Michigan St.	94.9
15	Cincinnati	91.2
16	Iowa	89.4
17	Miami (FL)	89
18	Arizona	87.3
19	Georgetown	85.9
20	Memphis	85.5

I don't have time to compare closely to last week, but there's some interesting movement.

And some Top Twenty predictions for the Saturday games:

#1 Indiana by 19 over Purdue

Likely to be an unholy thrashing.

#2 Duke by 5 over Maryland

Gary Williams is gone, so don't expect one of his patented Maryland surprises.

#5 Gonzaga by 12.5 over San Francisco

Gonzaga on cruise control.

#6 Syracuse by 8 over Seton Hall

#7 Florida by 16 over Auburn

Florida remains amazingly under-ranked and Auburn is turrible.

#8 MSU by 9 over Nebraska

Nebraska also turrible.

#10 KSU by 4 over Baylor

A good chance for an upset in this game -- KSU is very over-ranked in the polls.

Web-Harvest Tutorial (Part 3)

I left off last time with a script that would read a date from a file and then fetch the ESPN Scoreboard web page for that date:
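
For reference, here is a sketch of that script, pieced together from Part 2 and the completed script at the end of this post. The html-to-xml wrapper and the webpage variable name are taken from the completed script; treat the exact layout as an assumption.

```xml
<?xml version="1.0" encoding="UTF-8"?>

<config>
    <!-- Read the date (YYYYMMDD) from date.txt in the working directory -->
    <var-def name="datestring">
        <file action="read" path="date.txt"></file>
    </var-def>
    <!-- Fetch the scoreboard page for that date and convert it to XML -->
    <var-def name="webpage">
        <html-to-xml>
            <http url="http://scores.espn.go.com/ncb/scoreboard?date=${datestring}"/>
        </html-to-xml>
    </var-def>
</config>
```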

Today I'll show how to pull information out of the webpage and save it -- specifically, we'll pull the teams and scores out of the page and save them off for later use.

Find the Information

The first step is to figure out where the information we want is in the web page. This is easier said than done on modern web pages, which tend to be impenetrable morasses of Javascript, HTML and CSS. One way to get started is to use the "View Source" option on your web browser (or save the web page onto your computer and view it with your favorite text editor) and then search for text you can see from the web page. For example, if we go to the ESPN Scoreboard page for 11/24/2012, we can see that the first listed game is Duke versus Louisville. If we do "View Source" and search for "Duke", we find this as the first reference:

<a title="Duke" href="http://espn.go.com/mens-college-basketball/team/_/id/150/duke-blue-devils">Duke</a>

And this, in fact, is the HTML code that creates the "Duke" text in the Duke vs. Louisville scoreboard. With some more digging, reformatting and so on, we can eventually see that the information about Duke is in an HTML structure that looks like this:

<div class="team visitor">
<div class="team-capsule">
      <span id="323290097-aTeamName">
    <a title="Duke">
    Duke</a>
      </span>
</div>
<ul id="323290097-aScores" class="score" style="display:block">
    <li class="final" id="323290097-awayHeaderScore">
      76</li>
</ul>
</div>

The information about Louisville is in a structure that is identical except that it starts with "team home" instead of "team visitor."

So now that we know where the information is, we need to pluck it out and put it to use.

The Power of XPath

Web-Harvest uses XPath extensively to dig information out of webpages. XPath is a notation for specifying where to find something in an XML file. It's a "path" from the top-level of the XML down to some particular piece (or pieces) of the XML. It's very useful and very powerful but, like regular expressions, can be confusing and difficult to use. If you don't know anything about XPath, you might want to go off and read a tutorial about it to familiarize yourself with how it works. It's also very useful to have an XPath tester for working out the correct paths for the information you're trying to get.

In fact, Web-Harvest itself provides a very handy XPath tester. To see its use, run the above script to fetch the ESPN page, and then use the left-hand pane to see the value of the "webpage" variable (also as shown above). Now click on the magnifier icon to the right of the "[Value]" box and you'll get a pop-up window showing the text of the webpage:

Notice the "View as:" option in the top left of the pop-up. Click here and select XML. This will show the webpage in XML format:

This view has a couple of handy features. First, you can use the "Pretty-Print" button at the top to reorganize and clean up the XML for easier viewing. Second, you'll see a box at the bottom labeled "XPath expression." If you type an XPath into this box, Web-Harvest will run that XPath against the displayed XML and show the result. For example, try typing the XPath //div[@class="team visitor"] into the box. This expression finds all div elements in the page that have the class "team visitor":

This matches a total of 15 div elements on this page, the first of which is the Duke entry we found above.

When an XPath returns a list of items, we can pick items out of the list in various ways, including using an index. To pick out the first element of this list, we use (//div[@class="team visitor"])[1]. That gives us the entire block of HTML for Duke that I showed earlier. If you look up there, you'll see the team name is within an <a title="Duke"> tag. We can pull that out by extending our XPath to say (//div[@class="team visitor"])[1]//a[@title] which essentially says "Give me all the <a> elements with a title attribute that are within the first div element with a class of team visitor". Try that out:

We've now narrowed the XPath down to just the <a> element containing the team name. We can extract the actual name by appending /text() to the end of our XPath. The text() function returns whatever text it finds inside the element selected by the XPath:

Here's how we'd use that same Xpath within Web-Harvest to pull out the name and save it in a variable:
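
A minimal sketch, assuming the fetched page is held in the webpage variable as in the completed script at the end of this post (the variable name visitor also comes from that script):

```xml
<!-- Store the first visiting team's name in the "visitor" variable -->
<var-def name="visitor">
    <!-- The xpath processor runs the expression against its body,
         which here is the XML held in the webpage variable -->
    <xpath expression="(//div[@class='team visitor'])[1]//a[@title]/text()">
        <var name="webpage"/>
    </xpath>
</var-def>
```

Note the single quotes inside the expression: the attribute value itself is wrapped in double quotes, so the XPath string literals have to use the other kind.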

You can experiment with creating the Xpaths to pull out the home team's name and the final scores of the game.

Looping

The Xpath example above works on the first element in the list of visitor team names, but what we really want to do is capture the team names and scores for all the games on the page. To do that, we will loop over each of the game sections in turn. Web-Harvest provides a processor for this called <loop>, which works about as you would imagine. It takes a list of elements and loops over them one at a time, and returns a list of the results. Here's the skeleton for looping over each of the games in turn:
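
A sketch of that skeleton, assuming the page is stored in the webpage variable. The list expression and the item name match the completed script at the end of this post; a body that simply returns currGame is an assumption on my part about the simplest possible loop.

```xml
<loop item="currGame">
    <!-- The list of things to loop over: one div per finished game -->
    <list>
        <xpath expression="(//div[contains(@class,'final-state')])">
            <var name="webpage"/>
        </xpath>
    </list>
    <!-- Executed once per game; here it just returns the current item -->
    <body>
        <var name="currGame"/>
    </body>
</loop>
```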

The <loop> processor has two parts. The first part is a <list> of items to loop over. The second part is a <body> that will be executed for each element of the list. Each time the <body> is executed, a variable called currGame (which is specified as "item" in the <loop> tag) will be set to the current element of the list. In this case, each <body> execution just returns the current item, so the result of the loop is just the list.

Notice that the <list> of items is given by the Xpath "(//div[contains(@class,'final-state')])". That Xpath returns a list of div elements. There's one div element for each game on the page, and the div has the team names and scores inside of it. (The visiting team name we pulled out earlier is inside this div.)

So now, each time through the loop we need to pull out the team names and scores for currGame. currGame contains a chunk of XML, so we can once again use Xpath to do this. Then we'll store each item in its own variable:

Each var-def in the body of the loop uses an Xpath expression to pull out a particular piece of the data. You might want to experiment with the Xpaths to see how each of them finds the right piece of information.

If you run this and look at the value of the loop after it is complete you'll see this:

The value of the loop is a list of all the values of the body as it is executed, and the value of each body is just the list of the values of the processors in the body (four var-def processors in this case). It all gets mashed together and you end up with a long list of team names and scores.

Format and Output

To make this more useful, let's clean up the format of the game data and write it out to a file. We can format using the <template> processor that we saw last time, and to output we use the same <file> processor we used to read a file. Every time through the loop we'll add a line to the file for the game we just processed:

The ${sys.cr} and ${sys.lf} are Javascript values that put a carriage-return/line-feed at the end of every line. The output file looks like this:

Conclusion

This tutorial should give a general idea of how Web-Harvest works and some of the basic tools it offers for scraping information out of web pages. More help can be found online at the Web-Harvest documentation as well as the Web-Harvest forums.

Here is the completed Web-Harvest script, for cut & paste purposes:

<?xml version="1.0" encoding="UTF-8"?>

<config>
<var-def name="datestring">
    <file action="read" path="date.txt"></file>
</var-def>
<var-def name="webpage">
    <html-to-xml>
        <http url="http://scores.espn.go.com/ncb/scoreboard?date=${datestring}"/>
    </html-to-xml>
</var-def>
<loop item="currGame">
    <list>
        <xpath expression="(//div[contains(@class,'final-state')])">
            <var name="webpage"/>
        </xpath>
    </list>
    <body>
        <var-def name="visitor">
            <xpath expression="(//div[@class='team visitor'])[1]//a[@title]/text()">
                <var name="currGame"/>
            </xpath>
        </var-def>
        <var-def name="visitorScore">
            <xpath expression="(//li[@class='final'])[2]//text()">
                <var name="currGame"/>
            </xpath>
        </var-def>
        <var-def name="home">
            <xpath expression="(//div[@class='team home'])[1]//a[@title]/text()">
                <var name="currGame"/>
            </xpath>
        </var-def>
        <var-def name="homeScore">
            <xpath expression="(//li[@class='final'])[3]//text()">
                <var name="currGame"/>
            </xpath>
        </var-def>
        <file action="append" type="text" path="scores.txt">
           <template>
               ${visitor} ${visitorScore} ${home} ${homeScore} ${sys.cr}${sys.lf}
           </template>
        </file>
    </body>
</loop>
</config>

Wednesday, February 13, 2013

Web-Harvest Tutorial (Part 2)

I'll pick up this Web-Harvest tutorial from where I left off in Part 1. In the first part, I showed how to install Web-Harvest and use it to create a simple script (configuration file) to download the Google home page. This time I'll create a more realistic script.

Reading a File

To begin with, let's look at how we can read information from a file into Web-Harvest. In this example, I'm going to assume we have a file in our working directory called "date.txt" and that file contains a single line with a date in the format YYYYMMDD, e.g., 20121110. Go to your working directory and create that file. Then open up Web-Harvest, start a new configuration file, and type in this script:
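
A minimal sketch of that script, based on the description that follows and on the completed script in Part 3; the exact layout is an assumption:

```xml
<?xml version="1.0" encoding="UTF-8"?>

<config>
    <!-- Create the datestring variable from the contents of date.txt -->
    <var-def name="datestring">
        <file action="read" path="date.txt"></file>
    </var-def>
</config>
```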

The file processor reads the contents of the "date.txt" file and provides that as a result to the outer processor. In this case, that's the var-def processor that is creating the "datestring" variable. The result is that the datestring variable will be created and its value will be the contents of the date.txt file. To see this, hit the green "Run" arrow, and then examine the datestring variable:

And "Voila!" the datestring variable has a value of "20121110" -- the date that we typed into the date.txt file.

Regular Expressions

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski)

Jamie Zawinski's famous and generally sound advice notwithstanding, regular expressions are a significant element in the Web-Harvest toolbox. This makes sense -- much of what we do in screen scraping is manipulating text, and regular expressions are very good at that task. It's beyond my interest (and probably, ability) to teach you regular expressions. You'll have to find other resources for that. But I recommend using a regular expression tester like this to help you debug your regular expressions. (Remember that Web-Harvest is implemented in Java, so it uses the Java regular expression syntax.)

For a simple example, we'll use a regular expression to pull the year out of the date we've read from the date.txt file. The year is the first four digits of the date, and digits in regular expressions are represented as \d. Here's the script to use a regular expression to pull the year out of the datestring variable and store it in a new variable called year:
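
A sketch of what that script looks like, built from the processor names described below. It goes inside the config tags, after the var-def that reads datestring.

```xml
<var-def name="year">
    <regexp>
        <!-- Four digits at the start of the line, captured as group 1 -->
        <regexp-pattern>^(\d\d\d\d)</regexp-pattern>
        <!-- The string to match against: the contents of date.txt -->
        <regexp-source>
            <var name="datestring"/>
        </regexp-source>
        <!-- What the regexp processor returns: the first matched group -->
        <regexp-result>
            <template>${_1}</template>
        </regexp-result>
    </regexp>
</var-def>
```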

This example introduces a couple of new processors. The first is the regexp processor, which has three parts: the regexp-pattern, the regexp-source, and the regexp-result. The regexp-pattern portion holds the regular expression we're trying to match. In this case, it is the expression "^(\d\d\d\d)" which means "a group of four digits at the beginning of a line". The regexp-source provides the string against which we'll try to match the pattern. In this case, it is the value of the datestring variable, which is the contents of the date.txt file from the previous step of the configuration file. Finally, the regexp-result portion determines what the result of the regular expression will be -- that is, what value it will feed back up to the next processor.

As you can see, inside of regexp-result we have another processor -- template. Template basically returns whatever is inside of it. So if you wrote <template>Test</template> the result would simply be the string "Test". However -- and this is the useful part -- anything enclosed inside ${ } will be evaluated in Javascript and the result of the Javascript will be injected into the template. So if you wrote <template>Today is the ${sys.datetime("dd")}th</template> you'd get back "Today is the 13th" (or whatever the current day is).

Web-Harvest defines a number of useful variables inside Javascript. One of these is _1, which is the value of the first matched group in a regular expression. Because the _1 in our template is enclosed in ${ } it is evaluated in Javascript and is replaced with the first matched group in the regular expression. So in this case, our template returns "2012".

Finally, the regexp processor returns the value of the regexp-result part, and the year variable gets set to "2012". (As you can see in the above screenshot.)

Here's a slightly more complicated example that uses a regular expression to reformat the date in US format. See if you can figure it out:
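
One possible version, as a sketch. The variable name usdate is hypothetical, and I am assuming the second and third groups are available as _2 and _3 in the same way that _1 is.

```xml
<var-def name="usdate">
    <regexp>
        <!-- Three groups: year (4 digits), month (2 digits), day (2 digits) -->
        <regexp-pattern>^(\d\d\d\d)(\d\d)(\d\d)</regexp-pattern>
        <regexp-source>
            <var name="datestring"/>
        </regexp-source>
        <!-- Reassemble the groups as month/day/year -->
        <regexp-result>
            <template>${_2}/${_3}/${_1}</template>
        </regexp-result>
    </regexp>
</var-def>
```

With "20121110" in date.txt, this should yield 11/10/2012.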

Fetching a Webpage

Now we'll return to fetching webpages. Look at the following script:
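
A sketch of that script based on the description that follows. The variable name webpage is borrowed from Part 3 and is an assumption here.

```xml
<?xml version="1.0" encoding="UTF-8"?>

<config>
    <!-- Read the date (YYYYMMDD) from date.txt -->
    <var-def name="datestring">
        <file action="read" path="date.txt"></file>
    </var-def>
    <!-- Fetch the scoreboard page; the date is filled in from datestring -->
    <var-def name="webpage">
        <http url="http://scores.espn.go.com/ncb/scoreboard?date=${datestring}"/>
    </var-def>
</config>
```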

As before, we read in the datestring from a file. And as in Part 1, we use the http processor to fetch a web page. But notice the url:

url="http://scores.espn.go.com/ncb/scoreboard?date=${datestring}"

The end of the URL is "${datestring}". In processor attributes, just as in the template processor, anything enclosed in ${ } is evaluated in Javascript. In this case, "${datestring}" is replaced with the value of the datestring variable -- which is the "20121110" we read from the date.txt file. So the resulting URL is "http://scores.espn.go.com/ncb/scoreboard?date=20121110". This leads (as you might have guessed) to the college basketball results from 11/10/2012.

Conclusion

We now have some basic tools for fetching and manipulating data. Next time we'll get to the real work of pulling information out of a web page.

(You can find Part 3 here.)

Tuesday, February 12, 2013

Another Tool: WebHarvest Tutorial (Part 1)

One of the more annoying tasks required to predict college basketball games is collecting the data about past games, both from the current season (to characterize the teams this year) as well as previous seasons (to train our prediction algorithms). Although this information is ubiquitous on the Internet (ESPN, Yahoo Sports, the NCAA, and others all offer it) it cannot be found in a form that is easy to use and obtain. Although it seems very "last decade", about the best you can do is to laboriously scrape the data from one of the sites that offers it.

"Web scraping" is the process of crawling over a web site, downloading web pages intended for human consumption, extracting information, and saving it in a machine-readable format. With the advent of the Web 2.0 and services-based architectures, web scraping has largely fallen into disuse, but it is still required/handy in situations such as this.

There are a number of web scraping tools available, with varying functionality and states of repair. Many are frameworks or libraries intended to be embedded in languages like Python. Others are commercial. For my purposes, I wanted a stand-alone, open-source tool with fairly powerful features and a GUI. I ended up settling on Web-Harvest. Web-Harvest is written in Java, so it can be run on nearly any platform, and can also be embedded into Java programs. Support is sketchy -- the last release is from 2010, and the documentation is terse -- but there is a moderately active user community.

In the rest of this blog post (or series) I'll describe how to use Web-Harvest as a stand-alone tool to scrape a website. (I won't cover how to use Web-Harvest from inside a Java program.)

Obtaining and Installing Web-Harvest

Installing Web-Harvest is trivial. Download the latest "Single self-executable JAR file" from the website here. This contains a single Jar file. Put that somewhere on your computer and then double-click on the Jar file. Presuming you have Java correctly installed, after a few moments the Web-Harvest GUI will pop up:

Notice that you can download and open some examples. Under the Help menu (or with F1) you'll find the Web-Harvest manual. You can also read this online here.

A Useful Note: Version 2.0 of Web-Harvest has a memory leak bug. This can cause the tool to use up all available memory and hang when downloading and processing a large number of web pages. (Say, a whole season's worth of basketball games :-) You can somewhat minimize this problem by starting Java with a larger memory allocation, using the "-Xms" and "-Xmx" options. How to do this will vary slightly depending upon your operating system and whether things are installed. On my Windows machine, I use a command line that looks something like this:

C:\WINDOWS\system32\javaw.exe -Xms1024m -Xmx1024m -jar "webharvest_all_2.jar"

On Windows you can create a shortcut and set the "Target" to be the proper command line. However, even with this workaround, Web-Harvest will eventually hang. The only choice then is to quit and restart.

Initial Set-Up

After you've downloaded and installed Web-Harvest, there are one or two things you should set before continuing. Open the Web-Harvest GUI as above, and on the Execution menu, select Preferences. This should open a form like this:

First of all, use this form to set an "Output Path". This is the folder (directory) where Web-Harvest will look for input files and write output files. (You can use absolute path names as well, but if you don't, this is where Web-Harvest will try to find things.) There's no way to change this within your Web-Harvest script, so if you need to change this for different scripts, you'll have to remember to do it here first before running your script.

Second, if you need to use a proxy, this is where you can fill in that information.

A Basic Configuration File

The script that tells Web-Harvest what to do is called a "configuration file" and is written in XML. Using XML might seem like an odd choice, and if you haven't used it before you may find it confusing. But it's very similar to HTML, and is very easy for a machine to read. Since much of what we'll be doing in web scraping is manipulating HTML, there's a sort of symmetry in writing our manipulation scripts in something that is very like HTML.

To get started on your first script (configuration file), click on the blank page icon in the upper left of the Web-Harvest GUI. You should see this:

The right side of this display is your configuration file. Right now it has just a skeleton. You'll fill in your actual screen scraping instructions between the <config> and </config> tags.

Now type the following inside the config tags:

There are three commands (what Web-Harvest calls "processors") in this configuration file: var-def, html-to-xml, and http. Reading these from the inside outwards, this is what they do:

The innermost processor, http, fetches the web page given in the url attribute -- in this case, the Google home page.
The next processor, html-to-xml, takes the web page, cleans it up a bit and converts it to XML.
The last processor, var-def, defines a new Web-Harvest variable named google and gives it the value of the XML returned by the html-to-xml processor.
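Put together, the three processors nest one inside the other. The following is only a minimal sketch reconstructed from the description above, not necessarily the exact listing; in particular, the URL is my assumption for the address of the Google home page.

```xml
<!-- Sketch only: this goes between the <config> and </config> tags. -->
<!-- The url value is an assumed address for the Google home page. -->
<var-def name="google">
    <!-- html-to-xml cleans up the fetched page and converts it to XML -->
    <html-to-xml>
        <!-- http fetches the page named in the url attribute -->
        <http url="http://www.google.com"/>
    </html-to-xml>
</var-def>
```

The nesting is what drives the order of execution: each processor hands its result to the processor that encloses it, so the innermost http runs first and var-def receives the final value.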

To see this in action, click the green "Run" arrow near the top of the GUI. Web-Harvest will whir through the script and give you a message that "Configuration 'Config 1' has finished execution." Click OK and you'll see this:

The top left pane now shows you a tree of all the processors in your script. The bottom pane shows a trace of the execution. Clicking on any of the processors in the top left pane will show you the results of that processor. For example, click on "http [1]" and you'll see this:

The middle left pane now shows you all the results of running that processor. For the http processor, this includes information such as the status code, the header values and so on. You can click the little magnifying glass on the right of each value to see the full value. Of special note is the information named "[Value]" at the bottom of the list -- this is the value that the processor returns to the next processor outwards. If you click on the magnifying glass next to this, you'll get a window showing this:

This is just the HTML for the Google home page -- it's what the http processor fetched from the Web. Try clicking on the "html-to-xml [1]" processor and see how the same web page looks encoded as XML. (Pretty much the same in this case.)

Conclusion

So far I've shown how to get Web-Harvest installed and to create a simple script to download a web page. Next time I'll go into some more detail about how to use the various Web-Harvest features to extract and save information from a web page.

You can continue with Part 2 here.

Sunday, February 10, 2013

Predictions Check

#2 Florida by 25 over Mississippi State

Florida won by 25.

Wisconsin by 1.5 over #3 Michigan

Wisconsin isn't ranked but should be. This game is at Wisconsin.

Wisconsin won by 3 in OT. Did the PM call the correct winner in this game? Ultimately, yes. But you could argue that with the game tied at the end of regulation, the prediction was off by -1.5 points, instead of +1.5 points.

#5 Kansas by 6 over Oklahoma.

Oklahoma by 6. Maybe I got the sign on the prediction wrong :-). I didn't see any of the game, but not a good loss.

#6 Gonzaga by 23 over Loyola-Marymount

Gonzaga is over-ranked at #6, but LMU is not good.

Gonzaga by 19.

#8 Miami (FL) by 3 over UNC

This is basically a coin-flip game. The polls will be expecting Miami to win and will punish them if they lose, but this would actually be a very solid win for them.

Miami won by 26 thanks to an onslaught of threes. There's no way in a predictor to account for something like Miami hitting 15 of 26 threes. Coach Larranaga (a personal favorite) is setting all sorts of records at Miami this season, including coaching the only ACC team outside of North Carolina to start the conference season 10-0, but talk of a #1 ranking is premature. After Saturday's games, the PM has Miami at #17. (Sorry, Jay Bilas!) The ACC is weaker than usual this year, and I'm dubious that Miami is in the same class as Florida/Indiana/Michigan. That said, I'm looking forward to seeing if Coach L can make some noise in the Tournament.

#11 Louisville by 5 over #25 Notre Dame

With 50 seconds to go, Louisville was up 5 on Notre Dame, and I tuned in to confirm the PM's prediction. Then Jerian Grant hit three straight threes and an old-fashioned three-point play to send the game to OT. Five OTs (!) later, Notre Dame won by 3. The Prediction Machine can handle games up to 7OT -- I figured one more than the Syracuse-UConn marathon would be sufficient -- but I was a little worried last night that I would have to extend the code.

The treatment of OT games -- and particularly whether they should be treated as "ties" -- is interesting and I'll probably do a post on that soon.

Saturday, February 9, 2013

Top Twenty (2/8)

Rank	Change	Team	Score
1	+1	Indiana	131
2	-1	Florida	130
3		Michigan	126
4	+1	Ohio St.	110
5	-1	Kansas	110
6	+2	Louisville	109
7		Syracuse	105
8	+4	Duke	105
9	-3	Gonzaga	104
10	+1	Pittsburgh	101
11	+2	Wisconsin	98.2
12	-2	Minnesota	97.8
13	+2	Arizona	93.7
14	+4	Oklahoma St.	92.3
15	-6	Creighton	90.2
16	+1	Iowa	90.1
17	-3	Cincinnati	89.6
18	-2	Kansas St.	88.7
19		VCU	87.4
20	+1	Michigan St.	86.7

Some shuffling of the deck chairs up at the top of the rankings, but thanks to the #1 teams swapping losses every week no one has really plummeted. But absolute strength scores took a hit -- from 144 at the top last week to only 131 this week. Duke (averaging +14 points over the last four games) and Oklahoma State (thanks to the win over #5 Kansas) both surge upwards, while Creighton plunges thanks to three losses in the last six games.

Predictions for Saturday's Top Ten games:

#2 Florida by 25 over Mississippi State
Wisconsin by 1.5 over #3 Michigan

Wisconsin isn't ranked but should be. This game is at Wisconsin.

#5 Kansas by 6 over Oklahoma.
#6 Gonzaga by 23 over Loyola-Marymount

Gonzaga is over-ranked at #6, but LMU is not good.

#8 Miami (FL) by 3 over UNC

This is basically a coin-flip game. The polls will be expecting Miami to win and will punish them if they lose, but this would actually be a very solid win for them.

#11 Louisville by 5 over #25 Notre Dame

Monday, February 4, 2013

Prediction Results

A follow-up to Saturday's predictions. Individual results don't mean much, but it's fun to look at them once in a while:

#3 Indiana by 8 over #1 Michigan

The line on this game was surprisingly low (4.5), maybe because of partisan betting, or because people don't understand home court advantage (HCA). Anyway, the PM was dead-on, and Indiana won by 8.

#2 Kansas by 9.5 over Oklahoma St.

Oklahoma St. by 5 - a bad loss for Kansas.

#4 Florida by 15 over #16 Ole Miss

Another good call by the PM: Florida by 14. The PM, I think, rightly rates Florida much higher than the polls.

#5 Duke by 7 over Florida State

Duke by 19.

Pittsburgh by 2.5 over #6 Syracuse

Pittsburgh by 10 -- the PM gets the "upset" correct. This will be viewed as a bad loss for Syracuse, but really shouldn't have been surprising.

#7 Gonzaga by 18 over San Diego

Gonzaga by 2, with a last-second shot required to avoid what would have been a big upset.

#8 Arizona by 8 over Washington St

Arizona by 14.

#9 Butler by 15 over Rhode Island

Butler by 7. Not a good showing at home against 4-10 Rhode Island.

#10 Oregon by 3 over California

California by 4.

#11 tOSU by 13 over Nebraska

tOSU by 7.

#19 NC State by 3 over #14 Miami (FL)

Miami (FL) by 1. The PM almost calls what would have been a big upset. (It did get the game right against the line, at least.) Coach Larranaga is doing a nice job in his second year.

#15 Wichita State by 4 over UNI

UNI by 5. A nice win for UNI, who has been inconsistent this season, but sometimes very good.

#17 Missouri by 19 over Auburn

Missouri by 14. Close enough :-)

Oklahoma by 1/2 over #18 Kansas St

Kansas State by 2. This looked to be the closest game in the Top Twenty and delivered.

#20 New Mexico by 13 over Nevada

New Mexico by 20. The PM rounds out the Top Twenty with another bullseye.

So overall the PM went 10-5 and called 5 games within 5 points.

Friday, February 1, 2013

PM Top Twenty (2/1)

#	Move	Team	Score
1		Florida	144
2	+1	Indiana	132
3	+1	Michigan	130
4	-2	Kansas	124
5		Ohio St.	112
6	+3	Gonzaga	106
7	-1	Syracuse	106
8	-1	Louisville	105
9	-1	Creighton	102
10		Minnesota	100
11	+3	Pittsburgh	99.95
12	-1	Duke	99.3
13	-1	Wisconsin	98.1
14	-1	Cincinnati	95.1
15	+3	Arizona	93.4
16	+4	Kansas St.	91.8
17	-2	Iowa	89.7
18	-1	Oklahoma St.	89.1
19	-3	VCU	87.4
20	NR	N.C. State	85.9

Florida remains in the #1 spot this week, with 30-point wins this week over Mississippi State and South Carolina. A bigger test comes up on Saturday when they face Ole Miss. Ole Miss is #16 in the AP poll but has dropped to #44 in the PM's ratings, so it should be a routine win for Florida and the last week for Ole Miss in the AP Top Twenty.

The big winner this week is Kansas State, jumping four places even though they lost last Saturday to Iowa State. They're benefiting from their great strength of schedule. Pittsburgh, Arizona and Gonzaga make significant leaps with strong wins or good losses. Pittsburgh is a good example of the latter -- losing to Louisville by only 3 at Louisville is a strong positive.

The big loser this week is VCU, tumbling three spots after a bad loss to La Salle at home.

Predictions for Saturday's Top Twenty-Five games:

#3 Indiana by 8 over #1 Michigan
#2 Kansas by 9.5 over Oklahoma St.
#4 Florida by 15 over #16 Ole Miss
#5 Duke by 7 over Florida State
Pittsburgh by 2.5 over #6 Syracuse
#7 Gonzaga by 18 over San Diego
#8 Arizona by 8 over Washington St
#9 Butler by 15 over Rhode Island
#10 Oregon by 3 over California
#11 tOSU by 13 over Nebraska
#19 NC State by 3 over #14 Miami (FL)
#15 Wichita State by 4 over UNI
#17 Missouri by 19 over Auburn
Oklahoma by 1/2 over #18 Kansas St
#20 New Mexico by 13 over Nevada

The power of the home court is evident in the Indiana, Pittsburgh, and NC State games.