Wednesday, December 12, 2012

Another Tool: Emacs

The tool chain I use for these experiments in predicting NCAA basketball games consists primarily of four tools: Web Harvest, Emacs, SBCL (an implementation of Common Lisp), and RapidMiner. I have written previously on this blog about RapidMiner, and today I'm going to touch on Emacs.

Most people use editors in a fairly static way.  They figure out how to do the editing tasks they need as some combination of commands and keystrokes, and that becomes part of their editing knowledge.  For example, suppose you're the sort of typist who regularly transposes letters, writing "the" as "teh" and "and" as "adn".  You'll fairly quickly figure out how to fix that problem -- backspace, backspace and retype, or maybe mark with a mouse, delete and retype -- and that becomes part of your editing knowledge.

A well-designed editor ensures that most of what you need to do is efficient.  The most common tasks have easy keystrokes and so on.  Of course, the editor is designed for some general, idealized user, who may be very like you in some ways and very different from you in other ways.  So it's likely that much of what you do in your favorite editor is efficient, but some of it is very inefficient and repetitive.

Emacs takes a different philosophy -- probably because it grew out of a community of programmers.

I recently read a blog posting saying that the essence of programming is to avoid repetition.  Programmers go to great lengths creating subroutines, libraries and sometimes whole new programming languages just to avoid mindless repetition of some task.  The ideal programmer spends all his time creating unique solutions to unique problems -- everything else is automated.  Emacs captures this same philosophy in an editor.  Emacs is designed so that it can be customized/programmed by the user in powerful and flexible ways, so that the user spends all his time being "maximally" productive.  (Expert Emacs users do so much customization of the editor that a big topic is how to best manage the customizations!)

To return to the above example, if you're a typist who often transposes letters and Emacs is your editor, your response is to customize/program Emacs to take care of your transposing problem more efficiently.  For example, you'd likely define a key to reverse the transposition of the previous two characters, so that you could just hit that key and fix the problem whenever it occurred.

(As it happens, Emacs already has this capability -- it's C-t, which runs transpose-chars -- but you see the point.)
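If Emacs lacked such a command, a user-defined one might look like the minimal sketch below. The command name `my-swap-previous-two-chars` and the `C-c t` binding are illustrative choices for this example, not anything from my actual configuration.

```elisp
;; Hypothetical command: the name and key binding are illustrative only.
(defun my-swap-previous-two-chars ()
  "Swap the two characters immediately before point."
  (interactive "*")
  ;; Do nothing unless at least two characters precede point.
  (when (>= (- (point) (point-min)) 2)
    (let ((older (char-before (1- (point))))   ; character two positions back
          (newer (char-before)))               ; character just before point
      (delete-char -2)
      (insert newer older))))

;; `C-c' followed by a letter is reserved for user bindings.
(global-set-key (kbd "C-c t") #'my-swap-previous-two-chars)
```

With this loaded, typing "teh" and pressing C-c t leaves "the" in the buffer.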

This philosophy changes the way you use your editor -- it becomes a kind of Swiss Army knife tool for solving all text-related tasks (and often other types of tasks as well).  For example, in the basketball predictor I scrape scores and other statistics from the Web and use them in both Common Lisp and in RapidMiner.  For use in Common Lisp, it is convenient to have the data in a list format like this:
("Miss. Valley St" 40 18 70 4 29 9 14 16 42 70 "2012-12-10")
For use in RapidMiner (and Excel) it is more convenient to have the data in a comma-separated values format like this:
"Miss. Valley St", 40, 18, 70, 4, 29, 9, 14, 16, 42, 70, "2012-12-10"
It's not difficult to translate from one to the other, but it is boring and repetitious.  The natural response in Emacs is to automate the task.
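As a rough illustration of what that automation can look like when written as Emacs Lisp, the list-to-CSV direction can be sketched with the Lisp reader. The function names `my-lisp-line-to-csv` and `my-lisp-buffer-to-csv` are made up for this example, and the sketch assumes every line is a flat list containing only strings and numbers.

```elisp
(defun my-lisp-line-to-csv (line)
  "Convert LINE, the printed form of a flat Lisp list, to a CSV row.
Assumes the elements are only strings and numbers."
  (mapconcat #'prin1-to-string          ; strings keep their double quotes
             (car (read-from-string line))
             ", "))

(defun my-lisp-buffer-to-csv ()
  "Convert every non-blank line of the current buffer to a CSV row."
  (interactive "*")
  (save-excursion
    (goto-char (point-min))
    (while (not (eobp))
      (let ((line (buffer-substring (line-beginning-position)
                                    (line-end-position))))
        ;; Skip lines that are empty or contain only whitespace.
        (unless (string-match-p "\\`[ \t]*\\'" line)
          (delete-region (line-beginning-position) (line-end-position))
          (insert (my-lisp-line-to-csv line))))
      (forward-line 1))))

;; (my-lisp-line-to-csv "(\"Miss. Valley St\" 40 18 \"2012-12-10\")")
;; => "\"Miss. Valley St\", 40, 18, \"2012-12-10\""
```

One caveat: `prin1-to-string` escapes a double quote inside a string with a backslash, which is not the CSV convention, so team names containing quotes would need extra handling.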

Emacs has a variety of ways to do this. (If you know Emacs, this won't surprise you!) One of the simplest is a "keyboard macro". You tell Emacs you want to define a keyboard macro, and then you start editing. When you tell it you're done, it captures all the editing you did in between and allows you to repeat that with a single keystroke. In this case, I would start a keyboard macro, go through all the editing necessary to convert one line of my data file from one format to the other, and then end the macro. Then I could go to the next line and tell Emacs to "execute the keyboard macro" and -- voila! -- that line would get the same editing. It takes some practice and thought to create an editing sequence that will do the right thing when it is repeated on the next line, but this turns out to be a very powerful and handy feature.

One of the drawbacks of the keyboard macro is that it disappears when you end your editing session.  So it's not useful per se for a task that you're going to want to repeat another day on a different file.  Fortunately, Emacs provides a way to save a keyboard macro in a format that looks like this:

(fset 'fix-sched
   [escape ?x ?r ?e ?p ?l ?a ?c ?e ?- ?r ?e ?g ?e ?x ?p return ?^ return ?\( ?\" return escape ?< escape ?x ?r ?e ?p ?l ?a ?c ?e ?- ?r ?e ?g ?e ?x ?p return ?  ?* ?, return ?\" ?  ?\" return escape ?< escape ?r ?\" ?+ return return escape ?< escape ?r ?\" ?- return return escape ?< escape ?x ?r ?e ?p ?l ?a ?c ?e ?- ?r ?e ?g ?e ?x ?p return ?$ return ?\) return])
That's exactly what it looks like -- a literal transcription of the keystrokes of the macro. In this form, it can be saved in your Emacs configuration file so that the next time you start up Emacs it will be available for reuse.
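For reference, the usual route to such a definition is to record the macro, give it a name with `M-x name-last-kbd-macro`, and then run `M-x insert-kbd-macro` in the configuration file to insert the definition as Lisp. The same kind of definition can also be written by hand in the more readable `kbd` notation. The macro below is a made-up example (it is not the `fix-sched` macro above): it wraps the current line in parentheses and moves down one line.

```elisp
;; Hypothetical macro, written by hand rather than recorded.
;; C-a: go to start of line      (: insert an open paren
;; C-e: go to end of line        ): insert a close paren
;; C-n: move down one line so the macro can be repeated.
(fset 'my-wrap-line-in-parens
      (kbd "C-a ( C-e ) C-n"))
```

Once this is loaded, `M-x my-wrap-line-in-parens` runs the macro, and it can be bound to a key like any other command.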

At a more complex level, you can program Emacs using a form of Lisp.  You can use this to create arbitrary functionality.  For example, if Emacs didn't provide a way to save a keyboard macro, you could program that yourself.  This allows you to build functionality that isn't easy to capture in a keyboard macro.  For example, here's an Emacs function I wrote for fixing a certain type of score file:
(defun fix-scores ()
  "Fix the scores from Marsee."
  (interactive "*")
  ;; With only a format argument, format-time-string uses the current time.
  (let ((dt (format-time-string current-date-format-marsee)))
    ;; Match "TEAM SCORE TEAM SCORE ..." and rewrite the line as a Lisp list.
    (replace-regexp
     "^\\([A-Za-z \\&]+[A-Za-z]\\)\\s +\\([0-9]+\\)\\s +\\([A-Za-z \\&]+[A-Za-z]\\)\\s +\\([0-9]+\\).*$"
     (concat "(\"" dt "\" \"\\1\" \\2 \"\\3\" \\4)"))))
Without going into the gory details, you can see that part of this function stamps each rewritten line with the current date in a more desirable format, using an Emacs function called "format-time-string". Emacs Lisp is a full programming language, so if you're a good programmer you can extend Emacs in almost unlimited ways.
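The Emacs documentation describes `replace-regexp` as meant for interactive use and suggests a `re-search-forward` / `replace-match` loop in Lisp code instead. A sketch of `fix-scores` in that style follows. The name `my-fix-scores`, the variable `my-score-date-format`, its `%Y-%m-%d` value, and the optional TIME argument are all assumptions made for this example, since the value of `current-date-format-marsee` isn't shown above.

```elisp
;; Hypothetical stand-in for `current-date-format-marsee'.
(defvar my-score-date-format "%Y-%m-%d"
  "Date format used when stamping score lines (assumed for this sketch).")

(defun my-fix-scores (&optional time)
  "Rewrite lines of the form \"TEAM SCORE TEAM SCORE\" as Lisp lists.
TIME supplies the date stamp and defaults to the current time."
  (interactive "*")
  (let ((dt (format-time-string my-score-date-format time)))
    (save-excursion
      (goto-char (point-min))
      ;; Same regexp as `fix-scores': two team names, each followed by a score.
      (while (re-search-forward
              "^\\([A-Za-z \\&]+[A-Za-z]\\)\\s +\\([0-9]+\\)\\s +\\([A-Za-z \\&]+[A-Za-z]\\)\\s +\\([0-9]+\\).*$"
              nil t)
        ;; The second argument t keeps `replace-match' from adjusting letter case.
        (replace-match
         (concat "(\"" dt "\" \"\\1\" \\2 \"\\3\" \\4)")
         t)))))
```

Run on a buffer containing a line such as `Duke  78  North Carolina  70` on 2012-12-10, this would leave `("2012-12-10" "Duke" 78 "North Carolina" 70)`.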

I have more to say on Emacs, but this posting has gotten fairly long so I'll leave further thoughts to another day.

Thursday, December 6, 2012

Some Recent Papers

Reviews of some recent papers on sports prediction.
Forecasting in the NBA and Other Team Sports: Network Effects in Action
PEDRO O. S. VAZ DE MELO, VIRGILIO A. F. ALMEIDA, and ANTONIO A. F. LOUREIRO,
Universidade Federal de Minas Gerais
CHRISTOS FALOUTSOS, Carnegie Mellon University
This paper looks at predicting the overall season performance of NBA teams (won-loss record) based upon features having to do with the team's year-to-year composition, such as "team volatility", "team inexperience," and so on.  (The authors call these features "network effects" because they model the NBA as a network of nodes representing players & coaches, with network links representing business relationships like "played for" or "played with".)  The model does surprisingly well at predicting season performance when compared against a variety of other models.

From the viewpoint of predicting NCAA basketball games, this work has limited applicability.  First of all, these authors are predicting the outcome of the entire season, not individual games.  Second, the nature of the NBA -- with the most important players having 10+ year careers and often changing teams -- makes the year-to-year movement of players more relevant than in the NCAA game.  On the other hand, any predictive value this information has seems likely to be orthogonal to the information from past game performances, which would be valuable.

A network-based ranking system for US college football
Juyong Park and M. E. J. Newman
Department of Physics and Center for the Study of Complex Systems,
University of Michigan, Ann Arbor, MI 48109
This paper ranks college football teams by calculating a score based upon "total win score" and "total loss score". The total win score is the sum of the team's total wins plus the total win score of all the opponents it beat (discounted by a constant factor). Total loss score is calculated in a similar way, and the final score is total win score minus total loss score.
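In symbols, the description above can be written as follows. The notation is chosen here for illustration and is not necessarily the authors': W_i is the list of opponents that team i beat (one entry per win), L_i is the list of opponents it lost to, and alpha is the constant discount factor.

```latex
\begin{align*}
w_i    &= \sum_{j \in W_i} \left(1 + \alpha\, w_j\right)    && \text{(total win score)} \\
\ell_i &= \sum_{j \in L_i} \left(1 + \alpha\, \ell_j\right) && \text{(total loss score)} \\
s_i    &= w_i - \ell_i                                       && \text{(final score)}
\end{align*}
```

Because each team's score depends on the scores of its opponents, these equations form a linear system that has to be solved for all teams at once.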

This approach is similar to systems like infinitely deep RPI, or Govan ratings, although the former uses win percentage rather than wins and losses, and the latter uses points scored/allowed.  This approach seems to do fairly well at ranking (the authors didn't use it for prediction) and may be worth trying for college basketball.

Are Sports Betting Markets Prediction Markets? Evidence from a New Test
Kyle J. Kain and Trevon D. Logan
This paper looks at the predictive value of point spreads and over/under lines from bookmakers on NFL, NBA, NCAA college football, and NCAA college basketball games from 2004-2010.  Without delving into the details, the bottom line from the paper is:
Our joint tests revealed that while the betting line is an accurate predictor of the margin of victory, the over/under is a poor predictor of the sum of scores in a contest.
I suspect this is because over/under is much more difficult to predict.  But this suggests that if you're out to beat the bookmakers, you might want to focus your efforts on predicting over/under rather than margin of victory.

Using ELO ratings for match result prediction in association football
Lars Magnus Hvattum, Halvard Arntzen
This paper applies the Elo rating system to association football and compares it to various other predictors. Vanilla Elo uses just the match outcome, but the authors modified the algorithm to use the score differential as well. Performance was on par with other statistical predictors, but it did not beat the oddsmakers.
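For readers unfamiliar with Elo, the standard update (the "vanilla" version, not the authors' score-differential variant) can be sketched as below. The function names, the default K of 20, and the 400-point scale are conventional choices assumed for this example rather than values taken from the paper.

```elisp
(defun my-elo-expected (rating-a rating-b)
  "Return the expected score (between 0 and 1) of A against B."
  (/ 1.0 (+ 1.0 (expt 10.0 (/ (- rating-b rating-a) 400.0)))))

(defun my-elo-update (rating-a rating-b score-a &optional k)
  "Return a list of the new ratings of A and B after one match.
SCORE-A is 1.0 for a win by A, 0.5 for a draw, and 0.0 for a loss.
K is the update factor and defaults to 20 (an assumed value)."
  (let* ((k (or k 20.0))
         ;; A gains K times the difference between actual and expected score.
         (delta (* k (- score-a (my-elo-expected rating-a rating-b)))))
    ;; Whatever A gains, B loses.
    (list (+ rating-a delta) (- rating-b delta))))

;; Two evenly rated teams, and the first one wins:
;; (my-elo-update 1500 1500 1.0)  ; => (1510.0 1490.0)
```

A variant that uses the score differential would typically make K depend on the margin of victory. The exact form the authors used isn't reproduced here.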