
Keyword searches as a Web command line

Andre mentions dumping Google Chrome because of its lack of extension support, especially Ubiquity, and lists 15 useful Ubiquity commands.

If you haven’t seen Ubiquity, you should. It’s a great extension that transforms your browser into an Internet command prompt. It is modelled on the Enso Launcher, which is a great piece of work by itself.

I wasn’t quite prepared to let go of Chrome that easily. On Task Manager, seeing 10 Chrome processes, the largest of which takes up 60MB, is a lot more comforting, psychologically, than 1 Firefox process taking up 300MB. (I rarely hit my 1GB RAM limit, so it shouldn’t matter either way. Yet, the spendthrift in me keeps watching.)

So the question is, can I do all the items on his list without using Ubiquity?

Let’s pick the easiest. Google search. If you typed "g some words" on Ubiquity, you get the Google search results for "some words". But you already have that. If you have Firefox, typing any words on the address bar automatically does a Google search for you. On Internet Explorer, it uses a different search engine, but you can easily change that by installing the Google Toolbar.

But the great thing is that this can be customised. On Firefox, click on the down arrow icon next to the search box and select "Manage Search Engines…" to see a list of your search engines. Select the one you want to use, click on "Edit Keyword…" and select the keyword you want. For instance, I’ve typed "google".

[Screenshots: Manage Search Engines; Add Keywords]

So when I type "google some words" on the address bar (not the search bar, the address bar) I get search results for "some words". These are called keyword searches.

On Firefox, you can add your own search engines too, and you do that using bookmarks. Press Ctrl-Shift-B (Organize Bookmarks) and create a New Bookmark. You can type in any URL in the location field. If you type "%s" as part of the URL, that will be replaced by the search string. So for instance, a bookmark whose location is a Wikipedia search URL containing "%s", with the keyword "wiki", will do a Wikipedia search for "Harry Potter" if you type "wiki Harry Potter" on the address bar.
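The "%s" substitution is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not how any browser implements it: the function name expand_keyword, the BOOKMARKS table and the example.com URLs are all made up for the example.

```python
from urllib.parse import quote_plus

# Hypothetical keyword table: keyword -> URL template containing %s
BOOKMARKS = {
    "wiki": "https://example.com/wiki?search=%s",
    "google": "https://example.com/search?q=%s",
}

def expand_keyword(typed, bookmarks=BOOKMARKS):
    """Turn 'keyword some words' into the bookmark URL with %s filled in."""
    keyword, _, query = typed.partition(" ")
    template = bookmarks.get(keyword)
    if template is None:
        return None  # not a keyword search; the browser falls back to its default
    # URL-encode the search string before dropping it into the template
    return template.replace("%s", quote_plus(query))

print(expand_keyword("wiki Harry Potter"))
```

Anything typed without a known keyword returns None here, which stands in for the browser’s normal address-bar behaviour.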

It works on Internet Explorer as well, even with version 6. The easiest way is to download TweakUI. Go to Internet Explorer – Search. Click on the Create button. Type in a keyword (called Prefix) and a URL. If you type "%s" as part of the URL, that will be replaced by the search string.


On Google Chrome, get to the Options (what, no shortcut key?) and in the Basics tab, click the Manage button. Here, you can click on "Add" to add a search engine.


So that takes care of all the basic searches: Google, Amazon, IMDB, Wikipedia, etc.

Can we go further? Item 8 on the list caught my attention:

Twit. As much as I love full-featured Twitter clients like TweetDeck, nothing beats the simplicity of hitting Ctrl-Space and typing twit [message] to so_and_so, or sending a selection of text using twit this to so_and_so. At the moment, there’s no way to receive tweets or ping Twitter for new messages.

I don’t use Twitter, but I do use Identi.ca, and I would like something like this. Right now, I’m using Google Talk to update Identi.ca. Two problems. I don’t like chatting, and logging on exposes me to a lot of distraction. Secondly, I’d rather not have to open an application just for this. Something in the browser would be perfect. But is it possible? Identi.ca (and Twitter, and most micro-blogging services) lets you update via e-mail. So if I could write a program that would send that mail, I should be done. So I did that with a Perl script.

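use CGI;
my $q = CGI->new;
open OUT, "|/usr/sbin/sendmail -t" or die "Cannot run sendmail: $!";
print OUT join "\n",
    "To: ...",           # the service's posting address goes here
    "Subject: \n\n",
    $q->param("q");      # assumption: the message arrives as the "q" parameter
close OUT;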

So if I placed this at some URL on my site (no, I haven’t placed it there), I just need to create a keyword search with the prefix "identica" that points to that URL, with "%s" standing in for the message. Then I can type "identica Here is a message that I want to post" on the address bar, and it gets posted.

Actually, if you can write your own programs, the possibilities are endless. If you’re looking for someone to host this sort of thing for free, Google’s AppEngine may be a reasonable point to start.

But the real power of this comes with Javascript. Those URLs that you saw for keyword searches? Those can be Javascript URLs. So item 9 on the list

Word count. As a student of copywriting, I’m frequently curious about an article’s word length. Highlighting the desired text and entering word count into Ubiquity will give you just that.

… might just be possible.

It’s easy to get the selection. The following snippet gives you the current selection. (Tested in IE 5.5 – 8, Firefox 3 and Google Chrome. Should work for Opera, Safari.)

document.selection ? document.selection.createRange().text :
window.getSelection ? window.getSelection().toString() : ""

To get the word count, just split by white space, and count the results:

s = document.selection ? document.selection.createRange().text :
    window.getSelection ? window.getSelection().toString() : "";
alert(s.split(/\s+/).length + " words")

Now, this whole thing can be made into a keyword search. Let’s call it count. If I go to the address bar and type "count it", I want to count the words in the selection. If I typed "count some set of words here", I want to count the words in "some set of words here". Here’s how to do that.

javascript:var s = "%s";
if (s == "it") {
  s = document.selection ? document.selection.createRange().text :
      window.getSelection ? window.getSelection().toString() : "";
}
alert(s.split(/\s+/).length + " words");

Now, put all of this in one line and add it as your keyword search. Try it!

(Note: You need to replace { curly braces } with %7B and %7D in Google Chrome. It interprets curly braces as a special command. Also, Chrome replaces spaces with a +, so the word count will always return 1 if you search for "count some set of words here".)
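The dispatch logic ("it" means the selection, anything else is the text itself) can be sketched outside the browser too. The Python below is an illustration only: count_words is a made-up name, the selection is passed in as a plain string, and treating "+" as a separator is an assumption added to sidestep the Chrome behaviour described in the note.

```python
import re

def count_words(argument, selection=""):
    """Count words in the selection if the argument is 'it', else in the argument."""
    text = selection if argument == "it" else argument
    # Assumption: treat '+' like whitespace, since Chrome substitutes it for spaces
    words = [w for w in re.split(r"[\s+]+", text) if w]
    return len(words)

print(count_words("some set of words here"))   # 5
print(count_words("some+set+of+words+here"))   # 5
print(count_words("it", "two words"))          # 2
```

Unlike the one-line split above, this version drops empty strings, so an empty selection counts as 0 words rather than 1.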

You could use selections to search as well. If you wanted to Google your selection, just use:

javascript:var s = "%s";
if (s == "it") {
  s = document.selection ? document.selection.createRange().text :
      window.getSelection ? window.getSelection().toString() : "";
}
location.replace("" + s)

Typing "google it" will search for your selected words on Google. "google some words" will search for "some words" on Google.

I’ve configured these keyword searches on my browser to:

  • Share sites. Typing "share google" adds the page to Google Reader, "share delicious" posts it to Delicious, "share digg" diggs the page, etc.
  • Send mail from the address bar. Typing "mail sub:This is the subject. Rest of the message" in the address bar will send the mail out. (Of course, you need to have created a mail gateway. I’ll try and share this shortly.)
  • Add entries to my calendar. Typing "remind Prepare dinner at 8pm" adds a reminder to my calendar to prepare dinner at 8pm.
  • Highlight parts of a page. Typing "highlight it" highlights what I’ve selected on the page. Even after I remove the selection, the highlighting stays. Typing "highlight some phrase" highlights all occurrences of "some phrase" in the entire document. The colours change every time you use it on a page, so you can search for multiple words and see how they’re distributed.
  • Replace tables with charts. Typing "chart it" with a table selected replaces the table with a chart. Typing "chart it as pie" or "chart it as scatter" changes the chart type.

You could actually take any bookmarklet and convert it into a keyword search. Which means that practically anything you can do in Javascript can be converted into a command-line-like syntax on the address bar.

So there it is! You can pretty much have a web command line. I wonder if we could add UNIX-pipes-like functionality.

Caching pages on Apache

I don’t use any blogging software for my site. I just hand-wired it some years ago. When doing this, one of the biggest problems was caching.

Consider each blog entry page. Each page has the same template, but different content. Both the template and content could be changed. So ideally, blog pages should be served dynamically. That is, every time someone requests the page, I should look up the content, look up the template, and put them together.

I did that, and within a few days outgrew my hosting service’s CPU usage limit. Running such a program for every page hit is too heavy on the CPU.

One way around this is to create the pages beforehand and serve them as regular HTML. But every time the template changes, you need to re-generate every single page. I had over 2,500 pages. That would kill the CPU usage if I changed the template often.

At that point, I did a piece of analysis. Do I really need to regenerate all 2000 blog entries? Wouldn’t the 80-20 rule apply? The Apache log confirmed that 20% of the URLs were accounting for 76% of the hits. So I’d be wasting my time regenerating all the pages every time I changed the template.

Graph: 20% of URLs account for 76% of hits
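This kind of check is easy to reproduce. The Python below is a sketch of the calculation on a toy list of requested URLs; it is not the script used for the graph, and share_of_hits and the sample data are invented for illustration.

```python
from collections import Counter

def share_of_hits(urls, top_fraction=0.2):
    """Fraction of all hits that go to the most-requested `top_fraction` of URLs."""
    counts = Counter(urls)
    ranked = sorted(counts.values(), reverse=True)
    # Number of distinct URLs that make up the "top" group (at least one)
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)

# Toy log: 5 distinct URLs, one of them very popular
hits = ["/a.html"] * 6 + ["/b.html", "/c.html", "/d.html", "/e.html"]
print(share_of_hits(hits))  # 0.6
```

On a real Apache log you would first extract the request path from each line; that parsing step is left out here.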

So based on this, I decided to dynamically cache the pages. When a page is requested for the first time, I create the page and save it in a cache. The next time, I’d just serve it from the cache. If the template changes, I just need to delete the cache. This way, I only generate pages that are requested, and they’re only generated once.
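The generate-on-first-request idea can be sketched independently of Apache. The Python below is an illustration under stated assumptions, not the Perl script described in this post: PageCache, render and the file layout are hypothetical names chosen for the example.

```python
import os
import tempfile

class PageCache:
    """Generate a page on first request, then serve the saved copy."""

    def __init__(self, cache_dir, render):
        self.cache_dir = cache_dir
        self.render = render  # function: page name -> HTML string

    def get(self, name):
        path = os.path.join(self.cache_dir, name + ".html")
        if not os.path.exists(path):
            # First request: build the page and save it in the cache
            with open(path, "w", encoding="utf-8") as f:
                f.write(self.render(name))
        with open(path, encoding="utf-8") as f:
            return f.read()

    def clear(self):
        """Delete every cached page; call this when the template changes."""
        for filename in os.listdir(self.cache_dir):
            if filename.endswith(".html"):
                os.remove(os.path.join(self.cache_dir, filename))

calls = []  # records how often the expensive render step actually runs

def render(name):
    calls.append(name)
    return "<html><body>Entry: %s</body></html>" % name

cache = PageCache(tempfile.mkdtemp(), render)
cache.get("hello")
cache.get("hello")
print(len(calls))  # 1: the second request came from the cache
cache.clear()
cache.get("hello")
print(len(calls))  # 2: regenerated after the cache was cleared
```

Only pages that are actually requested ever get rendered, which is the point of the 80-20 observation.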

OK, so that’s the background. Now let me get to how I did it.

I wrote a Perl script that would generate a page in the html folder whenever it is called. Next, I changed Apache’s .htaccess to run this program only if the page did not exist in the html folder.

# Redirect to cache first
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^([^/]*)\.html$       html/$1.html

# If not found, run program to create page
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^html/([^/]*)\.html$  ...$1

The first block redirects Apache to the cache. The second block checks if the file exists in the cache. If it doesn’t, Apache redirects to the program. The program creates the page in the cache and displays it. Thereafter, Apache will just serve the file from the cache.

This Apache trick can be used in another way. I keep files organised in different folders to simplify my work. But to visitors of this site, that organisation is irrelevant. So I effectively merge these folders into one. For example, I have a folder called a in which I keep my static content. I also have this piece of code:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^([^/]+)$   a/$1

If any file is not found in the main folder, just check in the a/ folder. So I can access the file /a/hindholam.midi as /hindholam.midi as well.

This can be extended to a series of folders: either as a cascade of caches, or to merge many folders into one.
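The lookup order behind such a cascade can be written out explicitly. The Python below only models the "try each folder in turn" logic; it is not a replacement for the rewrite rules, and resolve and the temporary folders are made up for the example.

```python
import os
import tempfile

def resolve(filename, folders):
    """Return the first existing path for filename, checking folders in order."""
    for folder in folders:
        candidate = os.path.join(folder, filename)
        if os.path.isfile(candidate):
            return candidate
    return None  # not found anywhere: the server would answer with a 404

# Toy setup: a main folder with an 'a' sub-folder holding the static file
main = tempfile.mkdtemp()
static = os.path.join(main, "a")
os.mkdir(static)
open(os.path.join(static, "hindholam.midi"), "w").close()

found = resolve("hindholam.midi", [main, static])
print(found == os.path.join(static, "hindholam.midi"))  # True
```

Adding more folders to the list corresponds to adding more RewriteCond/RewriteRule pairs.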

JPath – XPath for Javascript

XPath is a neat way of navigating deep XML structures. It’s like using a directory structure. /table//td gets all the TDs somewhere below TABLE.

Usually, you don’t need this sort of a thing for data structures, particularly in JavaScript. Plain property access would already work. But sometimes, it does help to have something like XPath even for data structures, so I built a simple XPath-like processor for Javascript called JPath.

Here are some examples of how it would work:

jpath(context, "para") returns context.para
jpath(context, "*") returns all values of context (for both arrays and objects)
jpath(context, "para[0]") returns context.para[0]
jpath(context, "para[last()]") returns context.para[context.para.length - 1]
jpath(context, "*/para") returns context[all children].para
jpath(context, "/doc/chapter[5]/section[2]") returns context.doc.chapter[5].section[2]
jpath(context, "chapter//para") returns all para elements inside context.chapter
jpath(context, "//para") returns all para elements inside context
jpath(context, "//olist/item") returns all olist.item elements inside context
jpath(context, ".") returns the context
jpath(context, ".//para") same as //para
jpath(context, "//para/..") returns the parent of all para elements inside context

Some caveats:

  • This is an implementation of the abbreviated syntax of XPath. You can’t use axis::nodetest
  • No functions are supported other than last()
  • Only node name tests are allowed, no nodetype tests. So you can’t do text() and node()
  • Indices are zero-based, not 1-based
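To make the idea concrete, here is a rough Python sketch of a much smaller subset of the same syntax: name steps, "*" and "//" only, with no predicates, no last() and no "..". It is not the JPath implementation, just an illustration of how such a path can be walked over nested dicts and lists; all function names are invented for the example.

```python
import re

def children(node):
    """(key, value) pairs of a dict or list; scalars have no children."""
    if isinstance(node, dict):
        return list(node.items())
    if isinstance(node, list):
        return [(str(i), v) for i, v in enumerate(node)]
    return []

def descendants(node):
    """The node itself followed by everything nested inside it."""
    yield node
    for _, child in children(node):
        yield from descendants(child)

def jpath(context, path):
    """Return a list of the values matched by a very small XPath-like path."""
    nodes = [context]
    # Each step is an optional separator ("/" or "//") followed by a name or "*"
    for sep, name in re.findall(r"(//?)?([^/]+)", path):
        if sep == "//":
            nodes = [d for n in nodes for d in descendants(n)]
        nodes = [v for n in nodes for k, v in children(n)
                 if name == "*" or k == name]
    return nodes

result = {"feed": {"title": "Digg", "entries": [
    {"title": "A", "categories": ["x", "y"]},
    {"title": "B", "categories": ["z"]},
]}}
print(jpath(result, "//title"))              # ['Digg', 'A', 'B']
print(jpath(result, "//categories/*"))       # ['x', 'y', 'z']
print(jpath([[1, 2, 3], [4, 5, 6]], "*/1"))  # [2, 5]
```

List indices are matched as zero-based strings, in line with the last caveat above.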

There are a couple of reasons why this sort of thing is useful.

  • Extracting attributes deep down. Suppose you had an array of arrays, and you wanted one element from each array (say, the one at index 1).
    [Image: Column Selection]
    You could do this the long way:
    for (var list=[], i=0; i < data.length; i++) { list.push(data[i][1]); }

    ... or the short way:

    $.map(data, function(v) {
        return v[1];
    });

    But the best would be something like:

    jpath(data, "//1")
  • Ragged data structures. Take for example the results from Google's AJAX feed API.
    {"responseData": {
     "feed": {
      "title": "Digg",
      "link": "",
      "author": "",
      "description": "Digg",
      "type": "rss20",
      "entries": [
       {
        "title": "The Pirate Bay Moves Servers to Egypt Due to Copyright Laws",
        "link": "",
        "author": "",
        "publishedDate": "Mon, 31 Mar 2008 23:13:33 -0700",
        "contentSnippet": "Due to the new copyright legislation that are going ...",
        "content": "Due to the new copyright legislation that are going to take...",
        "categories": [ ... ]
       },
       {
        "title": "Millions Dead/Dying in Recent Mass-Rick-Rolling by YouTube.",
        "link": "",
        "author": "",
        "publishedDate": "Mon, 31 Mar 2008 22:53:30 -0700",
        "contentSnippet": "Click on any \u0022Featured Videos\u0022. When will the insanity stop?",
        "content": "Click on any \u0022Featured Videos\u0022. When will the insanity stop?",
        "categories": [ ... ]
       }
      ]
     }
    }, "responseDetails": null, "responseStatus": 200}

    If you wanted all the title entries, including the feed title, the choice is between:

    var titles = [ result.feed.title ];
    for (var i=0, l=result.feed.entries.length; i<l; i++) {
        titles.push(result.feed.entries[i].title);
    }

    ... versus...

    titles = jpath(result, '//title');

    If, further, you wanted the list of all categories at one shot, you could use:

    jpath(result, "//categories/*")

In search of a good editor

It’s amazing how hard it is to get a good programming editor. I’ve played around with more editors/IDEs than I care to remember: e Text Editor, Notepad++, NoteTab, SciTE, Crimson Editor, Komodo, Eclipse, Aptana.

There are four features that are critical to me.

  • Syntax highlighting. Over time, I’ve found this to increase readability dramatically. Look at this piece of code with and without syntax highlighting:
    [Image: Syntax Highlighting]
    Doesn’t the structure of the document just jump out with syntax highlighting? Anyway, I’ve gotten used to that.
  • Column editing. I want to be able to do this:
    [Image: Column Editing]
    Being able to type across rows is incredibly useful. I use it both for programming as well as to complement data-processing on Excel.
  • Unicode support. I often work with non-ASCII files, particularly in Tamil. Unicode support comes in handy when debugging pages for my songs site.
  • Auto-completion. This is 10 times more productive than having to look up the manual for each function.

(Oh, and it’s got to be free too. Except for e Text Editor, all the others qualify.)

The problem is, none of the editors that I’ve looked at support all of these features.

Editor         | Syntax highlighting | Column editing | Unicode support | Auto-completion
e Text Editor  | Yes                 | Yes            | No              | Yes
Crimson Editor | Yes                 | Yes            | No              | No
Notepad++      | Yes                 | No             | Yes             | No
NoteTab-Lite   | No                  | No             | No              | No
SciTE          | Yes                 | No             | Yes             | Yes
TextPad        | Yes                 | No             | Yes             | No
UltraEdit      | Yes                 | No             | No              | ?
Aptana         | Yes                 | No             | Yes             | Yes
Eclipse        | Yes                 | No             | Yes             | Yes
Komodo         | Yes                 | No             | Yes             | Yes

Wikipedia has a more in-depth comparison of text editors.

Actually, there’s another parameter that’s pretty important: responsiveness. When I type something, I want to see it on the screen. Right that millisecond. With some of the features added by these editors, there’s so much bloat that it often takes up to one second between the keypress and the refresh. That’s just not OK.

I’ve settled on Crimson Editor as my default editor these days, simply because it’s quick and has column editing. (Column editing on e Text Editor is a bit harder to use.) When I am writing Unicode, I switch over to Notepad++. For large programs, I’m leaning towards Komodo right now, largely because Eclipse is bloated and Aptana was slow. (Komodo is slow too. Maybe I’ll switch back.)

There are many other things on my “would love to have” list, like regular-expression search and replace, line sorting, code folding, brace matching, word wrapping, etc. Most of those, though, are either not too important, or most editors already have them.

Well, there’s the sad thing. I’ve been hunting for a good text editor for over 10 years now. May someone write a lightweight IDE with column editing.

Automating Internet Explorer with jQuery

Most of my screen-scraping so far has been through Perl (typically WWW::Mechanize). The big problem is that it doesn’t support Javascript, which can often be an issue:

  • The content may be Javascript-based. For example, Amazon shows the bestseller book list only if you have Javascript enabled. So if you’re scraping the Amazon main page for the books bestseller list, you won’t get it from the static HTML.
  • The navigation may require Javascript. Instead of links or buttons in forms, you might have Javascript functions. Many pages use these, and not all of them degrade gracefully into HTML. (Try using Google Video without Javascript.)
  • The login page uses Javascript. It creates some crazy session ID, and you need Javascript to reproduce what it does.
  • You might be testing a Javascript-based web-page. This was my main problem: how do I automate testing my pages, given that I make a lot of mistakes?

There are many approaches to overcoming this. The easiest is to use Win32::IE::Mechanize, which uses Internet Explorer in the background to actually load the page and do the scraping. It’s a bit slower than scraping just the HTML, but it’ll get the job done.

Another is to use Rhino. John Resig has written env.js that mimics the browser environment, and on most simple pages, it handles the Javascript quite well.

I would rather have a hybrid of both approaches. I don’t like the WWW::Mechanize interface. I’ve gotten used to jQuery’s rather powerful selectors and chainability. So I’ll tell you a way of using jQuery to screen-scrape offline using Python. (It doesn’t have to be Python. Perl, Ruby, Javascript… any scripting language that can use COM on Windows will work.)

Let’s take Google Video. Currently, it relies almost entirely on Javascript. The video marked in red below appears only if you have Javascript.

The left box showing the top video uses Javascript

I’d like an automated way of checking what video is on top on Google Video every hour, and save the details. Clearly a task for automation, and clearly not one for pure HTML-scraping.

I know the video’s details are stored in elements with the following IDs (thanks to XPath checker):

ID               | What’s there
hs_title_link    | Link to the video
hs_duration_date | Duration and date
hs_ratings       | Ratings. The stars indicate the rating and the span.Votes element inside it has the number of people who rated it.
hs_site          | The site that hosts the video
hs_description   | Short description

So I could do the following on Win32::IE::Mechanize.

use Win32::IE::Mechanize;
my $ie = Win32::IE::Mechanize->new( visible => 1 );
my @links = $ie->links;
# ... then what?

I could go through each link to extract the hs_title_link, but there’s no way to get the other stuff.

Instead, we could take advantage of a couple of facts:

  • Internet Explorer exposes a COM interface. That’s what Win32::IE::Mechanize uses. You can use it in any scripting language (Perl, Ruby, Javascript, …) on Windows to control IE.
  • You can load jQuery on to any page. Just add a <script> tag pointing to jQuery. Then, you can call jQuery from the scripting language!

Let’s take this step by step. This Python program opens IE, loads Google Video and prints the text.

# Start Internet Explorer
import win32com.client
ie = win32com.client.Dispatch("InternetExplorer.Application")
# Display IE, so you'll know what's happening
ie.visible = 1
# Go to Google Video
ie.Navigate("http://video.google.com/")
# Wait till the page is loaded
from time import sleep
while ie.Busy: sleep(0.2)
# Print the contents
# Watch out for Unicode
print ie.document.body.innertext.encode("utf-8")

The next step is to add jQuery to the Google Video page.

# Add the jQuery script to the browser
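def addJQuery(browser, url="..."):  # url: where a copy of jQuery is hosted
    document = browser.document
    window = document.parentWindow
    head = document.getElementsByTagName("head")[0]
    script = document.createElement("script")
    script.type = "text/javascript"
    script.src = url
    # Attach the script to the page so that the browser loads it
    head.appendChild(script)
    # Wait till jQuery is available
    while not window.jQuery: sleep(0.1)
    return window.jQuery
jQuery = addJQuery(ie)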

Now the variable jQuery contains the Javascript jQuery object. From here on, you can hardly tell if you’re working in Javascript or Python. Below are the expressions (in Python!) to get the video’s details.

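# Video title: "McCain's YouTube Problem ..."
jQuery("#hs_title_link").text()
# Title link: '/videoplay?docid=1750591377151076231'
jQuery("#hs_title_link").attr("href")
# Duration and date: '3 min - May 18, 2008 - '
jQuery("#hs_duration_date").text()
# Rating: 5.0
jQuery("#hs_ratings img").length
# Number of ratings '(8,288 Ratings) '
jQuery("#hs_ratings span.Votes").text()
# Site: 'Watch this video on'
jQuery("#hs_site").text()
# Video description
jQuery("#hs_description").text()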

This wouldn’t have worked out as neatly in Perl, simply because you’d need to use -> instead of . (dot). With Python (and with Ruby and Javascript on cscript), you can almost cut-and-paste jQuery code.

If you want to click on the top video link, use:


In addition, you can use the keyboard as well. If you want to type username TAB password, use this:

shell = win32com.client.Dispatch("WScript.Shell")
shell.SendKeys("username{TAB}password")

You can use any of the arrow keys, control keys, etc. Refer to the SendKeys Method on MSDN.

Statistically improbable phrases on Google AppEngine

I read about Google AppEngine early this morning, and applied for an invite. Google’s issuing beta invites to the first 10,000 users. I was pretty convinced I wasn’t among those, but turns out I was lucky.

AppEngine lets you write web apps that Google hosts. People have been highlighting that it gives you access to the Google File System and BigTable for the first time. But to me, that isn’t a big deal. (I’m not too worried about reliability, and MySQL / flat files work perfectly well for me as a data store.)

What’s more interesting is that, unlike Amazon’s EC2 and S3, this is free up to a certain quota. And you get a fair bit of processing power and bandwidth for free. One of the reasons I’ve held back on creating some apps was simply because it would take away too much bandwidth / CPU cycles from my site. (I’ve had this problem before.) Google’s quota is 10 GB of bandwidth per day (which is about 30 times what my site uses). And this is on Google’s incredibly fast servers. It also offers 200 million megacycles a day. That’s like a dedicated 2.3 GHz processor (200 million megacycles = 200,000 GHz x 1 second ~ 2.3 GHz x 86,400 seconds/day). Better, in fact, because this is the average capacity, not peak capacity. The only restriction that really worries me is that only 3 apps are allowed per developer.
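The megacycle arithmetic is worth spelling out, since the units are easy to trip over. This is just the calculation from the paragraph above written as Python; the only inputs are the quota figure quoted there and the number of seconds in a day.

```python
MEGACYCLES_PER_DAY = 200e6       # the daily quota quoted above
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

# 1 megacycle = 1e6 cycles, so the quota is 2e14 cycles per day
cycles_per_day = MEGACYCLES_PER_DAY * 1e6
# Spread evenly over the day and expressed in GHz (1e9 cycles per second)
average_ghz = cycles_per_day / SECONDS_PER_DAY / 1e9
print(round(average_ghz, 2))  # 2.31
```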

So I decided to give a shot at publishing some code I’d kept in reserve for a long time. You may remember my statistical analysis of Calvin & Hobbes. For this, I’d created a script in Perl that could generate Statistically Improbable Phrases (SIPs) for any text. This is based on (a somewhat limited) 23MB corpus of ebooks that I had. I’d wanted to put that up on my website, but …

AppEngine only uses Python. So the first task was to get Python, and then to learn Python. The only saving grace was that I was just cutting-and-pasting most of the time. Google wasn’t helping:

Google AppEngine Over Quota Error

Anyway, the site is up for now. Just type a URL, and it’ll tell you the improbable words in that site.


Technical notes

I realise that these are statistically improbable words, not phrases. I’ll get to the phrases in a while.

The logic is simple:

  • Get the frequency of words in a corpus. I pre-generated this file. It has over 100,000 words.
  • Get the URL as text. Rather than muck around with Python, I decided to use the W3 html2txt service.
  • Convert the text to words. Splitting text into words is tricky. For now, I’m simply assuming that any group of letters is a word, and anything that’s not a letter is a word delimiter.
  • Find the relative frequency (improbability) of words. This is the frequency in the URL divided by the frequency in the corpus, normalised (i.e. scale it so that the maximum value is 1.0).
  • Create a tag cloud. I use the word frequency as the size and the improbability as the colour. You need a bit of mathematical jugglery to get the pattern right. Right now, I’m taking the 6th root of the improbability and the logarithm of the frequency to get a reasonably smooth tag cloud.
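The steps above can be sketched in Python roughly as follows. This is not the code running on the site: the function names, the tiny corpus, lower-casing the text, the fallback frequency for words missing from the corpus and the "1 +" in the size formula are all assumptions made to keep the example self-contained.

```python
import math
import re
from collections import Counter

def improbable_words(text, corpus_freq, unseen=1e-7):
    """Map each word to (count, improbability), with improbability scaled to max 1.0."""
    # Any run of letters is a word; lower-casing is an assumption of this sketch
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    # Relative frequency in the text divided by relative frequency in the corpus;
    # `unseen` is an assumed floor for words the corpus has never seen
    ratio = {w: (c / total) / corpus_freq.get(w, unseen) for w, c in counts.items()}
    top = max(ratio.values())
    return {w: (counts[w], r / top) for w, r in ratio.items()}

def tag_style(count, improbability):
    """Size from the log of the frequency, colour intensity from the 6th root."""
    return 1 + math.log(count), improbability ** (1 / 6)

# Hypothetical corpus: word -> relative frequency
corpus = {"the": 0.05, "tiger": 0.00001, "said": 0.005}
scores = improbable_words("The tiger said the tiger", corpus)
for word, (count, improbability) in sorted(scores.items(), key=lambda kv: -kv[1][1]):
    print(word, count, tag_style(count, improbability))
```

In this toy run "tiger" gets improbability 1.0 because it is common in the text but rare in the corpus, while "the" ends up near zero.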

The source code is at

Update: 12-Apr-2008. I’ve added some interactivity. You can play with the contrast and font size, and filter out common or infrequent words.

Update: 22-Apr-2008. Added concordance. You can click on a word and see the context in which it appears.

Chaining functions in Javascript

One of the coolest features of jQuery is the ability to chain functions. The output of a function is the calling object. So instead of writing:

var a = $("<div></div>");

… I can instead write:


A reasonable number of predefined Javascript functions can be used this way. I make extensive use of it with the String.replace function.

But where this feature is not available, you can create it in a fairly unobtrusive way. Just add this code to your script:

Function.prototype.chain = function() {
    var that = this;
    return function() {
        // New function runs the old function
        var retVal = that.apply(this, arguments);
        // Returns "this" if old function returned nothing
        if (typeof retVal == "undefined") { return this; }
        // else returns old value
        else { return retVal; }
    };
};
var chain = function(obj) {
    for (var fn in obj) {
        if (typeof obj[fn] == "function") {
            obj[fn] = obj[fn].chain();
        }
    }
    return obj;
};

Now, chain(object) returns the same object, with all its functions replaced with chainable versions.
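The same trick carries over to other languages. As a point of comparison, here is a sketch of the idea in Python, where a method that returns nothing returns None: wrap each method so that None becomes the object itself. The Searcher class is a hypothetical stand-in for an API object with setter methods, not a real library.

```python
import functools

def chainable(method):
    """Wrap a method so it returns self whenever it would have returned None."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        return self if result is None else result
    return wrapper

def chain(cls):
    """Class decorator: make every public method of cls chainable."""
    for name, attr in list(vars(cls).items()):
        if callable(attr) and not name.startswith("_"):
            setattr(cls, name, chainable(attr))
    return cls

@chain
class Searcher:  # hypothetical stand-in for an API object with setters
    def __init__(self):
        self.options = {}
    def set_site(self, site):
        self.options["site"] = site
    def set_size(self, size):
        self.options["size"] = size
    def execute(self, query):
        return (query, self.options)

print(Searcher().set_site("example.com").set_size(8).execute("Harry Potter"))
```

Methods that already return a value, like execute here, are left alone, just as in the Javascript version.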

What’s the use? Well, take the Google AJAX search API. Normally, to search for the top 8 “Harry Potter” PDFs on a particular site, I’d have to do:

    var searcher = new;
    searcher.execute("Harry Potter");

Instead, I can now do this:

.execute("Harry Potter");

(On the whole, it’s probably not worth the effort. Somehow, I just like code that looks like this.)

Javascript error logging

If something goes wrong with my site, I like to know of it. My top three problems are:

  1. The site is down
  2. A page is missing
  3. Javascript isn’t working

This is the last of 3 articles on these topics.

I am a bad programmer

I am not a professional developer. In fact, I’m not a developer at all. I’m a management consultant. (Usually, it’s myself I’m trying to convince.)

Since no one pays me for what little code I write, no one shouts at me for getting it wrong. So I have a happy and sloppy coding style. I write what I feel like, and publish it. I don’t test it. Worse, sometimes, I don’t even run it once. I’ve sent little scripts off to people which wouldn’t even compile. I make changes to this site at midnight, upload it, and go off to sleep without checking if the change has crashed the site or not.

But no one tells me so

At work, that’s usually OK. On the few occasions where I’ve written Perl scripts or VB Macros that don’t work, people call me back within a few hours, very worried that THEY’d done something wrong. (Sometimes, I don’t contradict them.)

On my site, I don’t always get that kind of feedback. People just click the back button and go elsewhere.

Recently, I’ve been doing more Javascript work on my site than writing stuff. Usually, the code works for me. (I write it for myself in the first place.) But I end up optimising for Firefox rather than IE, and for the plugins I have, etc. When I try the same app a few months later on my media PC, it doesn’t work, and shockingly enough, no one’s bothered telling me about it all these months. They’d just click, nothing happens, they’d vanish.

But their browsers can tell me

The good part about writing code in Javascript is that I can catch exceptions. Any Javascript error can be trapped. So since the end of last year, I’ve started wrapping almost every Javascript function I write in a try {} catch() {} block. In the catch block, I send a log message reporting the error.

The code looks something like this:

function log(e, msg) {
    for (var i in e) { msg += i + "=" + e[i] + "\n"; }
    (new Image()).src="" + encodeURIComponent(msg);
}

function abc() {
    try {
    // ... function code
    } catch(e) { log(e, "abc"); }
}

Any time there’s an error in function abc, the log function is called. It sends the function name ("abc") and the error details (the contents of the error event) to a logging script on my server, which stores the error, along with details like the URL, browser, time and IP address. This way, I know exactly what error occurs where.

This is fantastic for three reasons.

  • It tells me when I’ve goofed up. This is instantaneous feedback. I don’t have to wait for a human. If you run my program on your machine, and it fails, I get to know immediately. (Well, as soon as I read the error log, at least.)
  • It tells me where I’ve goofed up. The URL and the function name clearly indicate the point of failure.
  • It tells me why I’ve goofed up. Almost. Using the browser name and the error message, I can invariably pinpoint the reason for the error. Then it’s just a matter of taking the time to fix it.

I’d think this sort of error reporting should be the norm for any software. At least for a web app, given how easy it is to implement.
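The pattern is not specific to Javascript. As an illustration, here is a sketch of the same "catch, tag with the function name, report" idea as a Python decorator. The LOG_URL endpoint is hypothetical, and instead of making a network request the report function just collects the URLs that would have been requested.

```python
import functools
from urllib.parse import quote

LOG_URL = "https://example.com/log?msg="  # hypothetical logging endpoint
sent = []  # stands in for the network: URLs that would have been requested

def report(url):
    sent.append(url)

def logged(func):
    """Report any exception raised by func, tagged with the function's name."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            msg = "%s\n%s: %s" % (func.__name__, type(e).__name__, e)
            report(LOG_URL + quote(msg))
    return wrapper

@logged
def abc():
    return 1 / 0  # deliberately broken

abc()
print(sent[0])
```

The decorator swallows the exception after reporting it, which mirrors the try/catch wrapper above; whether that is acceptable depends on the application.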

Scraping RSS feeds using XPath

If a site doesn’t have an RSS feed, your simplest option is to use Page2Rss, which gives a feed of what’s changed on a page.

My needs, sometimes, are a bit more specific. For example, I want to track new movies on the IMDb Top 250. They don’t offer a feed. I don’t want to track all the other junk on that page. Just the top 250.

There’s a standard called XPath. It can be used to search in an HTML document in a pretty straightforward way. Here are some examples:

//a Matches all <a> links
//p/b Matches all <b> bold items in a <p> para. (the <b> must be immediately under the <p>)
//table//a Matches all links inside a table (the links need not be immediately inside the table — anywhere inside the table works)

You get the idea. It’s like a folder structure. / matches a tag that’s immediately below. // matches a tag that’s somewhere below. You can play around with XPath using the Firefox XPath Checker add-on. Try it: it’s much easier to try it than to read the documentation.
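If you prefer to experiment from a script, Python’s standard library supports a limited subset of XPath that covers the three examples above. The snippet below is a sketch on a tiny hand-written, well-formed page (real HTML usually needs a more forgiving parser); the page content is invented for the example.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring("""
<html><body>
  <p>Some <b>bold</b> text and <i><b>nested bold</b></i></p>
  <table><tr><td><span><a href="/one">One</a></span></td></tr></table>
  <a href="/two">Two</a>
</body></html>
""")

# ElementTree paths are relative to an element, hence the leading "."
print(len(page.findall(".//a")))                      # 2: all links
print([b.text for b in page.findall(".//p/b")])       # ['bold']: <b> directly under <p>
print([a.text for a in page.findall(".//table//a")])  # ['One']: link anywhere in the table
```

Note how the nested bold text is skipped by //p/b because its <b> sits inside an <i>, not directly under the <p>.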

The following XPath matches the IMDb Top 250 exactly.

//tr//tr//tr/td[3]//a


(It’s a link inside the 3rd column in a table row in a table row in a table row.)

Now, all I need is to get something that converts that to an RSS feed. I couldn’t find anything on the Web, so I wrote my own XPath server. The URL:

When I subscribe to this URL on Google Reader, I get to know whenever there’s a new movie on the IMDb Top 250.

This gives only the names of the movies, though, and I’d like the links as well. The XPath server supports this. It accepts a root XPath, and a bunch of sub-XPaths. So you can say something like:

xpath=//tr//tr//tr title->./td[3]//a link->./td[3]//a/@href

This says three things:

//tr//tr//tr Pick all rows in a row in a row
title->./td[3]//a For each row, set the title to the link text in the 3rd column
link->./td[3]//a/@href … and the link to the link href in the 3rd column
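The root-plus-sub-XPaths idea can be sketched with Python’s standard library. This is an illustration, not the XPath server’s code: extract_items and the sample table are invented, and because ElementTree has no /@href step, the attribute is read with .get() instead.

```python
import xml.etree.ElementTree as ET

def extract_items(root, row_path, title_path, link_path):
    """For each node matching row_path, pull a title (text) and a link (href)."""
    items = []
    for row in root.findall(row_path):
        title = row.find(title_path)
        link = row.find(link_path)
        # Rows without a matching cell (e.g. header rows) are skipped
        if title is not None and link is not None:
            items.append({"title": title.text, "link": link.get("href")})
    return items

page = ET.fromstring("""
<table>
  <tr><td>1</td><td>9.1</td><td><a href="/title/one">Movie One</a></td></tr>
  <tr><td>2</td><td>9.0</td><td><b><a href="/title/two">Movie Two</a></b></td></tr>
  <tr><td>Rank</td><td>Rating</td></tr>
</table>
""")

for item in extract_items(page, ".//tr", "./td[3]//a", "./td[3]//a"):
    print(item)
```

Each dictionary corresponds to one feed entry; turning the list into RSS is then a matter of templating.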

That provides a more satisfactory RSS feed — one that I’ve subscribed to, in fact. Another one that I track is a list of mininova top seeded movies category.

You can whip up more complex examples. Give it a shot. Start simple, with something that works, and move up to what you need. Use XPath Checker liberally. Let me know if you have any issues. Enjoy!