Scraping RSS feeds using XPath

If a site doesn't have an RSS feed, your simplest option is to use Page2Rss, which gives a feed of what's changed on a page.

My needs, sometimes, are a bit more specific. For example, I want to track new movies on the IMDb Top 250. They don't offer a feed. I don't want to track all the other junk on that page. Just the top 250.

There's a standard called XPath. It can be used to search in an HTML document in a pretty straightforward way. Here are some examples:

//a          Matches all <a> links
//p/b        Matches all <b> bold items in a <p> para. (the <b> must be immediately under the <p>)
//table//a   Matches all links inside a table (the links need not be immediately inside the table -- anywhere inside the table works)

You get the idea. It's like a folder structure. / matches a tag that's immediately below. // matches a tag that's somewhere below. You can play around with XPath using the Firefox XPath Checker add-on. Try it -- it's much easier to try it than to read the documentation.
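If you'd rather poke at this from a script, here's a minimal sketch of the three example paths above using Python's standard library. The page content is made up for illustration. Two caveats: ElementTree only understands a subset of XPath and wants relative paths (so //a becomes .//a), and it needs well-formed markup, which real-world HTML often isn't.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, kept well-formed so the XML parser accepts it.
PAGE = """
<html><body>
  <p>Some <b>bold</b> text and <i>italic <b>nested bold</b></i>.</p>
  <table>
    <tr><td><a href="/one">One</a></td></tr>
    <tr><td><span><a href="/two">Two</a></span></td></tr>
  </table>
  <a href="/three">Three</a>
</body></html>
"""

root = ET.fromstring(PAGE)

# //a : every link, anywhere in the page
all_links = [a.text for a in root.findall(".//a")]

# //p/b : only <b> tags sitting immediately under a <p>
direct_bold = [b.text for b in root.findall(".//p/b")]

# //table//a : links at any depth inside a table
table_links = [a.text for a in root.findall(".//table//a")]

print(all_links)
print(direct_bold)
print(table_links)
```

The "nested bold" item is skipped by .//p/b because it sits under an <i>, not directly under the <p>. The second link is still picked up by .//table//a even though it's wrapped in a <span>.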

The following XPath matches the IMDb Top 250 exactly.

//tr//tr//tr//td[3]//a

(It's a link inside the 3rd column in a table row in a table row in a table row.)

Now, all I need is to get something that converts that to an RSS feed. I couldn't find anything on the Web, so I wrote my own XPath server. The URL:

sanand.110mb.com/xpath.php?
url=http://www.imdb.com/chart/top&
xpath=//tr//tr//tr//td[3]//a
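If you're building that URL in code rather than typing it, it's safer to percent-encode the two parameters, since the XPath contains slashes and square brackets. A small sketch follows. The http:// scheme and the feed_url helper name are my assumptions here; only the host, path and the url and xpath parameter names come from the URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the server is reached over plain http at the host and path shown above.
ENDPOINT = "http://sanand.110mb.com/xpath.php"

def feed_url(page_url, xpath):
    """Return the feed URL with both query parameters percent-encoded."""
    return ENDPOINT + "?" + urlencode({"url": page_url, "xpath": xpath})

url = feed_url("http://www.imdb.com/chart/top", "//tr//tr//tr//td[3]//a")
print(url)

# Decoding the query string gives back the original values.
decoded = parse_qs(urlsplit(url).query)
print(decoded["xpath"][0])
```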

When I subscribe to this URL on Google Reader, I get to know whenever there's a new movie on the IMDb Top 250.

This gives only the names of the movies, though, and I'd like the links as well. The XPath server supports this. It accepts a root XPath, and a bunch of sub-XPaths. So you can say something like:

xpath=//tr//tr//tr title->./td[3]//a link->./td[3]//a/@href

This says three things:

//tr//tr//tr                Pick all rows in a row in a row
title->./td[3]//a           For each row, set the title to the link text in the 3rd column
link->./td[3]//a/@href      ... and the link to the link href in the 3rd column
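To make the root-plus-sub-XPath idea concrete, here's a rough sketch of how such a lookup could work. This is not the server's actual code, just an illustration under some assumptions: the sample page, the extract_items and to_rss helpers, and the (sub-XPath, attribute) field format are all made up. ElementTree doesn't support /@href, so the attribute name is passed separately. The sample has a single table, so the root path is simply .//tr.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a chart page (well-formed, one table).
PAGE = """
<html><body><table>
  <tr><td>1.</td><td>9.1</td><td><a href="/movie/first">First Film</a></td></tr>
  <tr><td>2.</td><td>9.0</td><td><a href="/movie/second">Second Film</a></td></tr>
</table></body></html>
"""

def extract_items(page, root_xpath, fields):
    """Apply root_xpath to the page, then each field's sub-XPath to every match.

    fields maps a name to (sub_xpath, attribute); attribute None means "use the text".
    """
    items = []
    for node in ET.fromstring(page).findall(root_xpath):
        item = {}
        for name, (sub_xpath, attribute) in fields.items():
            match = node.find(sub_xpath)
            if match is None:
                continue  # this row has no such element; skip the field
            if attribute:
                item[name] = match.get(attribute)
            else:
                item[name] = "".join(match.itertext())
        items.append(item)
    return items

def to_rss(feed_title, feed_link, items):
    """Serialize the items as a bare-bones RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = feed_title
    for item in items:
        entry = ET.SubElement(channel, "item")
        for name, value in item.items():
            ET.SubElement(entry, name).text = value
    return ET.tostring(rss, encoding="unicode")

items = extract_items(PAGE, ".//tr", {
    "title": ("./td[3]//a", None),    # title->./td[3]//a
    "link": ("./td[3]//a", "href"),   # link->./td[3]//a/@href
})
feed = to_rss("Top movies", "http://example.com/chart", items)
print(feed)
```

Each row matched by the root path becomes one feed item, and each sub-XPath is evaluated relative to that row. That's why the sub-XPaths start with a dot.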

That provides a more satisfactory RSS feed -- one that I've subscribed to, in fact. Another one that I track is a list of new popular movies that make it to the mininova top seeded movies category.

You can whip up more complex examples. Give it a shot. Start simple, with something that works, and move up to what you need. Use XPath Checker liberally. Let me know if you have any issues. Enjoy!

Written on 17 Dec 2007 | alternate titles: Scraping RSS feeds using XPath, XPath server



S Anand, Infosys Consulting, London UK. +44 7957 440 260