Scraping RSS feeds using XPath

If a site doesn’t have an RSS feed, your simplest option is to use Page2Rss, which gives a feed of what’s changed on a page.

My needs, sometimes, are a bit more specific. For example, I want to track new movies on the IMDb Top 250. They don’t offer a feed. I don’t want to track all the other junk on that page. Just the top 250.

There’s a standard called XPath. It can be used to search in an HTML document in a pretty straightforward way. Here are some examples:

//a	Matches all <a> links
//p/b	Matches all <b> bold items in a <p> para. (the <b> must be immediately under the <p>)
//table//a	Matches all links inside a table (the links need not be immediately inside the table — anywhere inside the table works)

You get the idea. It’s like a folder structure: / matches a tag that’s immediately below, and // matches a tag that’s anywhere below. You can play around with XPath using the Firefox XPath Checker add-on. Try it: experimenting is much easier than reading the documentation.
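To see the difference between / and // without a browser add-on, here is a minimal Python sketch. It uses the standard library’s xml.etree.ElementTree, which understands only a subset of XPath and needs well-formed markup, so the paths are written relative to the root (.//a rather than //a). The sample page is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real-world HTML is often not
# well-formed and would need a more lenient parser than ElementTree.
PAGE = """
<html><body>
  <p><b>bold</b> and <i><b>nested bold</b></i></p>
  <a href="/outside">outside</a>
  <table><tr><td><span><a href="/inside">inside</a></span></td></tr></table>
</body></html>
"""

root = ET.fromstring(PAGE)

# //a : every link, anywhere in the document
all_links = [a.text for a in root.findall(".//a")]

# //p/b : only <b> elements that are direct children of a <p>
direct_bold = [b.text for b in root.findall(".//p/b")]

# //table//a : links at any depth inside a table
table_links = [a.text for a in root.findall(".//table//a")]

print(all_links)
print(direct_bold)
print(table_links)
```

The "nested bold" text is skipped by //p/b because its <b> sits inside an <i>, not directly under the <p>.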

The following XPath matches the IMDb Top 250 exactly.

//tr//tr//tr//td[3]//a

(It’s a link inside the 3rd column in a table row in a table row in a table row.)

Now, all I need is to get something that converts that to an RSS feed. I couldn’t find anything on the Web, so I wrote my own XPath server. The URL:

www.s-anand.net/xpath?
url=http://www.imdb.com/chart/top&
xpath=//tr//tr//tr//td[3]//a

When I subscribe to this URL on Google Reader, I get to know whenever there’s a new movie on the IMDb Top 250.
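The URL above is split across lines for readability. In practice the url and xpath values should be percent-encoded before they go into the query string. A small sketch of building such a URL follows; the feed_url helper and the http:// scheme on the endpoint are assumptions for illustration, not part of the server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def feed_url(page_url, xpath, endpoint="http://www.s-anand.net/xpath"):
    """Build a feed URL with both parameters percent-encoded.

    The endpoint default (including its scheme) is an assumption.
    """
    return endpoint + "?" + urlencode({"url": page_url, "xpath": xpath})

url = feed_url("http://www.imdb.com/chart/top", "//tr//tr//tr//td[3]//a")
print(url)

# Decoding the query string gives back the original values.
params = parse_qs(urlsplit(url).query)
print(params["xpath"][0])
```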

This gives only the names of the movies, though, and I’d like the links as well. The XPath server supports this. It accepts a root XPath, and a bunch of sub-XPaths. So you can say something like:

xpath=//tr//tr//tr title->./td[3]//a link->./td[3]//a/@href

This says three things:

//tr//tr//tr	Pick all rows in a row in a row
title->./td[3]//a	For each row, set the title to the link text in the 3rd column
link->./td[3]//a/@href	… and the link to the link href in the 3rd column
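One way a root XPath plus named sub-XPaths could be turned into feed items is sketched below. This is not the server’s actual code: the function names and the sample table are hypothetical, and because ElementTree cannot select attributes, the trailing /@href is handled by hand.

```python
import xml.etree.ElementTree as ET

def parse_spec(spec):
    """Split 'ROOT name->SUBPATH ...' into (root_xpath, {name: subpath})."""
    root_xpath, *fields = spec.split()
    return root_xpath, dict(f.split("->", 1) for f in fields)

def select(node, path):
    """Evaluate a sub-XPath relative to node; a trailing /@attr reads an attribute."""
    attr = None
    if "/@" in path:
        path, attr = path.rsplit("/@", 1)
    found = node.find(path)
    if found is None:
        return None
    return found.get(attr) if attr else "".join(found.itertext()).strip()

def extract_items(xml_text, spec):
    """Return one dict per node matched by the root XPath."""
    root_xpath, fields = parse_spec(spec)
    tree = ET.fromstring(xml_text)
    items = []
    # ElementTree wants a relative path, so //tr becomes .//tr
    for row in tree.findall("." + root_xpath):
        item = {name: select(row, sub) for name, sub in fields.items()}
        if all(item.values()):  # skip rows where a field is missing
            items.append(item)
    return items

# Hypothetical sample: rows nested three tables deep, title in the 3rd column.
SAMPLE = """
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td>1.</td><td>9.2</td><td><a href="/title/tt0000001/">First Movie</a></td></tr>
      <tr><td>2.</td><td>9.1</td><td><a href="/title/tt0000002/">Second Movie</a></td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""

SPEC = "//tr//tr//tr title->./td[3]//a link->./td[3]//a/@href"
items = extract_items(SAMPLE, SPEC)
for item in items:
    print(item["title"], item["link"])
```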

That provides a more satisfactory RSS feed, one that I’ve subscribed to, in fact. Another one that I track is the top seeded list in mininova’s movies category.
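For completeness, here is a rough sketch of how a list of title/link pairs could be wrapped into an RSS 2.0 document. The items_to_rss function and the sample data are hypothetical; the server’s real output format is not shown in this post.

```python
import xml.etree.ElementTree as ET

def items_to_rss(feed_title, feed_link, items):
    """Wrap dicts with 'title' and 'link' keys in a minimal RSS 2.0 feed."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, link and description on the channel.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Generated from " + feed_link
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")

# Made-up items for illustration.
sample_items = [
    {"title": "First Movie", "link": "http://www.imdb.com/title/tt0000001/"},
    {"title": "Second Movie", "link": "http://www.imdb.com/title/tt0000002/"},
]
feed = items_to_rss("IMDb Top 250", "http://www.imdb.com/chart/top", sample_items)
print(feed)
```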

You can whip up more complex examples. Give it a shot. Start simple, with something that works, and move up to what you need. Use XPath Checker liberally. Let me know if you have any issues. Enjoy!

7 thoughts on “Scraping RSS feeds using XPath”

Mark
December 17, 2007 at 12:00 pm

Have you ever thought about introducing authentication to the XPath server? I would like to parse certain fields of a page that is authenticated with cookies.
S Anand
October 28, 2008 at 1:43 am

Sure Rog. I’ve mailed it to you
Rog
October 28, 2008 at 1:07 am

Any chance you could share your xpath.php code? It seems the server is no longer available.
S Anand
March 7, 2009 at 10:35 am

Post Yahoo’s introduction of Yahoo Query Language, you’re much better off using that instead of my XPath utility. I’ve covered it in this article on client side scraping.
Pingback: Scraping your way to RSS Feeds! « Technosiastic!
Bart P
March 3, 2012 at 11:48 am

It would be great if you could share this code, I really like to use this server, but want to remove session ids from the links (so my reader doesn’t think all links are new every time).

Is that possible? 🙂
Ben
June 11, 2015 at 12:46 am

Is that possible to share your xpath.php code? yahoo pipes is going to be shut down 🙁
