If a site doesn't have an RSS feed, your simplest option is to use Page2Rss, which gives a feed of what's changed on a page.
My needs, sometimes, are a bit more specific. For example, I want to track new movies on the IMDb Top 250. They don't offer a feed. I don't want to track all the other junk on that page. Just the top 250.
There's a standard called XPath. It can be used to search in an HTML document in a pretty straightforward way. Here are some examples:
| //a | Matches all <a> links |
| //p/b | Matches all <b> bold items in a <p> para. (the <b> must be immediately under the <p>) |
| //table//a | Matches all links inside a table (the links need not be immediately inside the table -- anywhere inside the table works) |
You get the idea. It's like a folder structure. / matches a tag that's immediately below. // matches a tag that's somewhere below. You can play around with XPath using the Firefox XPath Checker add-on. Try it -- it's much easier to try it than to read the documentation.
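If you'd rather poke at these from a script than a browser add-on, here's a minimal sketch using Python's built-in xml.etree.ElementTree. The sample page is made up for illustration. Note two assumptions: ElementTree only understands a subset of XPath and needs well-formed markup, and it wants a leading "." where the expressions above start with "//".

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed sample page (invented for this example).
PAGE = """
<html><body>
  <p>Intro with <b>bold</b> and <i>italic <b>nested bold</b></i></p>
  <table><tr><td><span><a href="/x">deep link</a></span></td></tr></table>
  <a href="/y">top link</a>
</body></html>
"""

root = ET.fromstring(PAGE)

# //a : every link, wherever it sits in the page
all_links = [a.text for a in root.findall(".//a")]

# //p/b : only <b> tags immediately under a <p>
direct_bold = [b.text for b in root.findall(".//p/b")]

# //table//a : links anywhere inside a table, however deep
table_links = [a.text for a in root.findall(".//table//a")]

print(all_links)
print(direct_bold)
print(table_links)
```

The "nested bold" text sits inside an <i>, so //p/b skips it, while //table//a still finds the link buried inside a <span>.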
The following XPath matches the IMDb Top 250 exactly.
//tr//tr//tr//td[3]//a
(It's a link inside the 3rd column in a table row in a table row in a table row.)
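To see why that expression picks out just the movie links, here's a small sketch against a stand-in page. The markup below is hypothetical (three nested layout tables with two invented movies), not the real IMDb page, and it again uses ElementTree's ".//" form of the expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the chart page: rows nested three deep,
# with the movie link in the 3rd column of the innermost rows.
PAGE = """
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td>1.</td><td>9.1</td><td><a href="/title/a/">Movie A</a></td></tr>
      <tr><td>2.</td><td>9.0</td><td><a href="/title/b/">Movie B</a></td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""

root = ET.fromstring(PAGE)

# td[3] means "the 3rd <td> under its parent row"; //a then finds the link in it.
titles = [a.text for a in root.findall(".//tr//tr//tr//td[3]//a")]
print(titles)
```

The outer two rows only hold layout, so they never match: neither has a third column with a link in it.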
Now, all I need is to get something that converts that to an RSS feed. I couldn't find anything on the Web, so I wrote my own XPath server. The URL:
www.s-anand.net/xpath?
url=http://www.imdb.com/chart/top&
xpath=//tr//tr//tr//td[3]//a
When I subscribe to this URL on Google Reader, I get to know whenever there's a new movie on the IMDb Top 250.
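The URL above is split across lines for readability. If you're building one in code, it's safer to percent-encode the two parameters, since the XPath contains slashes and square brackets. Here's a minimal sketch; the `http://` scheme and the `feed_url` helper are my assumptions for illustration, and whether the server insists on encoded parameters isn't something the URL above tells us.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Scheme is assumed; only the host and path appear in the URL above.
BASE = "http://www.s-anand.net/xpath"

def feed_url(page_url, xpath):
    """Build a feed URL with both parameters percent-encoded (hypothetical helper)."""
    return BASE + "?" + urlencode({"url": page_url, "xpath": xpath})

url = feed_url("http://www.imdb.com/chart/top", "//tr//tr//tr//td[3]//a")
print(url)

# Decoding the query string gives back exactly what went in.
params = parse_qs(urlsplit(url).query)
print(params["xpath"][0])
```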
This gives only the names of the movies, though, and I'd like the links as well. The XPath server supports this. It accepts a root XPath, and a bunch of sub-XPaths. So you can say something like:
This says three things:
| //tr//tr//tr | Pick all rows in a row in a row |
| title->./td[3]//a | For each row, set the title to the link text in the 3rd column |
| link->./td[3]//a | ... and the link to the link href in the 3rd column |
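Here's a rough sketch of the same idea in Python: pick rows with a root XPath, then run each sub-XPath relative to the row. This is only my model of the behaviour described above, not the server's code. The `extract_items` function, the `(sub_xpath, attribute)` pairs, and the sample page are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in page: rows nested three deep, link in the 3rd column.
PAGE = """
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td>1.</td><td>9.1</td><td><a href="/title/a/">Movie A</a></td></tr>
      <tr><td>2.</td><td>9.0</td><td><a href="/title/b/">Movie B</a></td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""

def extract_items(page, root_xpath, fields):
    """fields maps a name to (sub_xpath, attribute); attribute None means the text."""
    doc = ET.fromstring(page)
    items, seen = [], set()
    for row in doc.findall(root_xpath):
        # Guard against the same row being reached by more than one path.
        if id(row) in seen:
            continue
        seen.add(id(row))
        item = {}
        for name, (sub_xpath, attr) in fields.items():
            node = row.find(sub_xpath)  # evaluated relative to this row
            if node is not None:
                item[name] = node.text if attr is None else node.get(attr)
        if item:  # rows where no sub-XPath matched are skipped
            items.append(item)
    return items

items = extract_items(
    PAGE,
    ".//tr//tr//tr",
    {"title": ("./td[3]//a", None), "link": ("./td[3]//a", "href")},
)
print(items)
```

Each item now carries both the link text and its href, which is what a feed entry needs.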
That provides a more satisfactory RSS feed -- one that I've subscribed to, in fact. Another one that I track is a list of new popular movies that make it to the mininova top seeded movies category.
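The last step, turning titles and links into something a feed reader accepts, is plain RSS 2.0 XML. Here's a minimal sketch of that step. The `build_rss` function, the description text, and the sample entries are hypothetical, and a real feed would probably also want dates and unique IDs per item.

```python
import xml.etree.ElementTree as ET

def build_rss(feed_title, feed_link, entries):
    """Serialize a list of {"title": ..., "link": ...} dicts as a bare-bones RSS 2.0 feed."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Generated from XPath matches"
    for entry in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        if "link" in entry:  # the link is optional, as in the title-only feed
            ET.SubElement(item, "link").text = entry["link"]
    return ET.tostring(rss, encoding="unicode")

# Invented entries, standing in for whatever the XPaths matched.
feed = build_rss(
    "IMDb Top 250",
    "http://www.imdb.com/chart/top",
    [{"title": "Movie A", "link": "/title/a/"}, {"title": "Movie B"}],
)
print(feed)
```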
You can whip up more complex examples. Give it a shot. Start simple, with something that works, and move up to what you need. Use XPath Checker liberally. Let me know if you have any issues. Enjoy!