Client side scraping

“Scraping” is extracting content from a website. It’s often used to build something on top of the existing content. For example, I’ve built a site that tracks movies on the IMDb 250 by scraping content.

There are libraries that simplify scraping in most languages.

But all of these are on the server side. That is, the program scrapes from your machine. Can you write a web page where the viewer’s machine does the scraping?

Let’s take an example. I want to display Amazon’s bestsellers that cost less than $10. I could write a program that scrapes the site and gets that information. But since the list updates hourly, I’ll have to run it every hour.

That may not be so bad. But consider Twitter. I want to display the latest iPhone tweets from http://search.twitter.com/search.atom?q=iPhone, but the results change so fast that your server can’t keep up.

Nor do you want it to. Ideally, your scraper should just be JavaScript on your web page. Any time someone visits, their machine does the scraping. The bandwidth is theirs, and you avoid the popularity tax.

This is quite easily done using Yahoo Query Language (YQL). YQL converts the web into a database. All web pages are in a table called html, which has two fields: url and xpath. You can get IBM’s home page using:

select * from html where url="http://www.ibm.com"

Try it at Yahoo’s developer console. The whole page is loaded into the query.results element. This can be retrieved using JSONP. Assuming you have jQuery, try the following in Firebug. You should see the contents of IBM’s home page logged to the console.

$.getJSON(
  'http://query.yahooapis.com/v1/public/yql?callback=?',
  {
    q: 'select * from html where url="http://www.ibm.com"',
    format: 'json'
  },
  function(data) {
    console.log(data.query.results)
  }
);

That’s it! Now, it’s pretty easy to scrape, especially with XPath. To get the links on IBM’s page, just change the query to

select * from html where url="http://www.ibm.com" and xpath="//a"

Or to get all external links from IBM’s site:

select * from html where url="http://www.ibm.com" and xpath="//a[not(contains(@href,'ibm.com'))][contains(@href,'http')]"
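Since every query above follows the same pattern (a url, plus an optional xpath), the strings can be assembled by a small helper. This is only a sketch: the function names buildYqlQuery and buildYqlUrl are made up for illustration, the endpoint is the one used in the $.getJSON call earlier, and the helper assumes neither the URL nor the XPath contains a double quote. It only builds strings and never contacts the service.

```javascript
// Build a YQL statement against the `html` table from its two fields.
// `xpath` is optional; without it the query asks for the whole page.
// Assumption: neither argument contains a double quote.
function buildYqlQuery(url, xpath) {
  var query = 'select * from html where url="' + url + '"';
  if (xpath) {
    query += ' and xpath="' + xpath + '"';
  }
  return query;
}

// Build the request URL for the public endpoint used earlier in this post.
// When calling it through $.getJSON, append '&callback=?' for JSONP.
function buildYqlUrl(query) {
  return 'http://query.yahooapis.com/v1/public/yql' +
    '?format=json&q=' + encodeURIComponent(query);
}
```

For example, buildYqlQuery("http://www.ibm.com", "//a") produces the links query shown above.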

Now you can display this on your own site, using jQuery.
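Here is one way that display step might look. It is a sketch under a few assumptions: that the JSON response for an xpath="//a" query puts the anchors under data.query.results.a, each with an href and (usually) a content field holding the link text; that a single match may arrive as a bare object rather than an array; and that your page has a list element with the hypothetical id "links". extractLinks and renderLinks are illustrative names, not part of jQuery or YQL.

```javascript
// Normalise the anchors in a YQL response into [{ href, text }, ...].
// Assumed response shape: { query: { results: { a: [ { href, content } ] } } }
function extractLinks(data) {
  var results = data && data.query && data.query.results;
  if (!results || !results.a) {
    return [];
  }
  // A single match is assumed to come back as an object, not an array.
  var anchors = Array.isArray(results.a) ? results.a : [results.a];
  return anchors
    .filter(function (a) { return a && a.href; })
    .map(function (a) {
      // Fall back to the href when there is no plain-text content.
      var text = typeof a.content === 'string' ? a.content : a.href;
      return { href: a.href, text: text };
    });
}

// Browser only: needs jQuery and a <ul id="links"> on the page (hypothetical id).
function renderLinks(links) {
  var $list = $('#links').empty();
  links.forEach(function (link) {
    var $item = $('<li>').appendTo($list);
    $('<a>').attr('href', link.href).text(link.text).appendTo($item);
  });
}
```

In the $.getJSON callback, you would call renderLinks(extractLinks(data)) instead of console.log.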

 

This leads to interesting possibilities, such as Map-Reduce in the browser. Here’s one example. Each movie on the IMDb (e.g. The Dark Knight) comes with a list of recommendations (like this). I want to build a repository of recommendations based on the IMDb Top 250. So here’s the algorithm. First, I’ll get the IMDb Top 250 using:

select * from html where url="http://www.imdb.com/chart/top" and xpath="//tr//tr//tr//td[3]//a"

Then I’ll get a random movie’s recommendations like this:

select * from html where url="http://www.imdb.com/title/tt0468569/recommendations" and xpath="//td/font//a[contains(@href,'/title/')]"

Then I’ll send off the results to my aggregator.
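The per-visitor part of that algorithm could be sketched roughly as follows. This is not the code that runs on the tracker; all function names and the shape of the report object are assumptions made for illustration, and the step that actually posts the report to an aggregator is left out because its endpoint is not described here.

```javascript
// Pick one item at random: each visitor handles a single movie.
// `random` defaults to Math.random and can be swapped out for testing.
function pickRandom(items, random) {
  var r = (random || Math.random)();
  return items[Math.floor(r * items.length)];
}

// Extract the IMDb title id (e.g. "tt0468569") from a link href, or null.
function titleId(href) {
  var match = /\/title\/(tt\d+)/.exec(href || '');
  return match ? match[1] : null;
}

// YQL statement for one movie's recommendations, mirroring the query above.
function recommendationsQuery(id) {
  return 'select * from html where url="http://www.imdb.com/title/' + id +
    '/recommendations" and xpath="//td/font//a[contains(@href,\'/title/\')]"';
}

// Shape the result for the aggregator (hypothetical payload format):
// unique recommended title ids, excluding the movie itself.
function buildReport(movieId, recommendationHrefs) {
  var ids = recommendationHrefs.map(titleId).filter(Boolean);
  var unique = ids.filter(function (id, i) {
    return ids.indexOf(id) === i && id !== movieId;
  });
  return { movie: movieId, recommendations: unique };
}
```

Each visitor does a small "map" step (one movie, one report), and the aggregator does the "reduce" by merging reports from many visitors.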

Check out the full code at http://250.s-anand.net/build-reco.js.

 

In fact, if you visited my IMDb Top 250 tracker, you already ran this code. You didn’t know it, but you just shared a bit of your bandwidth and computation power with me. (Thank you.)

And, if you think a little further, here’s another way of monetising content: by borrowing a bit of the user’s computation power to perform complex tasks. There already are startups built around this concept.

8 thoughts on “Client side scraping”

  2. Hey Anand,
    Great article. Just one query, though: in the client-side implementation, is it the client’s IP that hits the website (in your case http://www.ibm.com), or is it Yahoo’s server IP?
