Web lookup using Google Spreadsheets

I’d written earlier about Web lookup in Excel. I showed an example how you could create a movie wishlist that showed the links to the torrents from Mininova.

You can do that even easier on Google Spreadsheets. It has 4 functions that let you import external data:

  • =importData(“URL of CSV or TSV file”).
    Imports a comma-separated or tab-separated file.
  • =importFeed(URL).vLets you import any Atom or RSS feed.
  • =importHtml(URL, “list” | “table”, index).
    Imports a table or list from any web page.
  • =importXML(“URL”,”query”).
    Imports anything from any web page using XPath.

Firstly, you can see straight off why it’s easy to view RSS feeds in Google Spreadsheets. Just use the importFeed function straight away. So, for example, if I wanted to track all 8GB iPods on Google Base, I can import its feed in Google Spreadsheets.

Google Spreadsheets ImportFeed

This automatically creates a list of the latest 8GB iPods.

Incidentally, the “Price” column doesn’t appear automatically. It’s a part of the description. But it’s quite easy to get the price using the standard Excel functions. Let’s say the description is in cell C1. =MID(C1, FIND("Price", C1), 20) gets you the 20 characters starting from “Price”. Then you can sort and play around as usual.

The other powerful thing about Google Spreadsheets is the CONTINUE function. The importFeed function creates a multi-dimensional array. You can extract any cell from the array (for example, row 3, column 2 from cell C1) using CONTINUE(C1, 3, 2). So you can just pick up the title and description, or only alternate rows, or put all rows and columns in a single column — whatever.

Google Spreadsheets CONTINUE

The most versatile of the import functions is the importXML function. It lets you import any URL (including an RSS feed), filtering only the XPath you need. As I mentioned earlier, you can scrape any site using XPath.

For example, =importXML("http://www.imdb.com/chart/top", "//table//table//table//a") imports the top 250 movies from the IMDb Top 250. the second parameter says, get all links (a) inside a table inside a table inside a table. This populates a list with the entire Top 250.

Google Spreadsheets - ImportXML

Now, against each of these, we could get a feed of Mininova’s torrents. Mininova’s RSS URL is http://www.mininova.org/rss/search_string. So, in cell B1, I can get a torrent for the cell A1 (The Godfather) using the importFeed function. (Note: you need to replace spaces with a + symbol. These functions don’t like invalid URLs.).

Google Spreadsheets - Import Mininova Feed

Just copy this formula down to get a torrent against each of the IMDb Top 250 movies!

Check out the sheet I’ve created. (You need a Google account to see the sheet. If you don’t want have one, you can view the sheet.)


Now, that’s still not the best of it. You can extract this file as an RSS feed! Google lets you publish your sheets as HTML, PDF, Text, XLS, etc. and RSS and Atom are included as well. Here’s the RSS feed for my sheet above.

Think about it. We now have an application that sucks in data from a web page, does a web-based vlookup on another service, and returns the results as an RSS feed!

There are only two catches to this. The first is that Google has restricted us to 50 import functions per sheet. So you can’t really have the IMDb Top 250 populated here — only the top 49. The second is that the spreadsheet updates only when you open it again. So it’s not really a dynamically updating feed. You need to open the spreadsheet to refresh it.

But if you really wanted these things, there’s always Yahoo! Pipes.

Taare Zameen Par and Calvin

Watch this segment of Taare Zameen Par.

Then check these Calvin & Hobbes strips.

A small red spacecraft breaks through the cloud cover of Mysterio system planet 6! At the controls, it's none other than our fearless hero, Spaceman Spiff! Piloting over the lifeless world, he reflects on his unusual mission... QUIZ 1. 6+5= ... to somehoe crash planets 6 and 5 together!

In a scientific mission to discover what happens when two planets collide, Spaceman Spiff drops anchor! The anchor catches on a hillside! Spiff downshifts and guns the motor! Imperceptibly at first, the planet slowly moves, towed along by our hero, until... ...breaking orbit, planet 6 picks up speed, hurling towards planet 5!

Pulled by Spaceman Spiff, planet 6 is about to collide with planet 5! With no time to lose, our hero cuts loose the anchor and flies to safety! The planets crash, grinding and shattering with awful force! Planet 5, being smaller, is crunched to dust! Only 6 remains! 6+5=6. Time! Pass your papers forward. TIME?! I just finished the first problem!

Bless them both — Aamir and Bill.

Tamil songs quiz – Enchanting first interludes

Some background scores just stay in your mind. Here is a tribute to 20 wonderful first interludes, dating from the 1980s to the 2000s. Can you guess which movies they are from? (My intention here is not to make this tough, but rather to let you enjoy the music. So hope to see most of you score 20/20)

Don’t worry about the spelling. Just spell it like it sounds, and the box will turn green.

Search for the song and listen online, if you want to confirm your guess.

Score: 0 / 20
Song 1
Song 2
Song 3
Song 4
Song 5
Song 6
Song 7
Song 8
Song 9
Song 10
Song 11
Song 12
Song 13
Song 14
Song 15
Song 16
Song 17
Song 18
Song 19
Song 20

Javascript error logging

If something goes wrong with my site, I like to know of it. My top three problems are:

  1. The site is down
  2. A page is missing
  3. Javascript isn’t working

This is the last of 3 articles on these topics.

I am a bad programmer

I am not a professional developer. In fact, I’m not a developer at all. I’m a management consultant. (Usually, it’s myself I’m trying to convince.)

Since no one pays me for what little code I write, no one shouts at me for getting it wrong. So I have a happy and sloppy coding style. I write what I feel like, and publish it. I don’t test it. Worse, sometimes, I don’t even run it once. I’ve sent little scripts off to people which wouldn’t even compile. I make changes to this site at midnight, upload it, and go off to sleep without checking if the change has crashed the site or not.

But no one tells me so

At work, that’s usually OK. On the few occasions where I’ve written Perl scripts or VB Macros that don’t work, people call me back within a few hours, very worried that THEY’d done something wrong. (Sometimes, I don’t contradict them.)

On my site, I don’t always get that kind of feedback. People just click the back button and go elsewhere.

Recently, I’ve been doing more Javascript work on my site than writing stuff. Usually, the code works for me. (I write it for myself in the first place.) But I end up optimising for Firefox rather than IE, and for the plugins I have, etc. When I try the same app a few months later on my media PC, it doesn’t work, and shockingly enough, no one’s bothered telling me about it all these months. They’d just click, nothing happens, they’d vanish.

But their browsers can tell me

The good part about writing code in Javascript is that I can catch exceptions. Any Javascript error can be trapped. So since the end of last year, I’ve started wrapping almost every Javascript function I write in a try {} catch() {} block. In the catch block, I send a log message reporting the error.

The code looks something like this:

function log(e, msg) {
    for (var i in e) { msg += i + "=" + e[i] + "\n"; }
    (new Image()).src="log.pl?m=" + encodeURIComponent(msg);
}

function abc() {
    try {
    // ... function code
    } catch(e) { log(e, "abc"); }
}

Any time there’s an error in function abc, the log function is called. It sends the function name ("abc") and the error details (the contents of the error event) to log.pl, which stores the error, along with details like the URL, browser, time and IP address. This way, I know exactly where what error occurs.

This is a fantastic for a three reasons.

  • It tells me when I’ve goofed up. This is instantaneous feedback. I don’t have to wait for a human. If you run my program on your machine, and it fails, I get to know immediately. (Well, as soon as I read the error log, at least.)
  • It tells me where I’ve goofed up. The URL and the function name clearly indicate the point of failure.
  • It tells me why I’ve goofed up. Almost. Using the browser name and the error message, I can invariably pinpoint the reason for the error. Then it’s just a matter of taking the time to fix it.

I’d think this sort of error reporting should be the norm for any software. At least for a web app, given how easy it is to implement.

Monitoring site downtime

If something goes wrong with my site, I like to know of it. My top three problems are:

  1. The site is down
  2. A page is missing
  3. Javascript isn’t working

I’ll talk about how I manage these over 3 articles.

My site used to go down a lot. Initially that was because I kept playing around with mod_rewrite and other Apache modules without quite understanding them. I’d make a change and upload it without testing. (I’m like that.) And then I’d go to sleep.

Next morning, the site’s down, and has been down all night.

This is a bit annoying. Partly because I couldn’t use my site, but mostly because of the Oh yeah, sorry — I goofed up last night replies that I have to send out the next day.

So I started using Site24x7 to track if my website was down. It’s a convenient (and free) service. It pings my site every hour. If it’s down, I get an SMS. If it’s back up, I get an SMS. It also keeps a history of how often the site is down.

Site24x7

Over time, I stopped making mistakes. But my site still kept going down, thanks to my hosting service (100WebSpace). When I goof up, it’s just an annoyance, and I can fix it. But when my hosting service goes down, it’s more than that. My site is where I listen to music, read comics, read RSS feeds, use custom search engines, watch movies, browse for books, etc. Not being able to do these things — nor fix the site — is suffocating.

Worse, I couldn’t sleep. I use my mobile as my alarm. It’s annoying to hear an SMS from under your pillow at 3am every day — especially if it says your site is down.

So I switched to HostGator a few months ago. Nowadays, the site is down a lot less. (In times of trouble, it becomes sluggish, but doesn’t actually go down.)

That came at a cost, though. I was paying 100 WebSpace about $25 per annum. I’m paying Hostgator about $75 per annum. Being the kind that analyses purchases to death, the big question for me was, is this worth it. There is where my other problem with the site being down kicks in. I get a bit of ad revenue from my site, and I lose that when the site’s down. (Not that it’s much. Still…)

According to Site24x7, my site was up ~98% of the time. So I’m losing about 2% of my potential ad revenue. For the extra $50 to be worth it, my ad revenue needs to be more than $50 / 2% = $2,500 per annum. I’m nowhere near it. So the switch isn’t actually a good idea economically, but it does make life convenient (which is pretty important) and I sleep better (much more so).

The important thing, I’ve realised, is not just to track this stuff. That’s useful, sure. But what really made Site24x7 useful to me is that it would alert me when there was a big problem.

There are many kinds of alerting.

There’s a report you can view whenever you remember to view it. (It could be an RSS feed, so at least you won’t have to remember the site. But you still need to read your feeds.)

Then there’s the more pushy alerting: sending you an e-mail. That may catch you instantly for the half of the day that you’re online. Or, if you’re like me, it may completely escape your attention. (I don’t read e-mail.)

And then there’s the equivalent of shaking you by the shoulder — getting an SMS. (At least, that’s how it is for me. Incidentally, I don’t reply to SMS either. Calling me gets a reply. Nothing else.)

The type of alerting is clearly a function of the severity of the problem. Wake me up when my site goes down. Don’t wake me up if a link is broken.

Site24x7 sends me an SMS when my site is down. Fits the bill perfectly.

Handling missing pages

If something goes wrong with my site, I like to know of it. My top three problems are:

  1. The site is down
  2. A page is missing
  3. Javascript isn’t working

This article covers the second topic.

One thing I’m curious about is hits to non-existent pages (404s) on my site. I usually get 404s because:

  • I renamed the page
  • Someone typed a wrong URL
  • Someone followed a wrong link

Find the 404

The first problem is to know when someone gets a 404. I’ve seen sites that tell you to contact the administrator in case of a 404. That’s crazy. The administrator should automatically detect of 404s! Almost every web server provides this facility.

The real issue is attention. I receive 700 404s a day. That’s too much to manually inspect. And most of these are not for proper web pages, but for images (for example, almost all my 404s used to be for browsers requesting favicon.ico) or weird MS Office files.

I’m interested in a small subset of 404 errors. Those that hit a web page, not support files. And those requested by a human, not a search engine or a program.

A decent way of filtering these is to use Javascript in your 404 page. Javascript is typically executed only by browsers (i.e. humans, not search engines), and only in a web page (not images, etc.) So if you serve Javascript in your 404 page, and it gets executed, it’s likely to be a human requesting a web page.

I have a piece of Javascript in my custom 404 page that looks something like this:

<script>(new Image()).src = "/log.pl";</script>

Every time this code runs, it loads a new image. The source of the image is a Perl script, log.pl. Every time log.pl is accessed, it logs the URL from which it was called. I’m reasonably guaranteed that these are web pages a human tried to access.

The reduction in volume is tremendous. On a typical month, I get ~20,000 404 errors. With the Javascript logging, it’s down to around 200 a month, and most of them quite meaningful.

Point to the right page

Sometimes, the change happens because I changed the URLs. I keep fiddling with the site structure. Someone would have links to an old page that I’ve renamed. I may not even know that. Even if I did, they can’t be bothered with requests to change the link. So I’ve got to handle it.

The quickest way, I find, is to use Apache’s mod_rewrite. You can simply redirect the old URL to the new URL. For example, I used to have a link to /calvin.html which I now point to /calvinandhobbes.html. That becomes a simple line on my .htaccess file:

RewriteRule calvin.html  calvinandhobbes.html

I don’t do this for every site restructuring, though. I just restructure, wait for someone to request a wrong page, and when my 404 error log warns me, I create a line in the .htaccess. It keeps the redirections down to a minimum, and only for those links that are actually visited.

Be flexible with the URL structure

Sometimes people type in a wrong link. Often, these are unintentional. Here are some common misspellings for my Hindi songs search.

s-anand.net/hindi/
s-anand.net/Hindi
s-anand.net/hiundi

Occasionally, people are exploring the structure of my site:

s-anand.net/excel
s-anand.net/music
s-anand.net/hits

I need to decide what to do with both cases. For the former, sometimes my URL structure is too restrictive. I mean, why should someone have to remember to type /hindi instead of /Hindi or /hindi/? Who cares about case? Who cares about a trailing slash?

In such cases, I map all the variants to the right URL using mod_rewrite. For example, typing s-anand.net/HiNDi (with or without caps, with or without a slash at the end) will still take you to the right page.

As I keep discovering new mis-spellings, I take a call on whether to add it or not. The decision is usually based on volume. If two people make the same spelling mistake in a day, I almost certainly add the variant. Most of the time, it’s just typing errors like /hiundi which isn’t repeated oftener than once a month.

Provide search

To handle the exploratory URLs, and people following wrong links, I’ve turned my custom 404 page into a search engine.

For example, when someone types s-anand.net/excel, I know they’re searching for Excel. So I just do a Google Custom Search within my site for “excel” — that is, anything following the URL.

It’s a bit more complex than that, actually. I do a bit of tweaking to the URL, like convert punctuations (underscore, hyphen, full-stop, etc.) to spaces, remove common suffixes (.html, .htm) and ignore numbers. Quite often, it matches something on my site that they’re looking for. If not, ideally, I ought to try for various alternatives and subsets of the original search string to figure out a good match. But given that the number of mismatches is down to about one a day, I’m fairly comfortable right now.

What this means, incidentally, is that my site is, by default, a search engine for itself. To search for movie-related stuff on my site, just type s-anand.net/movie and you get a search of the word “movie” on my site. (Sort of like on a9.com, where searching for a9.com/keyword does a search on the keyword.)

Managing feed overload

I have only two problems with Google Reader.

The first is that it doesn’t support authenticated feeds. Ideally, I’d have liked to have a single reading list that combines my e-mail with newsfeeds. GMail offers RSS feeds of your e-mail. But the feeds require authentication (obviously) and Google Reader doesn’t support that right now. (So I usually don’t read e-mail 🙂

The second is that it’s tough to manage large feeds. It’s a personal quirk, really. I like to read all entries. If there are 100, I read all 100. If there are 1000, I struggle but read all 1000. I’m too scared to “Mark all read” because there are some sources that I don’t want to miss.

The 80-20 rule is at work here. There are some prolific writers (like Scoble) who write many entries a day. There are some prolific sources (del.icio.us or digg). I can’t keep up with such writers / sources. I don’t particularly want to. If I missed one day of del.icio.us popular items, I’ll just read the next day’s.

With Google Reader, that makes me uneasy. I don’t like having 200 unread items. I don’t like to mark them all read.

In such cases, popurls‘ approach is useful. It shows the top 15-30 entries of the popular sites as a snapshot. Any time you’re short of things to read, visit this. If you’re busy, don’t.

Using Google’s AJAX Feed API, it’s quite trivial to build your own feed reader. So I cloned popurls‘ layout into my bookmarks page, and put in feeds that I like.

You can customise my bookmarks page to use your own feeds. Save the page, open it in Notepad, and look for existing feeds. They’ll look like this:

"hacker news" : {
    entries:15,
    url:"http://news.ycombinator.com/rss"
},

The first line (“hacker news”) is the title of the feed. You can call it what you want. Set entries to the number of feed entries you want to show. Set url to the RSS feed URL. Save it, and you have your own feed reader. (If you want to put it up on your site, you may want to change the Google API key.)

Try it! Just save this page and edit the feeds.


Here, I must point out three things about Google’s AJAX Feed API that make it extremely powerful.

The obvious is that is allows Javascript access to RSS in a very easy way. That makes it very easy to integrate with any web page.

The second is subtler — it includes historical entries. So even if an RSS feed had only 10 entries, I could pick up the last 100 or 1,000, as long as Google has known about the feed for long enough. This is what makes Google Reader more of a platform rather than a simple feed reader application. Google Reader is a feed archiver — not just a feed reader.

The third (I’m a bit crazy here) is that you can use it to schedule tasks. Google’s FeedFetcher refreshes feeds every 3 hours or so. If you want to do something every three hours (or some multiple thereof — say every 24 hours), you can write a program that does what you want, and subscribe to it’s output as a feed.

You may notice that I have a Referrers to s-anand.net on my bookmarks page. These are the sites that someone clicked on to visit my site. I have a PHP application that searches my access log for external referrers. Rather than visit that page every time, I just made an RSS feed out of it and subscribed to it. Every three hours or so, Google accesses the URL. I search my access.log and archives the latest results. So, even after my access.log is trimmed by the server, I have it all on Google Reader to catch up with later.

Since Google doesn’t forget to ping, I can schedule some fairly time-critical processes this way. For instance, if I wanted to download each Dilbert strip, every day as it arrives, I can create an application that downloads the file and returns a feed entry. Now, I don’t need to remember to run it every day! I just subscribe to the application on Google Reader, and Google will remind the application to run every 3 hours. (I know — I could use a crontab, but somehow, I like this.) Plus I would get the Dilbert strip on my feed reader as a bonus.


Update: Google has just released PartnerBar, which is a more flexible way of viewing a snapshot of feeds.

Scraping RSS feeds using XPath

If a site doesn’t have an RSS feed, your simplest option is to use Page2Rss, which gives a feed of what’s changed on a page.

My needs, sometimes, are a bit more specific. For example, I want to track new movies on the IMDb Top 250. They don’t offer a feed. I don’t want to track all the other junk on that page. Just the top 250.

There’s a standard called XPath. It can be used to search in an HTML document in a pretty straightforward way. Here are some examples:

//a Matches all <a> links
//p/b Matches all <b> bold items in a <p> para. (the <b> must be immediately under the <p>)
//table//a Matches all links inside a table (the links need not be immediately inside the table — anywhere inside the table works)

You get the idea. It’s like a folder structure. / matches the a tag that’s immediately below. // matches a tag that’s somewhere below. You can play around with XPath using the Firefox XPath Checker add-on. Try it — it’s much easier to try it than to read the documentation.

The following XPath matches the IMDb Top 250 exactly.

//tr//tr//tr//td[3]//a

(It’s a link inside the 3rd column in a table row in a table row in a table row.)

Now, all I need is to get something that converts that to an RSS feed. I couldn’t find anything on the Web, so I wrote my own XPath server. The URL:

www.s-anand.net/xpath?
url=http://www.imdb.com/chart/top&
xpath=//tr//tr//tr//td[3]//a

When I subscribe to this URL on Google Reader, I get to know whenever there’s a new movie on the IMDb Top 250.

This gives only the names of the movies, though, and I’d like the links as well. The XPath server supports this. It accepts a root XPath, and a bunch of sub-XPaths. So you can say something like:

xpath=//tr//tr//tr title->./td[3]//a link->./td[3]//a/@href

This says three things:

//tr//tr//tr Pick all rows in a row in a row
title->./td[3]//a For each row, set the title to the link text in the 3rd column
link->./td[3]//a … and the link to the link href in the 3rd column

That provides a more satisfactory RSS feed — one that I’ve subscribed to, in fact. Another one that I track is a list of mininova top seeded movies category.

You can whiff up more complex examples. Give it a shot. Start simple, with something that works, and move up to what you need. Use XPath Checker liberally. Let me know if you have any isses. Enjoy!

Advanced Google Reader

I’ve stopped visiting websites.

No, really. There’s only one website I visit these days. Google Reader.

Google Reader screenshot

Google Reader is a feed reader. If you want to just catch up on the new stuff on a site, you can add the site to Google Reader. Anything new that is published on the site appears in Google Reader. Right now, I’ve subscribed to over 50 feeds. There’s no way I can remember to visit 50 sites — so I’m actually able to read more and miss less.

In a sense, that’s the essence of using feeds to read stuff: remember less, read more and miss less. If I ever find an interesting site, I just add it to Google Reader and forget about it.

But it goes beyond that. You can subscribe to search results.

Videos, for examples. Subscribe to a search for google engedu on Google Video. Any new TechTalk at Google, you get it in your Reader. (The talks are fabulous by the way.) Or search for hindi movies.

Photos? Track new photos of the Indian cricket team on Flickr.

Movies? Get an update of the latest movie torrents from mininova or torrentspy. Or just get updates on American Gangster DVDRip.

Presentations? Track anything on Javascript or consulting from SlideShare.

Documents? Find new eBooks on Scribd.

Not all pages offer a feed. But Page2Rss offers a reasonable solution. It converts pretty much any page into a feed.

It’s gotten to the point where anything I know I want to read, I put it on Google Reader. The rest of the Internet is when I don’t know what I want to read.

Website load distribution using Javascript

My music search engine shows a list of songs as you type — sort of like Google’s autosuggest feature. I load my entire list of songs upfront for this to work. Though it’s compressed to load fast, each time you load the page, it downloads about 500KB worth of song titles.

My allotted bandwidth on my hosting service is 3GB per month. To ensure I don’t exceed it, I uploaded the songs list to an alternate free server: Freehostia. This keeps my load down. If I exceed Freehostia’s limit, my main site won’t be affected — just the songs. I also uploaded half of them to Google Pages, to be safe.

This all worked fine… until recently. Google Pages has a relatively low bandwidth restriction. (Not sure what, and they won’t reveal it, but my site is affected.) Freehostia is doing some maintenance, and their site goes down relatively often. So my song search goes down when any of these go down.

Now, these are rarely down simultaneously. Just one or the other. But whenever Freehostia is down, I can’t listen to one bunch of songs. When Google Pages is down, I can’t listen to another.

What I needed was a load distribution set-up. So I’ve made one in JavaScript.

Normally, I load the song list using an external javascript. I have a line that says:

<script src="http://sanand.freehostia.com/songs/..."></script>

… and the song’s loaded from Freehostia.

What I’d like to do is:

loadscripts(
    "list.hasLoaded()", [
    "http://sanand.freehostia.com/songs/...",
    "http://root.node.googlepages.com/...",
    "http://www.s-anand.net/songs/..."
])

If the function can’t load it from the first link, it loads it from the second, and so on, until list.hasLoaded() returns true.

Here’s how I’ve implemented the function:

function loadscripts(check, libs) {
    document.write('<script src="' + libs.shift() +
        '"><' + '/script>');

    if (libs.length>0) document.write('<script>if (!(' + check + ')) {'
        + 'loadscripts("' + check + '",["' + libs.join('","')+'"])'
        + '}</' + 'script>')
}

The first document.write loads the first script in the list. The if condition checks if there’s more scripts to load. If yes, the second document.write writes a script that

  • checks whether the script is already loaded, and
  • if not, loads the rest of the scripts using the same function.

I’ve expanded the sites that have these free songs as well. So now, as long as my main site works, and at least one of the other sites work, the search engine will work.

PS: You can easily expand this to do random load distribution as well.