Client side scraping

“Scraping” is extracting content from a website. It’s often used to build something on top of the existing content. For example, I’ve built a site that tracks movies on the IMDb 250 by scraping content. There are libraries that simplify scraping in most languages: WWW::Mechanize for Perl, BeautifulSoup for Python, Hpricot for Ruby, XPath (built-in) for PHP, and jQuery on env.js on Rhino for Javascript. But all of these are on the server side. That is, the program scrapes from your machine. Can you write a web page where the viewer’s machine does the scraping? ...

Infyblogs dashboard

I just finished Stephen Few’s book on Information Dashboard Design. It talks about what’s wrong with the dashboards from most Business Intelligence vendors (Business Objects, Oracle, Informatica, Cognos, Hyperion, etc.), and brings Tuftian principles of chart design to dashboards. So I took a shot at designing a dashboard based on those principles, and made this dashboard for InfyBLOGS. You can try it for yourself.

1. Go to http://www.s-anand.net/reco/ (Note: This only works within the Infosys intranet.)
2. Right click on the “Infyblog Dashboard” link and click “Add to Favourites…” (Non-IE users – drag and drop it to your links bar)
3. If you get a security alert, say “Yes” to continue
4. Return to InfyBLOGS, make sure you’re logged in (that’s important) and click on the “Infyblog Dashboard” bookmark
5. You’ll see a dashboard for your account, with comments and statistics

The rest of this article discusses design principles and the technology behind the implementation. (It’s long. Skim by reading just the bold headlines.) ...

To Python from Perl

I’ve recently switched to Python, after having programmed in Perl for many years. I’m sacrificing all my knowledge of the libraries and language quirks of Perl. I moved despite that for a somewhat trivial reason, actually. It’s because Python doesn’t require a closing brace. Consider this Javascript (or very nearly C or Java) code:

var s=0;
for (var i=0; i<10; i++) {
  for (var j=0; j<10; j++) {
    s = s + i * j
  }
}

That’s 6 lines, with two lines just containing the closing brace. Or consider Perl. ...

Bound methods in Javascript

The popular way to create a class in Javascript is to define a function and add methods to its prototype. For example, let’s create a class Node that has a method hide().

var Node = function(id) {
    this.element = document.getElementById(id);
};
Node.prototype.hide = function() {
    this.element.style.display = "none";
};

If the page had a header element with the id "header" (say, one reading Heading), then this piece of code will hide the element.

var node = new Node("header");
node.hide();

If I wanted to hide the element a second later, I am tempted to use:

var node = new Node("header");
setTimeout(node.hide, 1000);

… except that it won’t work. setTimeout has no idea that the function node.hide has anything to do with the object node. It just runs the function. When node.hide is called by setTimeout, the this object isn’t set to node, it’s set to window. The function ends up working on window, not node. ...
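One common way around this, sketched here under assumptions rather than taken from the rest of the article, is to hand setTimeout a function that still remembers node: either a small wrapper function or Function.prototype.bind. The fakeElement object is a hypothetical stand-in for document.getElementById("header") so the sketch runs outside a browser, and hide() is written against this.element.

```javascript
// Hypothetical stand-in for a DOM element, so this runs without a browser.
var fakeElement = { style: { display: "" } };

var Node = function (element) {
    this.element = element;
};
Node.prototype.hide = function () {
    this.element.style.display = "none";
};

var node = new Node(fakeElement);

// Passing node.hide on its own detaches it from node:
// when it is called, `this` is no longer the Node instance.
var detached = node.hide;
var detachedFailed = false;
try {
    detached();
} catch (e) {
    detachedFailed = true; // `this.element` is undefined here
}

// Option 1: wrap the call in a function that closes over node.
var wrapped = function () { node.hide(); };

// Option 2: bind `this` to node permanently.
var bound = node.hide.bind(node);

// Either can be handed to setTimeout, e.g. setTimeout(bound, 1000).
bound();
```

The wrapper works in any browser; bind needs an ES5-capable engine. Both keep the call attached to node, which is all setTimeout was missing.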

Downloading online songs

You know those songs on Raaga, MusicIndiaOnline, etc? The ones you can listen to but can’t download? Well, you can download them. It’s always been possible to download these files. After all, that’s how you get to listen to them in the first place. What stopped you is security by obscurity. You didn’t know the location where the song was stored, but if you did, you could download it. So how do you figure out the URL to download the file from? ...

Keyword searches as a Web command line

Andre mentions dumping Google Chrome because of lack of extension support, especially Ubiquity, and lists 15 useful Ubiquity commands. If you haven’t seen Ubiquity, you should. It’s a great extension that transforms your browser into an Internet command prompt. It is modelled on the Enso Launcher, which is a great piece of work by itself. I wasn’t quite prepared to let go of Chrome that easily. On Task Manager, seeing 10 Chrome processes, the largest of which takes up 60MB, is a lot more comforting, psychologically, than 1 Firefox process taking up 300MB. (I rarely hit my 1GB RAM limit, so it shouldn’t matter either way. Yet, the spendthrift in me keeps watching.) ...

Caching pages on Apache

I don’t use any blogging software for my site. I just hand-wired it some years ago. When doing this, one of the biggest problems was caching. Consider each blog entry page. Each page has the same template, but different content. Both the template and content could be changed. So ideally, blog pages should be served dynamically. That is, every time someone requests the page, I should look up the content, look up the template, and put them together. ...

In search of a good editor

It's amazing how hard it is to get a good programming editor. I've played around with more editors/IDEs than I care to remember: e, Notepad++, NoteTab, SciTE, Crimson Editor, Komodo, Eclipse, Aptana ... There are four features that are critical to me.

Syntax highlighting. Over time, I've found this to increase readability dramatically. Look at this piece of code with and without syntax highlighting: Doesn't the structure of the document just jump out with syntax highlighting? Anyway, I've gotten used to that.

Column editing. I want to be able to do this: Being able to type across rows is incredibly useful. I use it both for programming as well as to complement data-processing on Excel.

Unicode support. I often work with non-ASCII files, particularly in Tamil. Unicode support comes in handy when debugging pages for my songs site.

Auto-completion. This is 10 times more productive than having to look up the manual for each function.

(Oh, and it's got to be free too. Except for e Text Editor, all the others qualify.) ...

JPath - XPath for Javascript

XPath is a neat way of navigating deep XML structures. It's like using a directory structure. /table//td gets all the TDs somewhere below TABLE. Usually, you don't need this sort of a thing for data structures, particularly in JavaScript. Something like table.td would already work. But sometimes, it does help to have something like XPath even for data structures, so I built a simple XPath-like processor for Javascript called JPath. Here are some examples of how it would work: ...
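To make the `//td` idea concrete for plain data structures, here is a minimal sketch. It is not JPath and does not reflect its API; descendants() and the page object are hypothetical names used only for illustration. It collects every value stored under a given key at any depth, which is roughly what `//key` does in XPath.

```javascript
// Hypothetical helper, not JPath's actual API: collect every value stored
// under `key` at any depth, roughly what `//key` means in XPath.
function descendants(data, key) {
    var found = [];
    (function walk(value) {
        if (value === null || typeof value !== "object") return;
        for (var name in value) {
            if (!Object.prototype.hasOwnProperty.call(value, name)) continue;
            if (name === key) found.push(value[name]);
            walk(value[name]);
        }
    })(data);
    return found;
}

// Made-up sample data shaped loosely like a table.
var page = {
    table: {
        tr: [
            { td: "Title", th: "Rank" },
            { td: "Year" }
        ]
    }
};

var cells = descendants(page.table, "td");
// cells is ["Title", "Year"]
```

Here page.table.td would be undefined, because the td values sit inside the tr array; that is the kind of case where a path-style lookup helps.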

Automating Internet Explorer with jQuery

Most of my screen-scraping so far has been through Perl (typically WWW::Mechanize). The big problem is that it doesn't support Javascript, which can often be an issue:

The content may be Javascript-based. For example, Amazon.com shows the bestseller book list only if you have Javascript enabled. So if you're scraping the Amazon main page for the books bestseller list, you won't get it from the static HTML.

The navigation may require Javascript. Instead of links or buttons in forms, you might have Javascript functions. Many pages use these, and not all of them degrade gracefully into HTML. (Try using Google Video without Javascript.)

The login page uses Javascript. It creates some crazy session ID, and you need Javascript to reproduce what it does.

You might be testing a Javascript-based web-page. This was my main problem: how do I automate testing my pages, given that I make a lot of mistakes?

There are many approaches to overcoming this. The easiest is to use Win32::IE::Mechanize, which uses Internet Explorer in the background to actually load the page and do the scraping. It's a bit slower than scraping just the HTML, but it'll get the job done. ...

Statistically improbable phrases on Google AppEngine update

I’ve added some interactivity to the Statistically improbable phrases application. You can now filter out stopwords, dynamically filter infrequent words and commonly used words, and dynamically play with the contrast and font size.

Comments

Srikanth 12 Apr 2008 12:00 pm: Dear sir, I was searching for Ilayaraja songs and came across your wonderful compilation of 15 wonderful articles. Good one. Please do write more on music.

Collin 12 Apr 2008 12:00 pm: I love this application. Because now, I can create a url to NY Times, and see what is the main subject of the day. :)

S Anand 12 Apr 2008 12:00 pm: Thanks, Colin!

Spencer 12 Apr 2008 12:00 pm: I was curious as to whether or not I could use this pointed into a specific personal corpus to separate documents from one another.

Statistically improbable phrases on Google AppEngine

I read about Google AppEngine early this morning, and applied for an invite. Google’s issuing beta invites to the first 10,000 users. I was pretty convinced I wasn’t among those, but it turns out I was lucky. AppEngine lets you write web apps that Google hosts. People have been highlighting that it gives you access to the Google File System and BigTable for the first time. But to me, that isn’t a big deal. (I’m not too worried about reliability, and MySQL / flat files work perfectly well for me as a data store.) ...

Chaining functions in Javascript

One of the coolest features of jQuery is the ability to chain functions. The output of a function is the calling object. So instead of writing:

var a = $("<div></div>");
a.appendTo($("#id"));
a.hide();

… I can instead write:

$("<div></div>").appendTo($("#id")).hide();

A reasonable number of predefined Javascript functions can be used this way. I make extensive use of it with the String.replace function. But where this feature is not available, you can create it in a fairly unobtrusive way. Just add this code to your script: ...
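The original snippet is cut off in this excerpt. As a rough sketch of the general idea, and not the author's code, a method becomes chainable once it returns the calling object. The chainable() helper and the Counter class below are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: wrap a method so it returns the calling object
// whenever the original method returns nothing.
function chainable(fn) {
    return function () {
        var result = fn.apply(this, arguments);
        return result === undefined ? this : result;
    };
}

// A small example class whose methods originally return nothing.
function Counter() {
    this.value = 0;
}
Counter.prototype.add = chainable(function (n) {
    this.value += n;
});
Counter.prototype.double = chainable(function () {
    this.value *= 2;
});

var counter = new Counter().add(3).double().add(1);
// counter.value is now 7

// String.replace already chains, because it returns a new string.
var slug = " Hello World ".replace(/^\s+|\s+$/g, "").replace(/\s+/g, "-");
// slug is "Hello-World"
```

Wrapping methods one at a time keeps the change local: methods that already return a value are left alone, and only the ones that return nothing start returning `this`.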

Javascript error logging

If something goes wrong with my site, I like to know of it. My top three problems are: the site is down, a page is missing, or Javascript isn’t working. This is the last of 3 articles on these topics.

I am a bad programmer

I am not a professional developer. In fact, I’m not a developer at all. I’m a management consultant. (Usually, it’s myself I’m trying to convince.) Since no one pays me for what little code I write, no one shouts at me for getting it wrong. So I have a happy and sloppy coding style. I write what I feel like, and publish it. I don’t test it. Worse, sometimes, I don’t even run it once. I’ve sent little scripts off to people which wouldn’t even compile. I make changes to this site at midnight, upload it, and go off to sleep without checking if the change has crashed the site or not.

But no one tells me so

At work, that’s usually OK. On the few occasions where I’ve written Perl scripts or VB Macros that don’t work, people call me back within a few hours, very worried that THEY’d done something wrong. (Sometimes, I don’t contradict them.) On my site, I don’t always get that kind of feedback. People just click the back button and go elsewhere. Recently, I’ve been doing more Javascript work on my site than writing stuff. Usually, the code works for me. (I write it for myself in the first place.) But I end up optimising for Firefox rather than IE, and for the plugins I have, etc. When I try the same app a few months later on my media PC, it doesn’t work, and shockingly enough, no one’s bothered telling me about it all these months. They’d just click, nothing happens, they’d vanish.

But their browsers can tell me

The good part about writing code in Javascript is that I can catch exceptions. Any Javascript error can be trapped. So since the end of last year, I’ve started wrapping almost every Javascript function I write in a try {} catch (e) {} block. In the catch block, I send a log message reporting the error. The code looks something like this: ...
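The snippet itself is not included in this excerpt. A minimal sketch of the idea follows, assuming a hypothetical report() function that simply collects messages in memory; on a real page it would send the message to a server, and how that is done is not shown here. The logged() wrapper and parseCount are likewise made-up names.

```javascript
// Hypothetical reporter: collects messages in memory. On a real page this
// would send the message to a server instead (not shown in the excerpt).
var errorLog = [];
function report(message) {
    errorLog.push(message);
}

// Wrap a function so any exception it throws is reported, not lost.
function logged(name, fn) {
    return function () {
        try {
            return fn.apply(this, arguments);
        } catch (e) {
            report(name + ": " + (e && e.message ? e.message : String(e)));
        }
    };
}

var parseCount = logged("parseCount", function (text) {
    return JSON.parse(text).count;
});

parseCount('{"count": 3}'); // returns 3, nothing is logged
parseCount("not json");     // returns undefined, the error is logged
```

Wrapping at the function level means one helper covers every function, instead of repeating the try and catch blocks by hand. The trade-off is that a failing call now returns undefined rather than throwing, so callers have to tolerate that.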

Scraping RSS feeds using XPath

If a site doesn't have an RSS feed, your simplest option is to use Page2Rss, which gives a feed of what's changed on a page. My needs, sometimes, are a bit more specific. For example, I want to track new movies on the IMDb Top 250. They don't offer a feed. I don't want to track all the other junk on that page. Just the top 250. There's a standard called XPath. It can be used to search in an HTML document in a pretty straightforward way. Here are some examples: ...

Website load distribution using Javascript

My music search engine shows a list of songs as you type – sort of like Google’s autosuggest feature. I load my entire list of songs upfront for this to work. Though it’s compressed to load fast, each time you load the page, it downloads about 500KB worth of song titles. My allotted bandwidth on my hosting service is 3GB per month. To ensure I don’t exceed it, I uploaded the songs list to an alternate free server: Freehostia. This keeps my load down. If I exceed Freehostia’s limit, my main site won’t be affected – just the songs. I also uploaded half of them to Google Pages, to be safe. ...