JPath - XPath for Javascript

XPath is a neat way of navigating deep XML structures. It's like using a directory structure. /table//td gets all the TDs somewhere below TABLE. Usually, you don't need this sort of thing for data structures, particularly in JavaScript. Something like table.td would already work. But sometimes, it does help to have something like XPath even for data structures, so I built a simple XPath-like processor for Javascript called JPath. Here are some examples of how it would work: ...
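
To make the idea concrete, here is a minimal sketch of an XPath-like lookup over plain Javascript objects. It is not JPath itself: the names query and collect, the sample data, and the choice to support only "/key" (direct child) and "//key" (any descendant) are all assumptions made for illustration.

```javascript
// Hypothetical mini path query (NOT the author's JPath).
// Supports "/key" for a direct child and "//key" for any descendant.
function query(root, path) {
  // Split "/table//td" into steps: {deep: false, key: "table"}, {deep: true, key: "td"}
  var steps = [];
  var re = /(\/\/?)([^\/]+)/g;
  var m;
  while ((m = re.exec(path)) !== null) {
    steps.push({ deep: m[1] === "//", key: m[2] });
  }
  // Apply each step to every node matched by the previous step.
  var current = [root];
  steps.forEach(function (step) {
    var next = [];
    current.forEach(function (node) {
      collect(node, step.key, step.deep, next);
    });
    current = next;
  });
  return current;
}

// Push node[key] onto out; if deep is true, also search every descendant.
function collect(node, key, deep, out) {
  if (node === null || typeof node !== "object") return;
  if (Array.isArray(node)) {
    // Arrays are treated as transparent: look inside each element.
    node.forEach(function (item) { collect(item, key, deep, out); });
    return;
  }
  Object.keys(node).forEach(function (k) {
    if (k === key) out.push(node[k]);
    if (deep) collect(node[k], key, true, out);
  });
}

// Made-up sample data shaped loosely like a table.
var data = { table: { caption: "demo", tr: [{ td: "a" }, { td: "b" }] } };
var cells = query(data, "/table//td");
var anywhere = query(data, "//td");
var missing = query(data, "/td");
```

With the sample data, both "/table//td" and "//td" collect the two td values, while "/td" finds nothing because td is not a direct child of the root.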

Automating Internet Explorer with jQuery

Most of my screen-scraping so far has been through Perl (typically WWW::Mechanize). The big problem is that it doesn't support Javascript, which can often be an issue:

The content may be Javascript-based. For example, Amazon.com shows the bestseller book list only if you have Javascript enabled. So if you're scraping the Amazon main page for the books bestseller list, you won't get it from the static HTML.
The navigation may require Javascript. Instead of links or buttons in forms, you might have Javascript functions. Many pages use these, and not all of them degrade gracefully into HTML. (Try using Google Video without Javascript.)
The login page uses Javascript. It creates some crazy session ID, and you need Javascript to reproduce what it does.
You might be testing a Javascript-based web-page. This was my main problem: how do I automate testing my pages, given that I make a lot of mistakes?

There are many approaches to overcoming this. The easiest is to use Win32::IE::Mechanize, which uses Internet Explorer in the background to actually load the page and do the scraping. It's a bit slower than scraping just the HTML, but it'll get the job done. ...

Statistically improbable phrases on Google AppEngine update

I’ve added some interactivity to the Statistically improbable phrases application. You can now:

Filter out stopwords
Dynamically filter infrequent words and commonly used words
Dynamically play with the contrast and font size

Comments

Srikanth 12 Apr 2008 12:00 pm: Dear sir, I was searching for Ilayaraja songs and came across your wonderful compilation of 15 wonderful articles. Good one. Please do write more on music.
Collin 12 Apr 2008 12:00 pm: I love this application. Because now, I can create a url to NY Times, and see what is the main subject of the day. :)
S Anand 12 Apr 2008 12:00 pm: Thanks, Colin!
Spencer 12 Apr 2008 12:00 pm: I was curious as to whether or not I could use this pointed into a specific personal corpus to separate documents from one another.

Statistically improbable phrases on Google AppEngine

I read about Google AppEngine early this morning, and applied for an invite. Google’s issuing beta invites to the first 10,000 users. I was pretty convinced I wasn’t among those, but it turns out I was lucky. AppEngine lets you write web apps that Google hosts. People have been highlighting that it gives you access to the Google File System and BigTable for the first time. But to me, that isn’t a big deal. (I’m not too worried about reliability, and MySQL / flat files work perfectly well for me as a data store.) ...

Chaining functions in Javascript

One of the coolest features of jQuery is the ability to chain functions. The output of a function is the calling object. So instead of writing: var a = $("<div></div>"); a.appendTo($("#id")); a.hide(); … I can instead write: $("<div></div>").appendTo($("#id")).hide(); A reasonable number of predefined Javascript functions can be used this way. I make extensive use of it with the String.replace function. But where this feature is not available, you can create it in a fairly unobtrusive way. Just add this code to your script: ...
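
As a rough illustration of the idea (and not the code from the original post), one way to make an object's methods chainable is to wrap each one so that it returns the object whenever it would otherwise return nothing. The helper name makeChainable and the counter object are made up for this sketch.

```javascript
// Hypothetical helper (NOT the author's code): wrap every method of obj so
// that a method returning undefined returns the object itself instead.
function makeChainable(obj) {
  Object.keys(obj).forEach(function (name) {
    var original = obj[name];
    if (typeof original !== "function") return; // leave plain properties alone
    obj[name] = function () {
      var result = original.apply(this, arguments);
      // Methods with a real return value keep it; the rest become chainable.
      return result === undefined ? this : result;
    };
  });
  return obj;
}

// Made-up example object whose setters return nothing.
var counter = makeChainable({
  total: 0,
  add: function (n) { this.total += n; },
  double: function () { this.total *= 2; },
  value: function () { return this.total; }
});

// (0 + 2) * 2 + 1
var result = counter.add(2).double().add(1).value();
```

Here add and double return nothing on their own, so the wrapper hands back counter and the calls can be strung together; value returns a number, so it ends the chain.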

Javascript error logging

If something goes wrong with my site, I like to know of it. My top three problems are:

The site is down
A page is missing
Javascript isn’t working

This is the last of 3 articles on these topics.

I am a bad programmer

I am not a professional developer. In fact, I’m not a developer at all. I’m a management consultant. (Usually, it’s myself I’m trying to convince.) Since no one pays me for what little code I write, no one shouts at me for getting it wrong. So I have a happy and sloppy coding style. I write what I feel like, and publish it. I don’t test it. Worse, sometimes, I don’t even run it once. I’ve sent little scripts off to people which wouldn’t even compile. I make changes to this site at midnight, upload it, and go off to sleep without checking if the change has crashed the site or not.

But no one tells me so

At work, that’s usually OK. On the few occasions where I’ve written Perl scripts or VB Macros that don’t work, people call me back within a few hours, very worried that THEY’d done something wrong. (Sometimes, I don’t contradict them.) On my site, I don’t always get that kind of feedback. People just click the back button and go elsewhere. Recently, I’ve been doing more Javascript work on my site than writing stuff. Usually, the code works for me. (I write it for myself in the first place.) But I end up optimising for Firefox rather than IE, and for the plugins I have, etc. When I try the same app a few months later on my media PC, it doesn’t work, and shockingly enough, no one’s bothered telling me about it all these months. They’d just click, nothing happens, they’d vanish.

But their browsers can tell me

The good part about writing code in Javascript is that I can catch exceptions. Any Javascript error can be trapped. So since the end of last year, I’ve started wrapping almost every Javascript function I write in a try {} catch (e) {} block. In the catch block, I send a log message reporting the error. The code looks something like this: ...
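
A minimal sketch of that pattern, assuming a wrapper function rather than hand-written try/catch blocks, might look like the following. The names withErrorLogging, report and logged are hypothetical, and the reporter here only collects entries in an array; it is not the logging code from the original post.

```javascript
// Hypothetical sketch (NOT the author's code): wrap a function so that any
// exception it throws is passed to a reporter instead of escaping.
function withErrorLogging(name, fn, report) {
  return function () {
    try {
      return fn.apply(this, arguments);
    } catch (e) {
      // Report which function failed and why; the wrapper then returns undefined.
      report({ fn: name, message: String((e && e.message) || e) });
    }
  };
}

// Stand-in reporter: collects entries in memory so the sketch runs anywhere.
var logged = [];
function report(entry) {
  logged.push(entry);
}

var risky = withErrorLogging("risky", function (x) {
  if (x < 0) throw new Error("negative input");
  return x * 2;
}, report);

var ok = risky(4);      // normal call: the result passes through
var failed = risky(-1); // throws inside: the error is reported, not rethrown
```

On a real page the reporter would need to send the entry to the server, for instance by requesting a logging URL with the message in the query string; that URL and its format would be up to the site, so they are left out here.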

Scraping RSS feeds using XPath

If a site doesn't have an RSS feed, your simplest option is to use Page2Rss, which gives a feed of what's changed on a page. My needs, sometimes, are a bit more specific. For example, I want to track new movies on the IMDb Top 250. They don't offer a feed. I don't want to track all the other junk on that page. Just the top 250. There's a standard called XPath. It can be used to search in an HTML document in a pretty straightforward way. Here are some examples: ...

Website load distribution using Javascript

My music search engine shows a list of songs as you type – sort of like Google’s autosuggest feature. I load my entire list of songs upfront for this to work. Though it’s compressed to load fast, each time you load the page, it downloads about 500KB worth of song titles. My allotted bandwidth on my hosting service is 3GB per month. To ensure I don’t exceed it, I uploaded the songs list to an alternate free server: Freehostia. This keeps my load down. If I exceed Freehostia’s limit, my main site won’t be affected – just the songs. I also uploaded half of them to Google Pages, to be safe. ...
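
One way to split downloads across several hosts from the browser is to pick a mirror at random, in proportion to a weight per host. The sketch below is an assumption about how that could be done, not the method from the original post: pickMirror, the weights and the example.com URLs are all placeholders.

```javascript
// Hypothetical sketch (NOT the author's code): choose one mirror at random,
// with probability proportional to its weight. rand is injectable so the
// choice can be made deterministic when testing; it defaults to Math.random.
function pickMirror(mirrors, rand) {
  rand = rand || Math.random;
  var total = 0;
  mirrors.forEach(function (m) { total += m.weight; });
  var r = rand() * total; // a point somewhere in [0, total)
  for (var i = 0; i < mirrors.length; i++) {
    r -= mirrors[i].weight;
    if (r < 0) return mirrors[i].url;
  }
  // Guard against floating-point edge cases: fall back to the last mirror.
  return mirrors[mirrors.length - 1].url;
}

// Placeholder hosts; the real ones would be wherever the song list is uploaded.
var mirrors = [
  { url: "http://mirror-a.example.com/songs.js", weight: 1 },
  { url: "http://mirror-b.example.com/songs.js", weight: 1 }
];

var first = pickMirror(mirrors, function () { return 0.25; });
var second = pickMirror(mirrors, function () { return 0.75; });
```

With equal weights each visitor has an even chance of hitting either host, so each one carries roughly half the traffic; changing the weights shifts the share.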