I read about Google AppEngine early this morning, and applied for an invite. Google's issuing beta invites to the first 10,000 users. I was pretty convinced I wasn't among those, but it turns out I was lucky.
AppEngine lets you write web apps that Google hosts. People have been highlighting that it gives you access to the Google File System and BigTable for the first time. But to me, that isn't a big deal. (I'm not too worried about reliability, and MySQL / flat files work perfectly well for me as a data store.)
What's more interesting is that, unlike Amazon's EC2 and S3, this is free up to a certain quota. And you get a fair bit of processing power and bandwidth for free. One of the reasons I've held back on creating some apps was simply that they would take away too much bandwidth / CPU cycles from my site. (I've had this problem before.) Google's quota is 10 GB of bandwidth per day (which is about 30 times what my site uses). And this is on Google's incredibly fast servers. It also offers 200 million megacycles a day. That's like a dedicated 2.3 GHz processor -- better, because this is the average capacity, not peak capacity. The only restriction that really worries me is that only 3 apps are allowed per developer.
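For anyone who wants to check the 2.3 GHz figure, here's the arithmetic as a small Python sketch. It only assumes that the quota is spread evenly over a 24-hour day; the variable names are mine.

```python
# Convert the daily CPU quota into an equivalent sustained clock speed.
MEGACYCLES_PER_DAY = 200_000_000   # quota figure quoted above
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400 seconds

cycles_per_day = MEGACYCLES_PER_DAY * 1_000_000   # mega = 10^6
average_hz = cycles_per_day / SECONDS_PER_DAY     # cycles per second
average_ghz = average_hz / 1e9

print(f"{average_ghz:.2f} GHz")   # prints "2.31 GHz"
```

So the quota works out to a processor running at roughly 2.3 GHz around the clock.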
So I decided to take a shot at publishing some code I'd kept in reserve for a long time. You may remember my statistical analysis of Calvin & Hobbes. For this, I'd created a script in Perl that could generate statistically improbable phrases (SIPs) for any text. This is based on a (somewhat limited) 23 MB corpus of ebooks that I had. I'd wanted to put that up on my website, but ...
AppEngine only uses Python. So the first task was to get Python, and then to learn Python. The only saving grace was that I was just cutting-and-pasting most of the time. Google wasn't helping:
Anyway, the site is up. You can view it at sip.s-anand.net for now. Just type a URL, and it'll tell you the improbable words in that site.
Technical notes
I realise that these are statistically improbable words, not phrases. I'll get to the phrases in a while.
The logic is simple:
The source code is here.
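If you just want a feel for the idea without reading the source, here's a rough Python sketch of one way to find statistically improbable words. This is not the code running on the site: the scoring (a smoothed log-ratio of a word's frequency in the text to its frequency in a reference corpus) and the names `tokenize` and `improbable_words` are my assumptions for illustration, and the tiny "corpus" is made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def improbable_words(text, corpus_counts, corpus_total, top=10, min_count=2):
    """Rank words that are far more frequent in `text` than in the corpus.

    Hypothetical scoring: log(p_text / p_corpus), with add-one smoothing
    so that words missing from the corpus don't cause a division by zero.
    """
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    vocab = len(corpus_counts) + 1  # +1 bucket for unseen words
    scores = {}
    for word, n in counts.items():
        if n < min_count:  # skip words too rare to judge
            continue
        p_text = n / total
        p_corpus = (corpus_counts.get(word, 0) + 1) / (corpus_total + vocab)
        scores[word] = math.log(p_text / p_corpus)
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy reference corpus and text, purely for illustration.
corpus = Counter(tokenize("the cat sat on the mat and the dog sat on the rug"))
corpus_total = sum(corpus.values())
text = "the tiger and the boy rode the wagon and the tiger pounced"

result = improbable_words(text, corpus, corpus_total, top=3)
print(result)  # ['tiger', 'and', 'the']
```

"tiger" comes out on top because it shows up twice in the text and never in the corpus, while "the" is common in both. A real corpus would need to be far larger than this for the ranking to mean much.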
Update: 12-Apr-2008. I've added some interactivity. You can play with the contrast and font size, and filter out common or infrequent words.
Update: 22-Apr-2008. Added concordance. You can click on a word and see the context in which it appears.