Tools

Dilbert search engine

Wouldn’t it be cool to be able to search through the Dilbert archives using text?

This used to be possible at Dilbert.com some years ago, as a paid service. In late 2003, I needed to find some Dilbert strips for a client, so I’d subscribed for a year. I could then search for the quotes (I happened to be looking for "outsourcing", so you can guess the context).

But I can’t seem to find the feature any more, even as a paid service. The site looks a lot better, of course. But I can’t find strips.

Well, why not type them out? After all, I’d done that with Calvin and Hobbes.

This would be a much larger exercise, though, and I’m hoping to get your help. I’ve set up a site at dilbert-search.appspot.com. You can type in a random comic, starting from 2000. These will be made searchable on my Dilbert page. You can export the data and use it yourself, of course.

When typing in Calvin and Hobbes, I did have a few volunteers willing to pitch in, but collaboration tools weren’t easy to set up, and I ended up typing the whole thing myself. This time, I’d be delighted if even 10 people typed in just a strip each.


So, here’s my request, to all you Dilbert fans.

  1. Please go to dilbert-search.appspot.com
  2. Log in using your Google account and type in as many strips as you like
  3. Bookmark it for the future, whenever you’re bored

As I said, the data is readily exportable from the page, so if you’re looking to do cool mash-ups with it, great! And if you want the data exported in other formats, please let me know.

Incidentally, I created the site using Google AppEngine. The source code is at dilbert-search.googlecode.com.

Animated charts in Excel

Watch Hans Rosling’s TED Talks on debunking third world myths and new insights on poverty and ask yourself: could I do this with my own data?

Yes. Google has a gadget called MotionChart that lets you do this.

Now, you could put this up on your web page, but that’s not quite useful when presenting to a client. (It is shocking, but there are many practical problems getting an Internet connection at a client site. The room doesn’t have a connection. The cable isn’t long enough. You can’t access the LAN. Their proxy requires authentication. The connection is too slow. Whatever.)

So you need this in Excel. Let me explain a variant of the technique I described earlier.

Let’s start by creating a simple bubble chart.

For each item in a bubble chart, you need 3 pieces of data: the X-axis, Y-axis and size. This graph shows three items A, B and C in one year: 2001. To animate this, you need data for more years, so let’s create that.

The first 3 rows contain the same data as before, except that I’ve added a "Year" column and a "Key" column (which is just a concatenation of the Year and the Item). The data now goes on for many more years.

Now we need to create a scroll bar that can be used to change the year. So add a scroll bar below the bubble chart…

… and right-click the scroll bar and go to Format Control. Now, set the cell link to some cell ($H$1 in this case). If you move the scroll bar, the value in that cell will change.

All you need to do now is change the source data for the chart based on the year. From the table on the left, VLOOKUP the year + item (the "Key"), and put the result into the table on the right. When the year in cell H1 changes, the data updates itself. So now, as you move the scroll bar, cell H1 changes, then so does the data and hence the graph.
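If the spreadsheet mechanics are hard to picture, here is a rough Python sketch of the same lookup. It is only an illustration: the numbers are invented, and the names `ROWS`, `TABLE` and `chart_data` are hypothetical, not anything in the Excel file.

```python
# Hypothetical rows mirroring the sheet on the left: (year, item, x, y, size).
ROWS = [
    (2001, "A", 1.0, 2.0, 10.0),
    (2001, "B", 2.0, 1.0, 20.0),
    (2001, "C", 3.0, 3.0, 15.0),
    (2002, "A", 1.5, 2.5, 12.0),
    (2002, "B", 2.5, 1.5, 18.0),
    (2002, "C", 3.5, 2.0, 16.0),
]

# The "Key" column: Year concatenated with Item, e.g. "2001A".
TABLE = {f"{year}{item}": (x, y, size) for year, item, x, y, size in ROWS}


def chart_data(year, items=("A", "B", "C")):
    """Return {item: (x, y, size)} for one year, like the table on the right."""
    return {item: TABLE[f"{year}{item}"] for item in items}


# Moving the scroll bar changes H1; here that is just the `year` argument.
for year in (2001, 2002):
    print(year, chart_data(year))
```

Each pass through the loop plays the role of one scroll bar position: the key changes, the looked-up values change, and the chart would redraw from them.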

This is what the animation looks like.

And here’s the Excel file.

Statistically improbable phrases on Google AppEngine

I read about Google AppEngine early this morning, and applied for an invite. Google’s issuing beta invites to the first 10,000 users. I was pretty convinced I wasn’t among those, but turns out I was lucky.

AppEngine lets you write web apps that Google hosts. People have been highlighting that it gives you access to the Google File System and BigTable for the first time. But to me, that isn’t a big deal. (I’m not too worried about reliability, and MySQL / flat files work perfectly well for me as a data store.)

What’s more interesting is that, unlike Amazon’s EC2 and S3, this is free up to a certain quota. And you get a fair bit of processing power and bandwidth for free. One of the reasons I’ve held back on creating some apps was simply that it would take away too much bandwidth / CPU cycles from my site. (I’ve had this problem before.) Google’s quota is 10 GB of bandwidth per day (which is about 30 times what my site uses). And this is on Google’s incredibly fast servers. It also offers 200 million megacycles a day. That’s like a dedicated 2.3 GHz processor (200 million megacycles = 200,000 GHz x 1 second ~ 2.3 GHz x 86,400 seconds/day). In fact it’s better, because this is the average capacity, not peak capacity. The only restriction that really worries me is that only 3 apps are allowed per developer.
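To double-check the processor arithmetic, here is a quick calculation in Python. It assumes one megacycle means a million CPU cycles; the constant names are just labels for this sketch.

```python
MEGACYCLES_PER_DAY = 200_000_000    # the daily quota quoted above
CYCLES_PER_MEGACYCLE = 1_000_000    # assumption: 1 megacycle = 10^6 cycles
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400

cycles_per_day = MEGACYCLES_PER_DAY * CYCLES_PER_MEGACYCLE
average_ghz = cycles_per_day / SECONDS_PER_DAY / 1e9

print(f"{average_ghz:.2f} GHz")  # 2.31 GHz
```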

So I decided to take a shot at publishing some code I’d kept in reserve for a long time. You may remember my statistical analysis of Calvin & Hobbes. For this, I’d created a script in Perl that could generate Statistically Improbable Phrases (SIPs) for any text. This is based on a (somewhat limited) 23MB corpus of ebooks that I had. I’d wanted to put that up on my website, but …

AppEngine only uses Python. So the first task was to get Python, and then to learn Python. The only saving grace was that I was just cutting-and-pasting most of the time. Google wasn’t helping:

Google AppEngine Over Quota Error

Anyway, the site is up. You can view it at sip.s-anand.net for now. Just type a URL, and it’ll tell you the improbable words in that site.

Visit sip.s-anand.net

Technical notes

I realise that these are statistically improbable words, not phrases. I’ll get to the phrases in a while.

The logic is simple:

  • Get the frequency of words in a corpus. I pre-generated this file. It has over 100,000 words.
  • Get the URL as text. Rather than muck around with Python, I decided to use the W3 html2txt service.
  • Convert the text to words. Splitting text into words is tricky. For now, I’m simply assuming that any group of letters is a word, and anything that’s not a letter is a word delimiter.
  • Find the relative frequency (improbability) of words. This is the frequency in the URL divided by the frequency in the corpus, normalised (i.e. scale it so that the maximum value is 1.0).
  • Create a tag cloud. I use the word frequency as the size and the improbability as the colour. You need a bit of mathematical jugglery to get the pattern right. Right now, I’m taking the 6th root of the improbability and the logarithm of the frequency to get a reasonably smooth tag cloud.
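The steps above can be sketched roughly as follows. This is not the code running on the site. The function names, the pixel range, the use of log(1 + count) and the treatment of words missing from the corpus are all assumptions made for the sketch, and fetching the page text is left out.

```python
import math
import re
from collections import Counter


def words(text):
    """Any run of (ASCII) letters is a word; everything else is a delimiter."""
    return re.findall(r"[a-z]+", text.lower())


def improbability(page_text, corpus_freq, unseen=1):
    """Return {word: (count, score)}, with score scaled so the maximum is 1.0.

    corpus_freq maps word -> count in the reference corpus. Words missing
    from the corpus are given the count `unseen` (an assumption, to avoid
    dividing by zero).
    """
    counts = Counter(words(page_text))
    raw = {w: c / corpus_freq.get(w, unseen) for w, c in counts.items()}
    top = max(raw.values(), default=1.0)
    return {w: (counts[w], raw[w] / top) for w in raw}


def tag_cloud(scores, min_px=10, max_px=40):
    """Map each word to (font size in px, colour weight between 0 and 1)."""
    if not scores:
        return {}
    max_log = max(math.log(1 + count) for count, _ in scores.values())
    cloud = {}
    for word, (count, score) in scores.items():
        # Size follows the logarithm of the frequency on the page.
        size = min_px + (max_px - min_px) * math.log(1 + count) / max_log
        # Colour follows the 6th root of the improbability.
        cloud[word] = (round(size), score ** (1 / 6))
    return cloud


# Tiny made-up corpus, for illustration only.
corpus = {"the": 1000, "cat": 10, "sat": 5}
scores = improbability("The cat sat. The quokka sat!", corpus)
for word, (size, colour) in sorted(tag_cloud(scores).items()):
    print(word, size, round(colour, 2))
```

The 6th root stretches small improbability values towards 1, so that moderately unusual words still get some colour instead of everything but the top word fading out.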

The source code is at statistically-improbable-phrases.googlecode.com.

Update: 12-Apr-2008. I’ve added some interactivity. You can play with the contrast and font size, and filter out common or infrequent words.

Update: 22-Apr-2008. Added concordance. You can click on a word and see the context in which it appears.

Google search in Tamil

When I wrote my Tamil song lyrics quizzes, I had two problems:

  1. I can’t write in Tamil (not on paper, nor on a computer)
  2. I can’t spell right in Tamil (ந vs ன, ர vs ற)

I overcame the first using a Tamil transliterator. I write in English, and you see it in Tamil.

The problem of ந vs ன was simple. ந occurs as the first letter of a word, and just before த. Nowhere else. (Is this always true?)

But ர vs ற can’t be solved except through experience, and I’m short of that. So, rather than bother my family with every quiz, I used the wisdom of crowds. I googled both spellings of the word. The correct spelling has more Google hits than the incorrect one.
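In code, the trick boils down to picking the candidate with the larger count. Here is a minimal sketch, assuming you already have some way of getting a hit count for a spelling. `pick_spelling` and `fake_hits` are hypothetical names, and the counts are made up rather than real search results.

```python
def pick_spelling(candidates, hit_count):
    """Return the candidate spelling with the most search hits.

    hit_count is any callable mapping a spelling to a number of results.
    How that number is obtained (a search API, a manual lookup) is left open.
    """
    return max(candidates, key=hit_count)


# Made-up counts, keyed by the typed (English) form: r is ர and R is ற.
fake_hits = {"paravai": 900, "paRavai": 120000}
print(pick_spelling(["paravai", "paRavai"], fake_hits.get))
```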

I did this so often, I made a Google gadget out of it.

Just type the word in English, click ‘Search’, and my gadget will search in Tamil. It’s amazing how much stuff there is in Tamil on the Web, from song lyrics to texts (thirukkuraL, for example).

You can add this gadget to:

  • your desktop (in the Search Gadgets box, type “http://www.s-anand.net/a/tamilsearchgadget.xml”)
  • your website or blog (click here for the code)
  • Google Reader. Add to Google

Here’s the transliteration table:

Tamil    English
அ    a
ஆ    A or aa
இ    i
ஈ    I or ee
உ    u
ஊ    U or oo
எ    e
ஏ    E
ஐ    ai
ஒ    o
ஓ    O
ஔ    au
க    k or g
ங    n
ச    ch or s
ஜ    j
ஞ    n
ட    t or d
ண    N
த    th or dh
ந    n
ப    p or b
ம    m
ய    y
ர    r
ல    l
வ    v
ழ    zh
ள    L
ற    R
ஷ    sh
ஸ    S
ஹ    h

Making a Tamil transliterator

I’ve built a simple Tamil transliterator. You can type in words in English and it will spell them out in Tamil. You can copy-paste the Tamil above into Microsoft Word, etc.

You may need to turn on Tamil scripts to see the Tamil fonts above. If you have Windows 98, it may not work well. If you’ve visited this page recently, you will need to refresh this page as well (press F5).

Browse through my Javascript to see how it works. Feel free to reuse.

I’ve also made a Google Gadget that searches Google in Tamil using this tool.

Here’s what to type:

Tamil    English
அ    a
ஆ    A or aa
இ    i
ஈ    I or ee
உ    u
ஊ    U or oo
எ    e
ஏ    E
ஐ    ai
ஒ    o
ஓ    O
ஔ    au
க    k or g
ங    n
ச    ch or s
ஜ    j
ஞ    n
ட    t or d
ண    N
த    th or dh
ந    n
ப    p or b
ம    m
ய    y
ர    r
ல    l
வ    v
ழ    zh
ள    L
ற    R
ஷ    sh
ஸ    S
ஹ    h
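For the curious, here is a rough idea of how a table like this can drive a transliterator, sketched in Python rather than the Javascript the page uses. It is not a port of that script. The longest-match-first scanning, the vowel-sign handling and especially the context rule for "n" (`pick_n`) are assumptions made for this sketch.

```python
# Typed form -> (independent vowel letter, dependent vowel sign).
VOWELS = {
    "a": ("அ", ""), "A": ("ஆ", "\u0bbe"), "aa": ("ஆ", "\u0bbe"),
    "i": ("இ", "\u0bbf"), "I": ("ஈ", "\u0bc0"), "ee": ("ஈ", "\u0bc0"),
    "u": ("உ", "\u0bc1"), "U": ("ஊ", "\u0bc2"), "oo": ("ஊ", "\u0bc2"),
    "e": ("எ", "\u0bc6"), "E": ("ஏ", "\u0bc7"), "ai": ("ஐ", "\u0bc8"),
    "o": ("ஒ", "\u0bca"), "O": ("ஓ", "\u0bcb"), "au": ("ஔ", "\u0bcc"),
}
CONSONANTS = {
    "k": "க", "g": "க", "ch": "ச", "s": "ச", "j": "ஜ", "t": "ட", "d": "ட",
    "N": "ண", "th": "த", "dh": "த", "p": "ப", "b": "ப", "m": "ம", "y": "ய",
    "r": "ர", "l": "ல", "v": "வ", "zh": "ழ", "L": "ள", "R": "ற",
    "sh": "ஷ", "S": "ஸ", "h": "ஹ",
    "n": None,  # "n" is resolved by context in pick_n()
}
PULLI = "\u0bcd"  # the dot that removes a consonant's built-in "a"


def match(word, i, table):
    """Longest key of `table` (two characters, then one) found at word[i:]."""
    for length in (2, 1):
        if word[i:i + length] in table:
            return word[i:i + length]
    return None


def pick_n(word, i):
    """Choose a letter for "n" at position i (an assumed context rule)."""
    rest = word[i + 1:]
    if rest.startswith(("k", "g")):
        return "ங"
    if rest.startswith(("ch", "j")):
        return "ஞ"
    if i == 0 or rest.startswith(("th", "dh")):
        return "ந"  # first letter of a word, or just before த
    return "ன"


def transliterate_word(word):
    out, i = [], 0
    while i < len(word):
        cons = match(word, i, CONSONANTS)
        if cons:
            letter = CONSONANTS[cons] or pick_n(word, i)
            i += len(cons)
            vowel = match(word, i, VOWELS)
            if vowel:  # consonant + vowel: attach the vowel sign
                out.append(letter + VOWELS[vowel][1])
                i += len(vowel)
            else:      # bare consonant: add the pulli
                out.append(letter + PULLI)
            continue
        vowel = match(word, i, VOWELS)
        if vowel:      # vowel with no consonant before it
            out.append(VOWELS[vowel][0])
            i += len(vowel)
        else:          # anything else passes through unchanged
            out.append(word[i])
            i += 1
    return "".join(out)


def transliterate(text):
    return " ".join(transliterate_word(w) for w in text.split())


print(transliterate("thamizh nanRi"))  # தமிழ் நன்றி
```

Matching two-character keys before one-character keys is what lets "th", "zh" and "ai" win over "t", "z" and "a".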

I also have a gadget that lets you search in Tamil.