Year: 2007

RSS feeds in Excel

The technique of Web lookups in Excel I described yesterday is very versatile. I will be running through some of its practical uses over the next few days.

To generalise things beyond just getting the Amazon price, I created a user-defined function called XPath2. It takes two parameters:

URL of the XML feed to read
Search string listing the XPaths to extract (separated by spaces)

This function can be used to extract information out of any XML file on the Web and return it as a table. For example, if you wanted to watch the Top 10 movies on the IMDb Top 250 and were looking for torrents, an RSS feed is available from mininova. The URL http://www.mininova.org/rss/movie_name/4 returns an RSS file matching all movies named “movie_name”. From this, we need to extract the <item><title> and <item><link> elements. That’s represented by “//item title link” in my search string.

Mininova RSS feed in Excel

The formula becomes XPath2( "http://www.mininova.org/rss/"&A2&"/4", "//item title link"). The result is a 2-dimensional array: rows are the individual items, and the columns are title and link. Pulling it all together, you get the sheet above.
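Outside Excel, a minimal Python sketch of the same idea might look like this. The function name and implementation are mine, not the actual UDF’s; it assumes the first token of the search string picks the rows and the remaining tokens pick the columns, as in the example above.

# A rough Python equivalent of the XPath2() idea (illustrative only).
import urllib.request
import xml.etree.ElementTree as ET

def xpath2(url, search):
    """First token picks the rows; remaining tokens pick the columns."""
    base, *cols = search.split()
    if base.startswith('//'):            # ElementTree wants './/item'
        base = '.' + base
    root = ET.parse(urllib.request.urlopen(url)).getroot()
    return [[item.findtext(col, '') for col in cols]
            for item in root.iterfind(base)]

# One row per <item>, with title and link columns:
rows = xpath2('http://www.mininova.org/rss/movie_name/4', '//item title link')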

All of this could be done using a command-line program. Excel has one huge advantage, though: it has one of the most powerful user interfaces around. Increasingly, I rely on just two user interfaces for almost any task. One is the browser, and the other is Excel. With Excel, I can keep a sheet with my movie wishlist (which changes often) and add a check to see if a torrent exists. Every time I add a bunch of movies to the wishlist, it’s just a matter of copying the formula down. No need to visit a torrent search engine and type each movie in, one by one.

Another example: someone suggests 10 movies to watch, and I’d rather watch the ones with a higher IMDb rating. Again, enter the Web lookup. Type in the movie names, use a function like the one sketched below to look up each rating on IMDb, and sort by the rating.
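A hedged sketch of that lookup in Python. The regex is a guess at IMDb’s page markup (which changes over time), so treat it as illustrative only:

# Hypothetical rating lookup: fetch an IMDb title page and grab the
# first "x.x/10" figure. The pattern is a guess at the page markup.
import re
import urllib.request

def imdb_rating(url):
    html = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
    match = re.search(r'(\d\.\d)\s*/\s*10', html)
    return float(match.group(1)) if match else None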

The possibilities are endless.

Science Fiction awards

Now that I’m well on my way to watching the Top 250 movies on IMDb, I’ve slowly turned my attention to fiction. My interest is mainly in the Fantasy & Science Fiction area. Unfortunately, I don’t know of any list like the IMDb Top 250, but there are a few awards that could take the place of the Oscars for books. That’s probably a good place to start.

The most popular awards in Science Fiction are the Hugo and Nebula awards, followed by the Philip K. Dick, John W. Campbell and Arthur C. Clarke awards, among others. I collated a list of all the awards (from LocusMag) into the spreadsheet below.

Very few of these books have won multiple awards. None has won more than three of the awards on this list, and only five have won three:

  • Rendezvous with Rama by Arthur C. Clarke
  • American Gods by Neil Gaiman
  • Neuromancer by William Gibson
  • Gateway by Frederik Pohl
  • Forever Peace by Joe Haldeman

Over 20 books have won two of these awards.

  • Dune by Frank Herbert
  • The Left Hand of Darkness by Ursula K. Le Guin
  • The Gods Themselves by Isaac Asimov
  • Ringworld by Larry Niven
  • Timescape by Gregory Benford
  • Dreamsnake by Vonda McIntyre
  • The Child Garden by Geoff Ryman
  • Gloriana by Michael Moorcock
  • Fairyland by Paul J. McAuley
  • The Forever War by Joe Haldeman
  • Jonathan Strange & Mr Norrell by Susanna Clarke
  • Doomsday Book by Connie Willis
  • Boy's Life by Robert McCammon
  • Paladin of Souls by Lois McMaster Bujold
  • A Deepness in the Sky by Vernor Vinge
  • The Fountains of Paradise by Arthur C. Clarke
  • Speaker for the Dead by Orson Scott Card
  • The Dispossessed by Ursula K. Le Guin
  • The Time Ships by Stephen Baxter
  • Startide Rising by David Brin
  • Ender's Game by Orson Scott Card

I haven’t read most of these books. The ones I have read are:

  1. The Gods Themselves by Isaac Asimov (which I loved)
  2. Ender’s Game by Orson Scott Card (which I loved)
  3. Speaker for the Dead by Orson Scott Card (which was OK)
  4. Rendezvous with Rama by Arthur C. Clarke (which I didn’t like)
  5. Dune by Frank Herbert (which I didn’t like)

That’s about a 50% hit rate, so I guess this list doesn’t work all that well for me. Or at least, my taste doesn’t match the award juries’. But maybe you will find something interesting to read…

Tamil spelling corrector

The Internet has a lot of Tamil song lyrics transliterated into English. Finding them is not easy, though. There are two problems. The lyrics are fragmented: there’s no one site to search them all. And Google doesn’t help: it doesn’t know that alaipaayudhe, alaipaayuthe and alaipayuthey are the same word.

This is similar to the problem I faced with Tamil audio. The solution, as before, is to make an index and provide a search interface that is tolerant of English spellings of Tamil words. But I want to go a step further: is it possible to display these lyrics in Tamil?

My Tamil Transliterator does a lousy job of this. Though it’s tolerant of mistakes, it’s ignorant of spelling and grammar. So,

kanda nal muthalai kathal peruguthadi

… becomes…

kanda nal muthalai kathal peruguthadi

… when in fact we want…

kanda naaL muthalaay kaathal peruguthadi

(If you’re viewing this on an RSS reader, check my post to see what I mean.)

I need an automated Tamil spelling corrector. Having read Peter Norvig’s “How to Write a Spelling Corrector” (and actually understood it), I gave spelling correction in Tamil a shot.

Norvig’s approach, in simple terms, is this (a sketch in code follows the list):

  1. Get a dictionary
  2. Tweak the word you want to check (add a letter, delete one, swap 2 letters, etc.)
  3. Pick all tweaks that get you to a valid word in the dictionary
  4. Choose the most likely correction (most common correct word, adjusted for the probability of that mistake happening)
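Here’s a condensed Python sketch of those four steps, adapted from Norvig’s article. The corpus file name is a placeholder, and step 4 is simplified to a plain frequency lookup (the error-probability weighting comes up again below):

# Condensed from Norvig's spelling corrector; 'madurai.txt' is a
# placeholder for whatever corpus the dictionary is built from.
import re
from collections import Counter

WORDS = Counter(re.findall(r'\w+', open('madurai.txt', encoding='utf-8').read().lower()))

def edits1(word):
    """All strings one tweak away: deletes, swaps, replaces, inserts."""
    # a-z works for Latin-transliterated Tamil; real Tamil script
    # would need its own alphabet here.
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Steps 3 and (simplified) 4: valid tweaks, ranked by frequency."""
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=lambda w: WORDS[w])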

Making a dictionary is easy. I just need lots of Tamil literature, from which I can pick out words. For now, I’m just using the texts in Project Madurai.

Tweaking the word to check is easy. Norvig’s article has a working code example.

Picking valid tweaks is easy. Just check against the dictionary.

The tough part is choosing the likely correction. For each valid word, I need the probability of having made this particular error.

Let’s take an example. Suppose I’ve typed kathal. Valid tweaks of this word include kal, kol, kadal, kanal, and kaadhal. For each of these, I need to know how often it occurs in the dictionary, and the probability that I typed kathal when I really meant it. This is what such a calculation would look like:

Tweak     Frequency   Probability of typing kathal   Product
kal           1                  0.04                  0.04
kol           4                  0.02                  0.08
kadal        10                  0.10                  1.00
kanal         1                  0.01                  0.01
kaadhal       6                  0.25                  1.50

Once we have this, we can see that kaadhal is the right correction: it has the maximum value (1.50) in the last column, where we multiply the frequency by the probability.
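In code, that decision is a one-line argmax over the candidates, using the numbers from the table above:

# Frequencies and error probabilities copied from the table above.
freq = {'kal': 1, 'kol': 4, 'kadal': 10, 'kanal': 1, 'kaadhal': 6}
p_err = {'kal': 0.04, 'kol': 0.02, 'kadal': 0.10, 'kanal': 0.01, 'kaadhal': 0.25}

best = max(freq, key=lambda w: freq[w] * p_err[w])
print(best)  # kaadhal, with score 6 * 0.25 = 1.50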

(You probably realise how small my dictionary is, looking at the frequencies. Well, that’s how big Project Madurai is. But increasing the size of a dictionary is a trivial problem.)

Anyway, getting the frequency is easy. How do I get the probabilities, though? That’s what I’m working on right now.

Ivory sculptures

Ivory sculptures at the Guangzhou Chen Family Temple. The first two, especially, have spheres within spheres within spheres… which looks impossible to carve.

Intricately carved ivory at the Chen Family Temple

Splitting a sentence into words

I often need to extract words out of sentences. It’s one of the things I used to build the Statistically Improbable Phrases for Calvin and Hobbes. But splitting a sentence into words isn’t as easy as you think.

Think about it. What is a word?

Something that has spaces around it? OK, let’s start with the simplest way to get words: split by spaces. Consider this piece:

"I'd look at McDonald's," he said.
"They sell over 3,000,000 burgers a day -- at $1.50 each."
High-fat foods were the rage. For e.g., margins in fries
were over 50%... and (except for R&M & Dyana [sic]) everyone
was at ~30% net margin; growing at 25% too!

Splitting this by spaces (treating newlines, tabs, etc. as spaces too), we get the following:

"I'd
look
at
McDonald's,"
...

Now, some of these, like “I’d”, are words. But McDonald’s," isn’t: there’s a comma and a double-quote at the end. Clearly we need to remove the punctuation as well. But if we do that, I’d becomes Id. So we need to be careful about which punctuation we remove. Let’s take a closer look.

The following punctuation marks are clear word separators: spaces, exclamation marks, question marks, semicolons, brackets of any kind, and double-quotes (not single-quotes). No word has these in the middle. If we use these as separators, our list of words is better, but we still have some words with punctuation:

McDonald's,
e.g.,
High-fat
R&M
...

The issue is that the remaining punctuation marks are ambiguous word separators: comma, hyphen, single-quote, ampersand, period and slash. These usually separate words, but there are exceptions:

  • Comma: not inside numbers (3,000,000)
  • Hyphen: not in hyphenated words (High-fat)
  • Single-quote: not in possessives (McDonald’s) or contractions (I’d)
  • Ampersand: not in abbreviations (R&M)
  • Period: not in abbreviations (O.K.) or URLs (www.s-anand.net)
  • Slash: not in fractions (3/4) or URLs (google.com/search)

Colon is ambiguous too. In normal English usage, it would be a separator. But URLs like http://www.s-anand.net/ use these characters, and it doesn’t make sense to separate them.

So here are my current rules for splitting a sentence into words. (It’s a Perl regular expression. Don’t worry. Cooper’s Law: If you do not understand a particular word in a piece of technical writing, ignore it. The piece will make perfect sense without it.)

# Split by clear word separators
/       [\s \! \? \;\(\)\[\]\{\}\<\> " ]
 
# ... by COMMA, unless it has numbers on both sides: 3,000,000
|       (?<=\D) ,
|       , (?=\D)
 
# ... by FULL-STOP, SINGLE-QUOTE, HYPHEN, AMPERSAND, unless it has a letter on both sides
|       (?<=\W) [\.\-\&]
|       [\.\-\&] (?=\W)
 
# ... by QUOTE, unless it follows a letter (e.g. McDonald's, Holmes')
|       (?<=\W) [']
 
# ... by SLASH, if it has spaces on at least one side. (URLs shouldn't be split)
|       \s \/
|       \/ \s
 
# ... by COLON, unless it's a URL or a time (11:30am for e.g.)
|       \:(?!\/\/|\d)
/x;
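For non-Perl readers, here is my translation of the same rules into a Python regular expression; the rules are the ones listed above, the translation is mine:

# The word-splitting rules above, translated to Python's re.VERBOSE.
import re

WORD_SEP = re.compile(r'''
    [\s!?;()\[\]{}<>"]              # clear word separators
  | (?<=\D),   | ,(?=\D)            # comma, unless digits on both sides
  | (?<=\W)[.\-&] | [.\-&](?=\W)    # . - &, unless letters on both sides
  | (?<=\W)'                        # quote, unless it follows a letter
  | \s/ | /\s                       # slash, with a space on either side
  | :(?!//|\d)                      # colon, unless a URL or a time
''', re.VERBOSE)

text = '''"I'd look at McDonald's," he said.\n'''
words = [w for w in WORD_SEP.split(text) if w]
# ["I'd", 'look', 'at', "McDonald's", 'he', 'said']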

This doesn’t even scratch the surface, though. Some open questions:

  • Lots of files split words into two at the end of a line. How do we handle that?
  • How do we handle incorrect punctuation? For instance, if someone types “done.Yet,” without leaving a space after the full-stop, I’ll think it’s an abbreviation.
  • What about other separators? Like the ± symbol or the £ symbol for instance.
  • What about other languages?!

And you thought it was easy!

HTTP download speeds

In some of the Web projects I’m working on, I have a choice of many small files vs few big files to download.

There are conflicting arguments. I’ve read that many small files are better, because you can choose to load only the required files, and they’ll be cached across the site. (These are typically CSS or Javascript files.) On the other hand, a single large file takes less time to download than the sum of many small files, because there’s less latency. (Latency matters more than bandwidth these days.)

I ran some tests, and the answer is fairly clear-cut. The graph below shows the average time taken to download files of sizes from 1KB to 100KB.

Time to download a file of size ranging from 1KB - 100KB

The X-axis is the size of the file. The Y-axis is the number of milliseconds taken to download the file, averaged over 10 attempts on my home computer. (I did the same thing at work and at a client site. The results are similar.)
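The test is easy to reproduce. Here’s a sketch of the kind of script that produces these numbers; the test URLs are placeholders, but the method (timed fetches per size, averaged over 10 attempts) is as described:

# Time each download several times and average, as described above.
# The URL pattern is a placeholder for wherever the test files live.
import time
import urllib.request

def avg_download_ms(url, attempts=10):
    total = 0.0
    for _ in range(attempts):
        start = time.time()
        urllib.request.urlopen(url).read()
        total += time.time() - start
    return 1000 * total / attempts

for kb in (1, 2, 5, 10, 20, 50, 100):
    print(kb, 'KB:', avg_download_ms('http://example.com/test-%dkb.bin' % kb))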

The key thing is that it’s not linear: larger files take less time per kilobyte downloaded. Specifically:

  • A file twice as big only takes 30% longer to load.
  • Breaking a file into two parts takes 54% longer to load.

These days, it looks like a few big files are better.

To give you an example: my home page had the following components:

             Size (KB)   Load time (ms)
HTML             25            680
CSS               4            340
Javascript        6            400
Total            35           1420

The total time for these 3 files is about 1.4 seconds. Instead, if I put them all in one file…

             Size (KB)   Load time (ms)
All-in-one       35            770

The combined file takes only 0.77 seconds — half the download time for split files. It’s a compelling argument to put all your CSS and Javascript (images, too, if possible) into a single page.

But what if people visit multiple pages on my site, and lose the benefit of caching? Not really a problem: the benefit of caching is small. With a single combined file, each HTML page takes 770 - 680 = 90 ms longer to load. But I no longer load the CSS and Javascript separately, which saves 740 ms. The breakeven is about 740 / 90 ≈ 8 page visits.
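The arithmetic, spelled out:

# Breakeven between one combined file and separate cached files,
# using the load times from the tables above.
extra_per_page = 770 - 680               # ms added to every HTML load
saved_per_visit = 340 + 400              # ms of separate CSS + JS fetches avoided
print(saved_per_visit / extra_per_page)  # ~8.2 page views to break even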

So, on average, if people visit more than 8 pages on my site, it’s worth breaking up the CSS and Javascript. But the average for my site is only 2 pages per person. (It’s a skewed distribution: most visitors, arriving from Google, visit just one page; a few, from bookmarks, visit several. On average, it’s just over 2.) I’d argue that unless you’re using an external Javascript/CSS library (prototype, etc.) or people view many pages each visit, you’re better off having a single HTML+Javascript+CSS file.

Sparklines

John Resig has written a Sparklines library. Here’s an example. I wrote earlier that HTTP download speeds are not linear, and that the curve flattens out. A linear line would look like this: the little red line here is a sparkline that’s based on real data, and John’s Javascript converts the data into a graph.

Sparklines were introduced by Edward Tufte.