Faster data crunching

I’ve been playing with big data lately.

The good part is, it’s easy to get interesting results. The data is so unwieldy that even average value calculations provoke a “Amazing! I didn’t know that,” response (No exaggeration. I heard this from two separate ~ $1bn businesses this month.)

The bad part is that calculating even that simple average is slow.

For example, take this 40MB file (380MB unzipped) and extract the first column.

The simplest Python script to get the first column looks like this:

for row in csv.reader(fileinput.input(), delimiter='\t'):
    if len(row) > 0: print row[0]

That took a good 3 minutes to execute on my laptop.

Since I’m used to UNIX data processing, I tried cut -f1. Weirdly, that’s worse. 5 minutes. Paradoxically, awk '{print $1}' only takes 17 seconds. That's about 12 times faster. Clearly the tool makes a big difference. And we always knew UNIX was fast.

But I also ran these on an Amazon EC2 server, and a Hostgator server. Here’re the results.

  python cut awk
My Dell E5400 3:04 (1x) 5:42 (0.5x) 0:17 (11x)
EC2 standard 0:33 (6x) 0:5.6 (33x) 0:16 (11x)
Hostgator 0:19 (10x) 0:2.5 (74x) 0:0.7 (265x)

What took 3 minutes with Python my Dell E5400 took less than a second on Hostgator’s server with awk. Over 250 times faster. (Not 250%. 250 times).

And it’s not just hardware. A good tool (awk) made things 11x faster on my machine. Good hardware (hostgator) made the same program 10x faster. But choosing the right combination can make things go faster than 11 x 10 = 110 times. Much faster.

There are a few of things I’m taking away from this.

  1. Good hardware can speed you up much as (or more than) choosing the right tool.
  2. Good hardware can be rented. From many places. Cheaply.
  3. Always test what’s fast. awk’s fastest on my machine and Hostgator, but not on EC2.