(I used to love Stan Kelly-Bootle's column in Unix World, so i thought i'd share an experience a little like the ones he used to write about. Hope some old-timers out there can get into it...)
The task i was working on involved taking a file containing a very large directory listing (about 158 MB) and determining the total size of all the files listed in it. The file's contents looked like this:
$ head -5 transaction.list
-rw-r--r-- 1 root root 6575 Aug 5 2009 file-7647833002.log
-rw-r--r-- 1 root root 8223 Aug 5 2009 file-8304157181.log
-rw-r--r-- 1 root root 6929 Aug 5 2009 file-7605687521.log
-rw-r--r-- 1 root root 6802 Aug 5 2009 file-8408844563.log
-rw-r--r-- 1 root root 6787 Aug 5 2009 file-8420786471.log
So to sum the sizes of the files, i thought i'd write a one-line awk script. But then i second-guessed myself. I thought: for a file this size, perl has to be faster, right? So i wrote a perl one-liner instead. When i first ran it, it took a lot longer than i expected, so i checked the time it took:
$ time perl -we 'my $sum = 0; while (<>) { my @F = split; $sum += $F[4]; } printf "%d\n", $sum; ' transaction.list
53951193376

real	0m8.062s
user	0m7.970s
sys	0m0.080s
This seemed a little excessive to me, so i went back and ran the awk script that i had originally intended to write, and it turned out to be more than five times faster:
$ time awk '{ SUM+=$5 } END {printf "%d\n", SUM}' transaction.list
53951193376

real	0m1.474s
user	0m1.390s
sys	0m0.040s
Then i thought, "obviously i'm just a hack and i don't know how to make perl sing". So here was the next cut:
$ time perl -we 'my $sum = 0; while (<>) { my ($size) = /\d+[^\d]+(\d+)/; $sum += $size; } printf "%d\n", $sum; ' transaction.list
53951193376

real	0m4.387s
user	0m4.300s
sys	0m0.070s
Nearly twice as fast as the first perl version, but still nearly 3 times slower than the awk version.
I couldn't be bothered optimising it any further, but i wondered: is there an inherent performance limitation in perl's split function, or is it just that the overhead in booting up the perl interpreter is higher?
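One way to tease those two effects apart (which i haven't done carefully) would be to time an empty interpreter run on its own, and separately to benchmark split against the regex capture on a single representative line using the core Benchmark module. A rough sketch, where the sample line is just copied from the listing above:

$ time perl -e '1'    # interpreter start-up alone, no input

$ perl -MBenchmark=cmpthese -we '
    my $line = "-rw-r--r-- 1 root root 6575 Aug 5 2009 file-7647833002.log";
    cmpthese(-3, {
        split => sub { my @F = split " ", $line; my $size = $F[4] },
        regex => sub { my ($size) = $line =~ /\d+[^\d]+(\d+)/ },
    });'

If the empty run comes back in a few hundredths of a second, start-up alone clearly can't account for much of the gap, and the cmpthese output should show how much of the rest is down to split itself.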
I ran these scripts on my laptop, a Lenovo ThinkPad X200s, with an Intel Core 2 Duo SL9400 CPU and 4 GB RAM, running Ubuntu Linux 10.04 (lucid) 64-bit. A few of my normal desktop apps were also running. I ran the scripts a few times each in succession to ensure that i was getting reasonably reliable results.
Any thoughts? How could i have written the perl version more efficiently?
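For reference, one variant i haven't timed is perl's autosplit mode, which moves the split into the -a switch and drops the explicit loop. Since it still calls split on every line, it may well land close to the first version, but it's at least less to type:

$ perl -lane '$sum += $F[4]; END { print $sum }' transaction.list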