An interesting performance difference between perl and awk

(I used to love Stan Kelly-Bootle's column in Unix World, so i thought i'd share an experience a little like the ones he used to write about.  Hope some old-timers out there can get into it...)

The task i was working on involved taking a file containing a very large directory listing (about 158 MB) and determining the total size of all the files listed in it.  The file's contents looked like this:

$ head -5 transaction.list
-rw-r--r--    1 root     root         6575 Aug  5  2009 file-7647833002.log
-rw-r--r--    1 root     root         8223 Aug  5  2009 file-8304157181.log
-rw-r--r--    1 root     root         6929 Aug  5  2009 file-7605687521.log
-rw-r--r--    1 root     root         6802 Aug  5  2009 file-8408844563.log
-rw-r--r--    1 root     root         6787 Aug  5  2009 file-8420786471.log

So to sum the sizes of the files, i thought i'd write a one-line awk script.  But then i second-guessed myself: for a file this size, perl has to be faster, right?  So i wrote a perl one-liner instead.  When i first ran it, it took a lot longer than i expected, so i timed it:

$ time perl -we 'my $sum = 0; while (<>) { my @F = split;
$sum += $F[4]; } printf "%d\n", $sum; ' transaction.list
53951193376

real    0m8.062s
user    0m7.970s
sys    0m0.080s

This seemed a little excessive to me, so i went back and ran the awk script i had originally intended to write, and it turned out to be more than five times faster:

$ time awk '{ SUM+=$5 } END { printf "%d\n", SUM }' transaction.list
53951193376

real    0m1.474s
user    0m1.390s
sys    0m0.040s

Then i thought, "obviously i'm just a hack and i don't know how to make perl sing".  So here was the next cut:

$ time perl -we 'my $sum = 0; while (<>) { my ($size) = /\d+[^\d]+(\d+)/;
 $sum += $size; } printf "%d\n", $sum; ' transaction.list
53951193376

real   0m4.387s
user   0m4.300s
sys    0m0.070s

Nearly twice as fast as the first perl version, but still nearly 3 times slower than the awk version.

I couldn't be bothered optimising it any further, but i wondered: is there an inherent performance limitation in perl's split function, or is the overhead of starting up the perl interpreter just that much higher?
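
One way to separate those two effects, sketched here without any timings of my own, would be to time the interpreter doing nothing at all, and then a pass that reads every line of the file but does no splitting; comparing those against the full one-liners should show roughly how much is start-up cost and how much is per-line work:

$ time perl -e ''                        # interpreter start-up alone
$ time perl -ne '1' transaction.list     # read every line, but do no field splitting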

I ran these scripts on my laptop, a Lenovo ThinkPad X200s, with an Intel Core 2 Duo SL9400 CPU and 4 GB RAM, running Ubuntu Linux 10.04 (lucid) 64-bit.  A few of my normal desktop apps were also running.  I ran the scripts a few times each in succession to ensure that i was getting reasonably reliable results.
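
In case anyone wants to reproduce the runs, a simple shell loop like the one below is all it takes (the first pass mainly warms the page cache, so the later passes are the more comparable ones):

$ for i in 1 2 3; do time awk '{ SUM+=$5 } END { printf "%d\n", SUM }' transaction.list; done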

Any thoughts?  How could i have written the perl version more efficiently?
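
A couple of untimed sketches i might try next, though i haven't benchmarked either of them so they may well be no faster: telling split to stop once it has the size field, and letting perl's -a autosplit switch handle the fields awk-style:

$ perl -ne '$sum += (split " ", $_, 6)[4]; END { printf "%d\n", $sum }' transaction.list
$ perl -lane '$sum += $F[4]; END { print $sum }' transaction.list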
