20070124

Regular Expression Speeds

I read a paper today titled "Regular Expression Matching Can Be Simple And Fast". I thought his results were interesting enough to compare to my own regular expression engine. (Yes, I'm dorky enough to have my own engine.) I call my program "chrep", which is short for "Church's Regular Expression Parser".

Below, I've tested four engines with the string "aaaaaaaaaaaaaaaaaaaa" against the expression "a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa" (which should always return a successful match). (Note: Sorry if you aren't able to read the entire command. My CSS-foo sucks and you probably have to read the source code.)

jcchurch@linux:~> time echo "aaaaaaaaaaaaaaaaaaaa" | chrep 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa'
[aaaaaaaaaaaaaaaaaaaa] in : aaaaaaaaaaaaaaaaaaaa

real    0m5.137s
user    0m2.964s
sys     0m0.012s
5 seconds. Not bad, but not really all that good. The next program tested was egrep:
jcchurch@linux:~> time echo "aaaaaaaaaaaaaaaaaaaa" | egrep 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa'
aaaaaaaaaaaaaaaaaaaa

real    0m0.055s
user    0m0.008s
sys     0m0.008s
0.055 seconds. Roughly 100 times faster than my parser. The speed benchmark has been set pretty high. The next test is the old, reliable AWK:
jcchurch@linux:~> time echo "aaaaaaaaaaaaaaaaaaaa" | awk '/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa/{print}'
aaaaaaaaaaaaaaaaaaaa

real    0m0.049s
user    0m0.004s
sys     0m0.004s
0.049 seconds. Still making my program look bad. Finally, I wanted to test Perl:
jcchurch@linux:~> time echo "aaaaaaaaaaaaaaaaaaaa" | perl -ne'print if /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa/'
aaaaaaaaaaaaaaaaaaaa

real    0m0.393s
user    0m0.328s
sys     0m0.008s

Perl takes almost a half second. That's horribly slow compared to grep and awk, but still 13 times faster than my parser.

My parser ('cause I know you are wondering) is a simple interpretative, recursive decent parser. There is no compiling of the expression, which I know would speed things up. I'm glad to see that I'm not too far behind the competition, and this will encourage me to make those benchmarks.

1 comment:

Unknown said...

It's insane that yours is already that good :-)