20070124

Regular Expression Speeds

I read a paper today titled "Regular Expression Matching Can Be Simple And Fast". I thought his results were interesting enough to compare to my own regular expression engine. (Yes, I'm dorky enough to have my own engine.) I call my program "chrep", which is short for "Church's Regular Expression Parser".

Below, I've tested four engines with the string "aaaaaaaaaaaaaaaaaaaa" against the expression "a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa" (which should always return a successful match). (Note: Sorry if you aren't able to read the entire command. My CSS-foo sucks and you probably have to read the source code.)

jcchurch@linux:~> time echo "aaaaaaaaaaaaaaaaaaaa" | chrep 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa'
[aaaaaaaaaaaaaaaaaaaa] in : aaaaaaaaaaaaaaaaaaaa

real    0m5.137s
user    0m2.964s
sys     0m0.012s
5 seconds. Not bad, but not really all that good. The next program tested was egrep:
jcchurch@linux:~> time echo "aaaaaaaaaaaaaaaaaaaa" | egrep 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa'
aaaaaaaaaaaaaaaaaaaa

real    0m0.055s
user    0m0.008s
sys     0m0.008s
0.055 seconds. Roughly 100 times faster than my parser. The speed benchmark has been set pretty high. The next test is the old, reliable AWK:
jcchurch@linux:~> time echo "aaaaaaaaaaaaaaaaaaaa" | awk '/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa/{print}'
aaaaaaaaaaaaaaaaaaaa

real    0m0.049s
user    0m0.004s
sys     0m0.004s
0.049 seconds. Still making my program look bad. Finally, I wanted to test Perl:
jcchurch@linux:~> time echo "aaaaaaaaaaaaaaaaaaaa" | perl -ne'print if /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa/'
aaaaaaaaaaaaaaaaaaaa

real    0m0.393s
user    0m0.328s
sys     0m0.008s

Perl takes almost a half second. That's horribly slow compared to grep and awk, but still 13 times faster than my parser.

My parser ('cause I know you are wondering) is a simple interpretative, recursive decent parser. There is no compiling of the expression, which I know would speed things up. I'm glad to see that I'm not too far behind the competition, and this will encourage me to make those benchmarks.

20070109

XOR Logic

It's winter break, so I have had less "essential" programming to do. I have been wasting time on projecteuler.net. This site is an on-line competition devoted to solving very difficult math problems of the type that require programs to solve. I've been on the site for about two weeks and I've solved a third of the problems.

The problem that has interested me the most is problem #59: Brute force decryption using XOR Logic. I can't talk about the solution here, because that would only help people who are trying to win the competition by Googling answers, but I can talk about XOR Logic.

XOR Logic is the comparison between two items where the results is true if one is true and the other isn't. If both are false or if both are true, the result is false. In programming terms, we think of it like this:

unsigned char a, b, c;
scanf("%c",&a);
scanf("%c",&b);
c = (~a&b)|(a&~b); // XOR Logic Here!
printf("%c(%d) -> %c(%d) -> %c(%d)\n", a,a, b,b, c,c);

In this example, we have two terms: a and b. We wish to XOR these two items to make a result c. Using C's bitwise and "&", bitwise or "|", and bitwise negate "~", we can create the XOR gate. We have two inputs, and one has to be true and the other must be false for the result to be true. It doesn't matter the order. Output c is true if input a is true and input b is false or if a is false and b is true. In fact, that sentence is pretty much how I programmed it.

English:
Output c is true if input a is true and input b is false
or if a is false and b is true.

English and C:
Output c is true [c = ] if [(] input a is true [a] and(&)
input b is false[~b] [)] or[|] if[(] a is false[~a] and[&]
b is true[b] [)].[;]

C:
c = (~a&b)|(a&~b);

So how is this useful? Cryptography is based on the practice of taking a message, encrypting it, then taking the encrypted message and decrypting it. The great thing about XOR logic is that it is used for both encrypting and decrypting text.

Let's say that your secret message consist of one character: "L". "L" on the ASCII table is decimal 76 or binary "01001100". You need a secret password to encrypt the letter "L". For that, we'll use "!" which is decimal 33 or binary "00100001".

01001100 L (76)
00100001 ! (33)
-------- XOR
01101101 m (109)
"L" has now been encrypted. The encrypted text is now "m". Using our password "!", we can now decrypt the text:
01101101 m (109)
00100001 ! (33)
-------- XOR
01001100 L (76)
We just decrypted our message "m" using the same XOR method and password "!" that we used to encrypt the message. XOR Logic is pretty useful because it provides the method to both encrypt and decrypt messages quickly.