20071205

Perl's Diamond Operator in Python

I'm been a full Python convert for over a year now. Perl is just too cumbersome at times. I think the artist who draws XKCD has made the same choice. The one thing that still takes me back to Perl is the ultra-handy <> "diamond operator". You can't deny Perl's dominance when it comes to string processing. Look at this elegance:

#!/usr/bin/perl

while (<>) {
    print
}

Every time I encounter a string processing problem where I have to work with multiple files, I go back to Perl. I didn't think Python could do this. Python should have the same ability as Perl's diamond:

def diamond():
    """
    diamond()
    Pre: Nothing.
    This is my attempt to recreate the very handy diamond operator found in
    Perl, which is my only reason for still using that language.

    To use the code:
    for line in diamond():
        process(line)
    """
    import sys
    if len(sys.argv[1:]) > 0:
        for file in sys.argv[1:]:
            if file == '-':
                for line in sys.stdin:
                    yield line
            else:
                f = open(file, 'r')
                for line in f:
                    yield line
                f.close()
    else:
        for line in sys.stdin:
            yield line

What does this do? It checks the sys.argv variable for file names on the command line and iterates through each line of each file. If it can't find a file, it defaults to standard input.

Of course, before I start to write code, I should always use Google first. There exist a module called "fileinput" which does the same thing.

D'oh! Hey... it's practice.

20071202

A Losing Strategy

Today I ran across a year 2000 article in the New York Times titled "Paradox in Game Theory: Losing Strategy That Wins." Given two games which lose most of the time, a strategy can be devised so that we can win most of the time. The article has a (fairly) detailed description of two example games:

The paradox is illustrated by two games played with coins weighted on one side so that they will not fall by chance to heads or tails.

In game A, a player tosses a single loaded coin and bets on each throw. The probability of winning is less than half.

In game B, there are two coins and the rules are more complicated. The player tosses either Coin 1, loaded to lose almost all the time or Coin 2 loaded to win more than half the time. He plays Coin 1 if his money is a multiple of a particular whole number, like three.

If his money cannot be divided evenly by that number, he plays Coin 2. In this setup, the second coin will be played more often than the first.

After reading this story, I wrote a quick Python script to demonstrate this game. The problem description isn't definitive on things like the exact probabilities of the coin flips, so we have to make a few assumptions. I set the probability of the coin flip in Game A to p1=0.33, and in Game B, the probability of coin1 to p=0.1 and coin2 to p=0.66. These probabilities are pulled from the air, but they do meet the requirement set forth in the problem description, so they should work. So the two games alternating should outperform both Game A and Game B when run indivdually. Let's run the program a few times to see if a combination of Games A and B are better than Game A or Game B.

[jcchurch@mcart python]$ python badgames.py 
Game A, 1000 runs:
  Before: 1000 MONEY
  After:  674 MONEY

Game B, 1000 runs:
  Before: 1000 MONEY
  After:  978 MONEY

Alternating: 1000 runs:
  Before: 1000 MONEY
  After:  850 MONEY

[jcchurch@mcart python]$ python badgames.py 
Game A, 1000 runs:
  Before: 1000 MONEY
  After:  734 MONEY

Game B, 1000 runs:
  Before: 1000 MONEY
  After:  890 MONEY

Alternating: 1000 runs:
  Before: 1000 MONEY
  After:  870 MONEY

[jcchurch@mcart python]$ python badgames.py 
Game A, 1000 runs:
  Before: 1000 MONEY
  After:  636 MONEY

Game B, 1000 runs:
  Before: 1000 MONEY
  After:  932 MONEY

Alternating: 1000 runs:
  Before: 1000 MONEY
  After:  820 MONEY

[jcchurch@mcart python]$ python badgames.py 
Game A, 1000 runs:
  Before: 1000 MONEY
  After:  604 MONEY

Game B, 1000 runs:
  Before: 1000 MONEY
  After:  918 MONEY

Alternating: 1000 runs:
  Before: 1000 MONEY
  After:  846 MONEY
I just ran the problem described in the New York Times 4 times and got 4 negative results. Game B always outperforms Game A and the combined Games A and B. This article looked very interesting and very improbable. If it's too good to be true, check for yourself.

20070908

Three Parts of Division

I was going to update this blog at some point.

I'm back to solving problems on Project Euler. I've solved the easy ones, and now I'm getting around to the medium difficulty problems. I like #26.

Problem 26: Find the value of d < 1000 for which 1/d contains the longest recurring cycle in its decimal fraction part.

That's kinda interesting. Every time you divide two numbers, there are three parts: The integer part, the non-repeating decimal part, and the repeating decimal part.

I wrote up this little piece of Python to compute the three parts. It is a forced base-10 long division calculator, and it makes good use of Python's stings doubling as list, which is a feature that I always thought was nice about Python.

def division(num, den):
    ipart = str(num / den);
    fpart = ""
    rpart = ""
    seen  = [ num%den ]
    while num % den != 0:
        num        = (num%den) * 10
        multiple   = num / den
        fpart     += str(multiple)
        num        = num - (den * multiple)
        c = 0
        quit = False
        for i in seen:
            if num == i: # This number has been seen before.
                rpart = fpart[c:] # Make characters from c to end of list part of repeating part
                fpart = fpart[:c] # Truncate fpart
                quit = True
                break
            c = c + 1
        if quit:
            break
        seen += [num]
    return [ipart, fpart, rpart]

This method accepts two arguments: An integer numerator and an integer denominator. The return is a list with three items: The whole number part, the non-repeating decimal part, and the repeating decimal part. Each part is a string data type (because sometimes these parts can be excessively long for python's Integer type.)
>>> division(1,4)
['0', '25', '']
>>> division(1,6)
['0', '1', '6']
>>> division(1,7)
['0', '', '142857']

20070425

Converting ISBN13 to ISBN10 in PHP

I've been writing a textbook buyback system for a buddy of mine. He runs a service that purchases textbooks from students at the end of a school semester. He wanted me to write a program to query Amazon.com and estimate the best value to offer a student for his or her book.

As of January 1, 2007, the agency that controls ISBN codes for all books printed in the world has changed their code format from 10 digits (regular expression /[0-9Xx]{10}/) to 13 digits (regular expression /[0-9]{13}/). Amazon has not updated their Web E-Commerce API to accommodate this change in standard. Thus, any time I get an 13-digits ISBN, I must convert it to the older 10-digits standard.

When "Googling" for solutions to this problem, I found several, but I decided to write this one myself.

// Converts ISBN-13 to ISBN-10
// Leaves ISBN-10 numbers (or anything else not matching 13 consecutive numbers) alone
function ISBN13toISBN10($isbn) {
    if (preg_match('/^\d{3}(\d{9})\d$/', $isbn, $m)) {
        $sequence = $m[1];
        $sum = 0;
        $mul = 10;
        for ($i = 0; $i < 9; $i++) {
            $sum = $sum + ($mul * (int) $sequence{$i});
            $mul--;
        }
        $mod = 11 - ($sum%11);
        if ($mod == 10) {
            $mod = "X";
        }
        else if ($mod == 11) {
            $mod = 0;
        }
        $isbn = $sequence.$mod;
    }
    return $isbn;
}

20070322

Handling Multiple MySQL Queries - "AS" labels

I'm working on a project for the School of Pharmacy on campus. They wanted me to write a program that pulls numbers from a database on Marijuana into a nicely formatted report that they could print off and send to government agencies that are interested in the data. One of the reports requires me to pull hundreds of individually calculated items from a MySQL database. A single MySQL query isn't able to do the job. When this report is completed, there will probably 75 to 100 queries. To think that this quarterly report has been done manually for almost 2 decades overwhelms me as I try to write this report software. I'm trying to reduce several weeks of work to prepare this report to a few seconds.

The entire project is done in PHP and MySQL. PHP and MySQL go hand-in-hand, but every time I write a query, I can't help but wonder how this could be made easier. Pulling one value out of a query looks like spaghetti code. Pulling 300 values from 100 different queries is an explosion of crappy code. MySQL queries required for this report look like this:

SELECT THC FROM samples HAVING MAX(THC);

This is just one query that will find the one sample in the database with the highest THC composition and report what that composition value is. There are hundreds more queries in the report similar to this one. Typically, when you run a MySQL query from PHP, here are the steps you have to take:

  1. Format the query command.
  2. Execute the query command.
  3. Retrieve the first row of results from the successful execution (usually in the form of an associative array).
  4. Transfer those values from the array to variables that actually mean something to the nature of the program (like $max_thc_composition).

If I have 100 queries, and I have to perform the same 4 steps for each query, that's 400 lines of code. Obviously, as any freshmen computer science instructor will tell you (because I am one), this needs to be regulated to a function. But how do you specify a query (which can return more than one value) and the variables associated with those results in a concise manner? If the function doesn't know the variables names going into the function, it can't assign those values coming out of the function.

SELECT THC AS max_thc_composition FROM samples HAVING MAX(THC);

The "AS" expression in MySQL will rename a column into something else. It can also be used to uniquely label a query. This may not seem useful in a single call, but when you have a few 100 calls to make, it's a real time saver.

    //runQueries: Runs multiple queries.
    // Pre: An array of MySQL single-result queries where
    // each result has a unique "AS" expression.
    // Post: An associative array populated with all of those values.
    function runQueries($sqls) {
        $report = array();
        //Execute each SQL query and drop the result in the report array
        foreach ($sqls as $query) {
            $result = mysql_query($query) or die("Query Failed: $query Reason: ".mysql_error());
            $row    = mysql_fetch_assoc($result);

            // Essentially, for each "AS" expression in the query,
            // look for that result in row returned and pull that value out if it exist.
            // If it does not exist, just give it a value of 0.
            preg_match_all("/ AS (\w+)/", $query, $keys);
            foreach ($keys[1] as $key)
                if (isset($row[$key]))
                    $report[$key] = $row[$key];
                else
                    $report[$key] = 0;
        }
        return $report;
    }

This function takes an array of MySQL single-result query strings, where each result has an "AS" label identifying what this result means. This method executes each query and drops the result into a associative array where the "AS" labels double as the array keys. And of course, it uses a regular expression.

I'm suppose to be teaching a class on Python programming in the fall. Have you noticed that I haven't featured a Python script in this blog series yet? That bothers me.

20070222

Finding lost files

So a student e-mails me today saying that he doesn't have a grade recorded for one of his programming assignments. I check his folder of turned-in assignments and I can't find it.

Who knows? Maybe I just dropped his homework in a different folder by mistake. Checking through everyone's folders for his will take me a while. I make all of my students put their e-mail address in the comments of their code, so I just run a search on all the *.java files on my computer that have a certain e-mail address with this handy line:

for i in $(find . -name *.java); do grep -H 'emailName' $i; done
(Replace 'emailName' with the descriptive feature of the file you are looking for.)

I still can't find his homework.

20070124

Regular Expression Speeds

I read a paper today titled "Regular Expression Matching Can Be Simple And Fast". I thought his results were interesting enough to compare to my own regular expression engine. (Yes, I'm dorky enough to have my own engine.) I call my program "chrep", which is short for "Church's Regular Expression Parser".

Below, I've tested four engines with the string "aaaaaaaaaaaaaaaaaaaa" against the expression "a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa" (which should always return a successful match). (Note: Sorry if you aren't able to read the entire command. My CSS-foo sucks and you probably have to read the source code.)

jcchurch@linux:~> time echo "aaaaaaaaaaaaaaaaaaaa" | chrep 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa'
[aaaaaaaaaaaaaaaaaaaa] in : aaaaaaaaaaaaaaaaaaaa

real    0m5.137s
user    0m2.964s
sys     0m0.012s
5 seconds. Not bad, but not really all that good. The next program tested was egrep:
jcchurch@linux:~> time echo "aaaaaaaaaaaaaaaaaaaa" | egrep 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa'
aaaaaaaaaaaaaaaaaaaa

real    0m0.055s
user    0m0.008s
sys     0m0.008s
0.055 seconds. Roughly 100 times faster than my parser. The speed benchmark has been set pretty high. The next test is the old, reliable AWK:
jcchurch@linux:~> time echo "aaaaaaaaaaaaaaaaaaaa" | awk '/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa/{print}'
aaaaaaaaaaaaaaaaaaaa

real    0m0.049s
user    0m0.004s
sys     0m0.004s
0.049 seconds. Still making my program look bad. Finally, I wanted to test Perl:
jcchurch@linux:~> time echo "aaaaaaaaaaaaaaaaaaaa" | perl -ne'print if /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa/'
aaaaaaaaaaaaaaaaaaaa

real    0m0.393s
user    0m0.328s
sys     0m0.008s

Perl takes almost a half second. That's horribly slow compared to grep and awk, but still 13 times faster than my parser.

My parser ('cause I know you are wondering) is a simple interpretative, recursive decent parser. There is no compiling of the expression, which I know would speed things up. I'm glad to see that I'm not too far behind the competition, and this will encourage me to make those benchmarks.

20070109

XOR Logic

It's winter break, so I have had less "essential" programming to do. I have been wasting time on projecteuler.net. This site is an on-line competition devoted to solving very difficult math problems of the type that require programs to solve. I've been on the site for about two weeks and I've solved a third of the problems.

The problem that has interested me the most is problem #59: Brute force decryption using XOR Logic. I can't talk about the solution here, because that would only help people who are trying to win the competition by Googling answers, but I can talk about XOR Logic.

XOR Logic is the comparison between two items where the results is true if one is true and the other isn't. If both are false or if both are true, the result is false. In programming terms, we think of it like this:

unsigned char a, b, c;
scanf("%c",&a);
scanf("%c",&b);
c = (~a&b)|(a&~b); // XOR Logic Here!
printf("%c(%d) -> %c(%d) -> %c(%d)\n", a,a, b,b, c,c);

In this example, we have two terms: a and b. We wish to XOR these two items to make a result c. Using C's bitwise and "&", bitwise or "|", and bitwise negate "~", we can create the XOR gate. We have two inputs, and one has to be true and the other must be false for the result to be true. It doesn't matter the order. Output c is true if input a is true and input b is false or if a is false and b is true. In fact, that sentence is pretty much how I programmed it.

English:
Output c is true if input a is true and input b is false
or if a is false and b is true.

English and C:
Output c is true [c = ] if [(] input a is true [a] and(&)
input b is false[~b] [)] or[|] if[(] a is false[~a] and[&]
b is true[b] [)].[;]

C:
c = (~a&b)|(a&~b);

So how is this useful? Cryptography is based on the practice of taking a message, encrypting it, then taking the encrypted message and decrypting it. The great thing about XOR logic is that it is used for both encrypting and decrypting text.

Let's say that your secret message consist of one character: "L". "L" on the ASCII table is decimal 76 or binary "01001100". You need a secret password to encrypt the letter "L". For that, we'll use "!" which is decimal 33 or binary "00100001".

01001100 L (76)
00100001 ! (33)
-------- XOR
01101101 m (109)
"L" has now been encrypted. The encrypted text is now "m". Using our password "!", we can now decrypt the text:
01101101 m (109)
00100001 ! (33)
-------- XOR
01001100 L (76)
We just decrypted our message "m" using the same XOR method and password "!" that we used to encrypt the message. XOR Logic is pretty useful because it provides the method to both encrypt and decrypt messages quickly.