evan_tech

Previous Entry Share Next Entry
10:33 pm, 18 Apr 05

side effects

I recently read about an interesting bug someone had encountered. It was one of those heisenbugs that couldn't seem to be pinned down, where removing one part of the program made the other part work. They eventually tracked it down to Python's re.compile.

One part of code called something like re.compile(u'.*') and the other re.compile('.*'). (In Python, prefixing a string literal with "u" makes it a Unicode string.) The problem turned out to be something like this: while those two regular expressions have different behaviors (the regular expression engine uses the Unicode flag to vary the matching behavior, I imagine), internally the re module caches the generated regular expression objects and only used their string values (and not the Unicode flag) as the key in the cache.

Depending on which of those two re.compiles happened first, both would evaluate to either a Unicode regexp or a non-Unicode regexp.

So it turned out to be Python bug, and in fact one that has already been fixed. But the reason this is interesting to me is the reason why it's hard to track down. Nowhere in this programmer's (or my) mental model of how this code worked was the notion that a call to re.compile would behave differently depending on where you called it.

The technical term is that you expect re.compile to be referentially transparent. This Python bug is easy to have in code written in pretty much any language (including my beloved O'Caml) because functions are effectively just subroutines.

In the Haskell gospels, they talk about "substituting equals for equals": all functions are by definition referentially transparent unless they're explicitly marked as having state. It's because Haskell sorta doesn't have an order of operations; you're just expressing definitions. In fact, one cool way to view monads that I only grokked recently is that they're a way of expressing a sequence of computations. In that light, a series of statements in a monad correspond to a series of "let" expressions in ML, which uses let expressions as a way of expressing order of evaluation.

It's really refreshing to think that this language is boiling away in my brain, changing structure and giving me new perspectives on ideas already familiar.



P.S.: I had a related bug today in some code that I had borrowed from Russell. Some function was defined as foo(Bar bar, int baz) and I helpfully changed it to foo(Bar& bar, int baz) so bar wouldn't have to be passed on the stack. But of course, the properly safe transformation is to const Bar& bar, and had I done that I wouldn't have lost a day. (Bugs in programs that process large amounts of data are hard to track down.) Mostly stupid on my part.