August 5th, 2006

  • evan

hoare's find algorithm, quickcheck

I generally find the study of algorithms at the low, which-sort()-is-the-fastest level to be uninteresting. I have no excuse, really: certain things appeal to certain people and this one just isn't for me. (Also, in the work I do, the real speed differences come from the larger rearrangements that the premature-optimization-haters always claim. Stuff like: which machines carry which data, getting rid of O(n^2), etc.)

So please forgive me for my intro-CS level of discussion, but I was curious about this the other day and enjoyed learning about it.

the algorithm
Here's an algorithm (credited to Hoare, not to be confused with graydon) for finding the median of an unsorted array. Actually, it more generally gives you the kth smallest element, and getting the median is just k = length / 2.
  1. Guess that the kth element is the kth smallest.
  2. Partition the array into smaller and greater halves than that guess.
  3. If the smaller half has k-1 elements, the guess was correct.
  4. Otherwise, the kth smallest element is in one of the two halves, so adjust k and recurse into the appropriate one.

The simpler algorithm for this is, of course:
  1. Sort the array.
  2. Take the middle element.
What's neat is that the previous algorithm basically is quicksort, except that you only recurse into the subdivisions that have the one value you care about. (I guess quicksort was invented by the same fellow, so maybe this isn't a surprise.)

Another way of interpreting this is like a depth-first build of a binary tree: the guess is your current node, the partition picks all the nodes that belong on the left and right branches, and the recursion continues building the tree on the branch.

This (or an equivalent function) is in C++, of course, as the function nth_element.

Here's a Haskell implementation of it using lists. (Yeah, not quite the same, but it carries the idea.)
nth n l | n < ltlen = nth n lt
        | n > ltlen = nth (n-ltlen-1) gt
        | otherwise = pivot
  where (pivot,l') = select n l
        (lt, gt)   = partition (< pivot) l'
        ltlen      = length lt
partition is a builtin; select is a function that pulls out the nth element from a list.

But how do I know it works? There's a neat library called "QuickCheck" that does randomized testing: you tell it what sorts of inputs your function expects and a way to test whether the output is correct, and it randomly generates inputs to verify proper functioning.

QuickCheck is one of those libraries that seems to work by magic: no code preprocessing or introspection, just (as always) fancy types.

For example, here's how I write the requirement for nth. I rely on the builtin sort to give me the correct answer.
prop_Nth n l =
  nth n l == sort l !! n
This is just a normal function definition, but you can read the first line as "the property Nth holds for inputs n and l, when ..."
(Here, !! is the nth-element-from-a-list operator. I assume it has a scary-looking name because it's O(n).)

And then you check it with simply:
quickCheck prop_Nth
And QuickCheck can figure out (through type inference, again) that this particular property takes an integer and a list, and automatically generates test cases and runs them.

When I first ran this it found all sorts of places where it'd fail that would be easy to forget with a unit test: for example, when the input list is empty, or when n is not in the bounds of the list. So they provide all sorts of fancy higher-level operations that let you modify the random-input-generators or restrict the inputs, or even have it provide breakdowns of the sorts of inputs it gave ("32% three-element lists.") The manual is hokey-looking but quite impressive.

Here's an interesting thread about how laziness affects big-O.
  • evan


I remember a few years ago that brad observed that he was much more productive when he unplugged his network connection. It's not only the distraction of IMs (I avoid IM) or email notifiers (I turned all mine off) or LJ posts (I broke LogJam's LJ friends-update notifier, but I think it might be broken on LJ's side too) -- it's also that I've trained myself to jam on my "friends page" link whenever I encounter the tiniest obstacle. I've even caught myself reloading my friends page while reading my friends page.

Tonight I had a flash of inspiration: iptables lets you do all sorts of packet-mangling trickery, including capturing TCP connections destined for a particular host. So I give you unplug: when you run it, it hijacks all connections to a list of hosts, making any attempt to load their pages instead serve you a page that says "FOCUS". When you kill it, it puts things back to normal.

In any case, the code is shorter than its description, so if you wanted to do something similar yourself, the only part that takes a bit of time to figure out is:
iptables -t nat -A OUTPUT --destination hostname --protocol tcp --jump REDIRECT --to-ports portnum
To list the rules, s/-A.*/-L/. To remove it, s/-A/-D/.
The REDIRECT target only works on the nat table. For more complicated redirecting there's also a DNAT target that'll let you redirect to a different host.