I generally find the study of algorithms at the low, which-sort()-is-the-fastest level to be uninteresting. I have no excuse, really: certain things appeal to certain people and this one just isn't for me. (Also, in the work I do, the real speed differences come from the larger rearrangements that the premature-optimization-haters always point to. Stuff like: which machines carry which data, getting rid of O(n^2) behavior, etc.)

So please forgive me for my intro-CS level of discussion, but I was curious about this the other day and enjoyed learning about it.

the algorithm
Here's an algorithm (credited to Hoare, not to be confused with graydon) for finding the median of an unsorted array; it's commonly known as quickselect. Actually, it more generally gives you the kth smallest element, and getting the median is just k = length / 2.
  1. Guess that the element at position k is the kth smallest; call it the pivot.
  2. Partition the array into smaller and greater halves than that guess.
  3. If the smaller half has k-1 elements, the guess was correct.
  4. Otherwise, the kth smallest element is in one of the two halves, so adjust k and recurse into the appropriate one.

The simpler algorithm for this is, of course:
  1. Sort the array.
  2. Take the middle element.
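In Haskell this simple version really is those two steps, in one line. (A sketch; nthBySort is my name for it, not from the post.)

```haskell
import Data.List (sort)

-- Sort the whole list (O(n log n)), then index into it.
-- The selection algorithm avoids sorting the parts it
-- never recurses into.
nthBySort :: Ord a => Int -> [a] -> a
nthBySort n l = sort l !! n
```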
What's neat is that Hoare's algorithm basically is quicksort, except that you only recurse into the subdivisions that have the one value you care about. (I guess quicksort was invented by the same fellow, so maybe this isn't a surprise.)

Another way of interpreting this is like a depth-first build of a binary tree: the guess is your current node, the partition picks all the nodes that belong on the left and right branches, and the recursion continues building the tree on the branch.
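That tree picture can be made literal. Here's a sketch (Tree, build, and nthTree are my names, not from any library) that eagerly builds the whole tree; the selection algorithm effectively only does the work along the one path nthTree walks, so this isn't cost-equivalent, just the same shape.

```haskell
import Data.List (partition)

data Tree a = Leaf | Node (Tree a) a (Tree a)

-- Each guess becomes a node; partition decides which elements
-- belong on the left and right branches.
build :: Ord a => [a] -> Tree a
build []     = Leaf
build (p:xs) = Node (build lt) p (build gt)
  where (lt, gt) = partition (< p) xs

size :: Tree a -> Int
size Leaf         = 0
size (Node l _ r) = 1 + size l + size r

-- Finding the nth-smallest element follows exactly one path
-- from the root, like the selection algorithm's recursion.
nthTree :: Int -> Tree a -> a
nthTree _ Leaf = error "index out of bounds"
nthTree n (Node l p r)
  | n < sz    = nthTree n l
  | n > sz    = nthTree (n - sz - 1) r
  | otherwise = p
  where sz = size l
```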

This (or an equivalent function) is in the C++ standard library, of course, as std::nth_element.

Here's a Haskell implementation of it using lists. (Yeah, not quite the same, but it carries the idea.)
import Data.List (partition)

select :: Int -> [a] -> (a, [a])
select i xs = let (front, x:back) = splitAt i xs in (x, front ++ back)

nth :: Ord a => Int -> [a] -> a
nth n l | n < ltlen = nth n lt
        | n > ltlen = nth (n-ltlen-1) gt
        | otherwise = pivot
  where (pivot, l') = select n l
        (lt, gt)    = partition (< pivot) l'
        ltlen       = length lt
partition comes from Data.List; select is a function that pulls the nth element out of a list, returning it along with the remaining elements.

But how do I know it works? There's a neat library called "QuickCheck" that does randomized testing: you tell it what sorts of inputs your function expects and a way to test whether the output is correct, and it randomly generates inputs to verify proper functioning.

QuickCheck is one of those libraries that seems to work by magic: no code preprocessing or introspection, just (as always) fancy types.

For example, here's how I write the requirement for nth. I rely on the builtin sort to give me the correct answer.
prop_Nth n l =
  nth n l == sort l !! n
This is just a normal function definition, but you can read the first line as "the property Nth holds for inputs n and l, when ..."
(Here, !! is the nth-element-from-a-list operator. I assume it has a scary-looking name because it's O(n).)

And then you check it simply with:
quickCheck prop_Nth
And QuickCheck can figure out (through type inference, again) that this particular property takes an integer and a list, and automatically generates test cases and runs them.

When I first ran this it found all sorts of failures that would be easy to forget in a unit test: for example, when the input list is empty, or when n is out of the bounds of the list. So they provide all sorts of fancy higher-level combinators that let you modify the random-input generators or restrict the inputs, or even report breakdowns of the sorts of inputs it generated ("32% three-element lists"). The manual is hokey-looking but quite impressive.
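QuickCheck's ==> combinator restricts inputs to those satisfying a precondition, and classify produces those percentage breakdowns. A sketch of the constrained property (repeating nth and a minimal select so the snippet compiles on its own):

```haskell
import Data.List (partition, sort)
import Test.QuickCheck

-- Repeated from above so this snippet stands alone.
select :: Int -> [a] -> (a, [a])
select i xs = let (front, x:back) = splitAt i xs in (x, front ++ back)

nth :: Ord a => Int -> [a] -> a
nth n l | n < ltlen = nth n lt
        | n > ltlen = nth (n-ltlen-1) gt
        | otherwise = pivot
  where (pivot, l') = select n l
        (lt, gt)    = partition (< pivot) l'
        ltlen       = length lt

-- Only test in-bounds indices, and report how often the
-- generated list was short.
prop_Nth :: Int -> [Int] -> Property
prop_Nth n l =
  (0 <= n && n < length l) ==>
    classify (length l <= 3) "short list"
      (nth n l == sort l !! n)
```

Running quickCheck prop_Nth then discards the out-of-bounds cases instead of failing on them.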

Here's an interesting thread about how laziness affects big-O.