10:14 pm, 9 Feb 05


I hardly know anything about compression, but a friend pointed me at the algorithm presented in A Block-sorting Lossless Data Compression Algorithm -- also known as the bzip2 algorithm -- and I was blown away. The basic approach is a simple pre-transformation of the data that makes traditional compression work better, but the transformation itself is sorta insane: you wouldn't expect it to work. The paper has an easy-to-follow example, too, so you have no excuse not to look at it. (The encoding is pretty simple, and takes only one page (page 2) of the paper; what's crazy is that you can get the original data back from it.)
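For the curious, here's a minimal sketch of the forward transform in Python -- the rotate-and-sort description from the paper done completely naively (real implementations use suffix-array tricks instead of materializing every rotation, and the `$` sentinel is my own addition to mark the end of the block):

```python
def bwt(s: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform of s.

    Assumes the sentinel does not occur in s; it marks the end of
    the block so the transform is invertible without a separate
    row index.
    """
    s += sentinel
    # Form every rotation of the block, sort them lexicographically,
    # and emit the last column of the sorted rotation matrix.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # -> "annb$aa": equal characters clump together
```

The point of the exercise is visible even in this tiny example: the last column groups similar characters into runs, which is exactly what a traditional entropy coder loves.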

It also turns out that Mike Burrows, the first author of the paper (the 'b' in 'bzip', I was told), sits about 20 feet from me. Tonight I introduced myself and we chatted about it for a bit. He shared some amusing anecdotes, such as:

The paper claims performance comparable to Lempel-Ziv while achieving better compression. This was true at the time, but Mike said that the final deshuffling step has poor cache performance, so as computers have grown faster and more cache-dependent, the algorithm has fallen behind gzip. At a recent compression conference (celebrating the 10th anniversary of the paper's publication, though he pointed out that the transform was actually discovered in 1978 -- the year given in the paper is wrong) he proposed some methods to speed it up by a factor of two. But the compression folks were only interested in O() improvements, and since somebody had already proved it was O(n) (with a large constant), nobody was interested.
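The "deshuffling" he's talking about is the inverse transform, and it's easy to see why it's cache-hostile: the decode loop hops to an essentially arbitrary index on every step. A sketch, assuming the encoder appended a unique sentinel character (`'$'` here) that sorts before everything else in the block:

```python
def unbwt(last: str) -> str:
    """Invert the block-sorting transform, given its last column.

    Assumes the block ends with a unique '$' sentinel that sorts
    before every other character.
    """
    n = len(last)
    # A stable sort of positions by character links each row of the
    # sorted rotation matrix to the row whose rotation starts one
    # character later.
    links = sorted(range(n), key=lambda i: last[i])
    out, i = [], 0  # row 0 of the sorted matrix starts with the sentinel
    for _ in range(n):
        # Each hop lands at an essentially random index -- this pointer
        # chasing is the cache-unfriendly step Mike described.
        i = links[i]
        out.append(last[i])
    return "".join(out)[1:]  # drop the leading sentinel

print(unbwt("annb$aa"))  # recovers "banana"
```

Unlike a sequential scan, each `links[i]` lookup can miss cache on a large block, which matches his story about the transform losing ground as CPUs came to depend more on locality.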

Mike seems like a generally hax0ry fellow, recently getting credit for writing some tight assembly for an upcoming Google project.

[Obligatory programming-language-related footnote: the author of the bzip2 library works on ghc, the Haskell compiler.]