evan_tech

Previous Entry Share Next Entry
06:19 pm, 14 Apr 06

packrat parsing: clever

A friend from work pointed me at Packrat Parsing:
[...] Top-down parsers can in turn be divided into two categories. Predictive parsers attempt to predict what type of language construct to expect at a given point by “looking ahead” a limited number of symbols in the input stream. Backtracking parsers instead make decisions speculatively by trying different alternatives in succession: if one alternative fails to match, then the parser “backtracks” to the original input position and tries another. Predictive parsers are fast and guarantee linear-time parsing, while backtracking parsers are both conceptually simpler and more powerful but can exhibit exponential runtime.

This paper presents a top-down parsing strategy that sidesteps the choice between prediction and backtracking. Packrat parsing provides the simplicity, elegance, and generality of the backtracking model, but eliminates the risk of super-linear parse time, by saving all intermediate parsing results as they are computed and ensuring that no result is evaluated more than once. The theoretical foundations of this algorithm were worked out in the 1970s [3, 4], but the linear-time version was apparently never put in practice due to the limited memory sizes of computers at that time. However, on modern machines the storage cost of this algorithm is reasonable for many applications. Furthermore, this specialized form of memoization can be implemented very elegantly and efficiently in modern lazy functional programming languages, requiring no hash tables or other explicit lookup structures.
As that snippet indicates, there are two interesting things going on here:
  1. The first is pointing out a "dynamic programming" (I hate that term) approach for tabular parsing (described in the paper), which makes backtracking parsing run in linear time.
  2. The second is that this has a cute lazy functional (Haskell) implementation.

    It uses recursive functions, of course, but it also uses recursive data in that it creates a record (struct) where the fields are defined in terms of something that holds the record itself. I still haven't fully wrapped my head around that.
Each independently is pretty neat, while putting them together earns this the Evan stamp of "hmm, interesting."

Also interesting is that the author of that paper appears to be, around the same time as publishing papers on parsing, writing docs on P2P NAT traversal along with Dan Kegel (whose C10K page you may recognize).