April 12th, 2006

  • evan

regular-expression parsing

"Now you have two problems."

Wikipedia has preliminary support for some new expressions in its markup, including an "if" statement. From the discussion page:
{{if: <condition> | <then text> | <else text> }}

Using pipes (|) as the argument separator is problematic when the <text> sections include wiki table markup (i.e. |-, |, ||) and sometimes even wiki links (i.e. [[Page name|Text to display]]). Would it be possible to use some other character to separate the arguments?
The discussion that follows gets worse, with talk of how to perhaps allow escaping of pipe characters. :( :(
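
To make the pipe problem concrete, here is a small sketch in Python. It is not how MediaWiki actually parses templates; the function names and the sample text are made up for illustration. The naive version splits on every pipe, so a wiki link in the "then" text gets cut in half. The second version tracks [[ ]] nesting, which rescues links but still does nothing for table markup like |- or ||.

```python
def naive_split(body):
    """Split template arguments on every pipe, ignoring nesting."""
    return [part.strip() for part in body.split("|")]


def nesting_aware_split(body):
    """Split on pipes only when we are outside a [[...]] link."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair == "[[":
            depth += 1
            current.append(pair)
            i += 2
        elif pair == "]]" and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            # A top-level pipe ends the current argument.
            parts.append("".join(current).strip())
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current).strip())
    return parts


# Hypothetical body of an {{if: ... }} expression.
body = "cond | see [[Page name|Text to display]] | nothing"

print(naive_split(body))          # four pieces: the link is torn apart
print(nesting_aware_split(body))  # three pieces: condition, then, else
```

Even the second version is only a patch: every new construct that can contain a pipe needs its own special case, which is roughly where the escaping discussion ends up.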

Every so often someone at work will have some bright idea about a clever way to use Wikipedia data, get excited, and then realize there's basically no way to process Wikipedia content without running a Wikipedia install to convert it to HTML first. (And once you've done that, you've lost all of the semantic-ish markup...)

My personal website is generated with some scripts I wrote years ago that also use regular expressions to parse the page markup. (I'm always surprised this approach isn't more common: it lets me change the look of things all at once without requiring my server to regenerate each page on every view, which makes things like If-Modified-Since work right.) The regular-expression-based parsing has only bitten me a few times, but that's partly because I kept it intentionally simple, since I wasn't sure if I was gonna stick with Ruby.

Anyway, as Yet Another "understand Haskell better" project, I've been idly rewriting the system to use Parsec, which is another one of those Haskell things that is so beautiful that it hurts to look at parsing in other languages*. It also gives me some nice properties, like making syntax errors very visible because the pages fail to parse. But I keep running into situations where I'm not sure of the right way to do things, and then realize I don't know anybody (aside from maybe graydon) who knows what the "right" way is. I suppose that means I ought to subscribe to one of those Haskell mailing lists rather than posting here.
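
The "fail to parse" property can be illustrated with a rough sketch, written in Python rather than Haskell and using a made-up [[link]] syntax that is an assumption for the example, not the site's actual markup and not Parsec's API. The regex version quietly passes malformed input through; the hand-written parser refuses it and reports where the problem is.

```python
import re


class ParseError(Exception):
    """Raised when the markup is malformed."""


def regex_render(text):
    """Regex approach: anything that does not match passes through untouched."""
    return re.sub(r"\[\[(.+?)\]\]", r"<a>\1</a>", text)


def strict_render(text):
    """Parser approach: an unclosed [[ is a hard error with a position."""
    out, i = [], 0
    while i < len(text):
        if text.startswith("[[", i):
            end = text.find("]]", i + 2)
            if end == -1:
                raise ParseError(f"unclosed [[ at offset {i}")
            out.append("<a>" + text[i + 2:end] + "</a>")
            i = end + 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


good = "see [[home]]"
bad = "see [[home]] and [[about"

print(regex_render(bad))    # the broken link leaks into the output as-is
print(strict_render(good))  # well-formed input renders normally
try:
    strict_render(bad)
except ParseError as err:
    print("parse failed:", err)
```

A real parser-combinator library gives the same behavior with far less bookkeeping, since position tracking and error messages come for free.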

* Though Perl 6 rules look interesting. Here's a nice overview from a Perl 6 talk by gaal, which actually describes them in relation to Parsec and summarizes with "Perl 6 rules can be as powerful as Parsec". In fairness, two things are worth observing: to Haskell's credit, Parsec can be implemented as a Haskell library instead of a language extension, and to Perl's credit, Perl 6 rules are likely to be more powerful.