03:11 pm, 12 Apr 06
regular-expression parsing
"Now you have two problems."
Wikipedia has preliminary support for some new expressions in their markup, including an "if" statement. From the discussion page:
{{if: <condition> | <then text> | <else text> }} :(
Using pipes (|) as the argument separator is problematic when the <text> sections include wiki table markup (i.e. |-, |, ||) and sometimes even wiki links (i.e. [[Page name|Text to display]]). Would it be possible to use some other character to separate the arguments?
The following discussion gets worse, with discussion of how to perhaps allow escaping of pipe characters. :( :(
Every so often someone at work will have some bright idea about a clever way to use Wikipedia data, get excited, and then realize there's basically no way to process Wikipedia content without running a Wikipedia install to convert it to HTML first. (And once you've done that, you've lost all of the semantic-ish markup...)
My personal website is generated with some scripts I wrote years ago that also use regular expressions to parse the page markup. (I'm always surprised this approach isn't more common: it lets me change the look of things all at once without requiring my server to regenerate each page on every view, making things like If-Modified-Since work right.) The regular-expression-based parsing has only bitten me a few times, but that's partly because I kept the markup intentionally simple, since I wasn't sure if I was gonna stick with Ruby.
Anyway, as Yet Another "understand Haskell better" project, I've been idly rewriting the system to use Parsec, which is another one of those Haskell things that is so beautiful that it hurts to look at parsing in other languages*. It also gives me some nice properties, like making syntax errors very visible because the pages fail to parse. But I keep running into situations where I'm not sure of the right way to do things, and then realize I don't know anybody (aside from maybe graydon) who knows what the "right" way is. I suppose that means I ought to subscribe to one of those Haskell mailing lists rather than posting here.
* Though Perl 6 rules look interesting. Here's a nice overview from a Perl 6 talk by gaal, which actually describes them in relation to Parsec and summarizes with "Perl 6 rules can be as powerful as Parsec". Though, in fairness, it's worth observing that, to Haskell's credit, Parsec can be implemented as a Haskell library instead of a language extension, and, to Perl's credit, Perl 6 rules are likely to be more powerful.
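To make the pipe complaint from the discussion page concrete, here is a minimal sketch in Python. It is not MediaWiki's parser and not the Parsec rewrite; the function names (`naive_split`, `link_aware_split`) are made up for illustration. It shows a naive split on "|" miscounting the arguments when the text contains a wiki link, and a slightly smarter splitter that refuses to guess when a link is left unclosed.

```python
def _inner(template: str) -> str:
    """Return the text between '{{if:' and the closing '}}'."""
    text = template.strip()
    if not (text.startswith("{{if:") and text.endswith("}}")):
        raise ValueError(f"not an if-template: {template!r}")
    return text[len("{{if:"):-len("}}")]


def naive_split(template: str) -> list[str]:
    """Split the arguments on every pipe, with no regard for context."""
    return [part.strip() for part in _inner(template).split("|")]


def link_aware_split(template: str) -> list[str]:
    """Split on pipes, except for pipes inside [[...]] wiki links.

    Hypothetical helper: it still knows nothing about table markup
    (|-, ||), so it only addresses part of the complaint.
    """
    inner = _inner(template)
    parts: list[str] = []
    current: list[str] = []
    depth = 0  # how many [[ are currently open
    i = 0
    while i < len(inner):
        if inner.startswith("[[", i):
            depth += 1
            current.append("[[")
            i += 2
        elif inner.startswith("]]", i) and depth > 0:
            depth -= 1
            current.append("]]")
            i += 2
        elif inner[i] == "|" and depth == 0:
            # A pipe outside any link ends the current argument.
            parts.append("".join(current).strip())
            current = []
            i += 1
        else:
            current.append(inner[i])
            i += 1
    if depth != 0:
        # Fail loudly instead of producing a plausible-looking wrong answer.
        raise ValueError("unclosed [[ in template")
    parts.append("".join(current).strip())
    return parts


example = "{{if: x | [[Page name|Text to display]] | none }}"
print(naive_split(example))       # ['x', '[[Page name', 'Text to display]]', 'none']
print(link_aware_split(example))  # ['x', '[[Page name|Text to display]]', 'none']
```

The naive version returns four arguments for a three-argument template. The second version gets this case right, but every extra piece of markup that may contain a pipe needs another special case, which is roughly why the escaping discussion gets worse.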
It's been a while since I tried to understand all that stuff.
You’re thinking of PGE, the Parrot Grammar Engine.
Meanwhile, you may be interested in stuff by Erik Meijer like HSP.
I could imagine an annotated parsing rule, that would say "we accept any amount of spaces here, but the preferred amount is 2 spaces", but that's more like defining both at once rather than defining one and deducing the other.
Once you have unified an expression, there implicitly exists (it is basically the state of your unifier) a unique "canonical" pretty-printed representation of that expression, which you can output.
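One way to picture the "annotated rule" idea from this thread is a single definition that parses loosely and prints canonically. The following is only a sketch in Python under that assumption; `SeparatedList` and `preferred_spaces` are hypothetical names and do not come from Parsec, Perl 6 rules, or PGE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeparatedList:
    """One rule used in both directions.

    Parsing accepts any amount of whitespace around `sep`;
    rendering always emits `preferred_spaces` spaces after it.
    """
    sep: str
    preferred_spaces: int

    def parse(self, text: str) -> list[str]:
        # Whitespace around each item is insignificant, so strip it.
        return [item.strip() for item in text.split(self.sep)]

    def render(self, items: list[str]) -> str:
        # The canonical form uses exactly the preferred spacing.
        return (self.sep + " " * self.preferred_spaces).join(items)


rule = SeparatedList(sep=";", preferred_spaces=2)
items = rule.parse("a;b ;   c")
print(items)               # ['a', 'b', 'c']
print(rule.render(items))  # a;  b;  c
```

Any spacing variant of the same list renders to the same canonical string, and parsing that string gives back the same items.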
attack of the kitchen sink activists!
Wikipedia has preliminary support for some new expressions in their markup, including an "if" statement.
Ugh.
(I don't know if you've looked at wiki languages much. There are a jillion of them, but fortunately there are two standalone parsers, Textile and Markdown, that are becoming widely used. Both of them are a lot better-designed than most, but even they have problems with ambiguity in some cases, especially Textile.)
The way your website regenerates content actually sounds just like MovableType!
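The reason pre-generated pages make If-Modified-Since "work right", as the post puts it, is that each page is a file whose modification time changes only when the page is regenerated. Here is a rough sketch in Python of the check a server could make. The function name `conditional_status` is invented for illustration, and real web servers already do this for static files.

```python
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime
from typing import Optional


def conditional_status(file_mtime: float, if_modified_since: Optional[str]) -> int:
    """Return 304 if the client's cached copy is still current, else 200.

    `file_mtime` is the generated file's modification time in seconds
    since the epoch (for example from os.path.getmtime).
    """
    if if_modified_since is None:
        return 200
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Unparseable header: ignore it and send the page.
        return 200
    if client_time.tzinfo is None:
        # Treat a date without a usable zone as UTC.
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop the fraction.
    file_time = datetime.fromtimestamp(int(file_mtime), tz=timezone.utc)
    return 304 if file_time <= client_time else 200


mtime = 1144854660.5  # an arbitrary example timestamp
header = formatdate(mtime, usegmt=True)  # what the server sent as Last-Modified
print(conditional_status(mtime, header))  # 304
```

With pages rendered on every view there is no such stable timestamp unless the application tracks one itself.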