evan_tech

10:20 pm, 22 Jan 04

more language analysis

At the current rate, I expect my language analyzer to go through about 50,000 posts a day. That isn’t that many, when you consider there are hundreds of posts per minute, but it’s still a lot. Today I wasted a bunch of time trying to work out some of the remaining issues, and they weren’t much fun.

I log the posts it’s unsure about so I can analyze them offline. I noticed one trend of posts that included long strings of gibberish, like “fdadfsafdsa”, “lolololhahaha”, or “that test was harddddddd”, so I decided I should just snip out all words that are longer than 10 characters or so, as well as all characters repeated more than three times.
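A minimal sketch of that filter, in Python rather than the OCaml the analyzer is written in. The names `strip_gibberish`, `MAX_WORD_LEN`, and `MAX_REPEAT` are made up for illustration, the thresholds are one reading of "10 characters or so" and "more than three times", and collapsing a run down to a single character (instead of deleting it outright) is an assumption too.

```python
import re

# Assumed thresholds; the post only says "10 characters or so" and
# "repeated more than three times".
MAX_WORD_LEN = 10
MAX_REPEAT = 3

# One character followed by MAX_REPEAT or more copies of itself (a run of 4+).
REPEATED_CHAR = re.compile(r"(.)\1{%d,}" % MAX_REPEAT, re.DOTALL)
# A run of more than MAX_WORD_LEN word characters.
LONG_WORD = re.compile(r"\w{%d,}" % (MAX_WORD_LEN + 1))


def strip_gibberish(text: str) -> str:
    """Collapse long same-character runs, then drop over-long words."""
    # Collapse first, so "harddddddd" survives as "hard" instead of
    # being judged by its inflated length.
    text = REPEATED_CHAR.sub(r"\1", text)
    # Whatever is still longer than the limit is treated as gibberish.
    return LONG_WORD.sub("", text)


print(strip_gibberish("that test was harddddddd"))  # that test was hard
print(repr(strip_gibberish("lolololhahaha")))       # ''
```

Working on `str` rather than raw bytes means `.` matches a whole character, which sidesteps the byte-level trouble described further down.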

OCaml’s stock regex engine kinda sucks, though, so the only way I could write the character limit would be something like “[A-Za-z0-9][A-Za-z0-9]...7 more...[A-Za-z0-9]+”. And that’s lame enough that I decided I ought to use a Perl-style regex. There’s a nice library and accompanying OCaml bindings, so I sat down to work, and promptly discovered I was somehow corrupting my UTF-8 data.
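For comparison, here is what the counted quantifier buys you. This is a Python sketch, not the OCaml or PCRE code in question: it builds the spelled-out pattern mechanically and checks that it accepts the same words as the `{10,}` form. The variable names and sample words are invented for the example.

```python
import re

CLASS = "[A-Za-z0-9]"

# Without counted repetition: nine plain copies of the class,
# then a tenth copy with "+", giving "ten or more".
spelled_out = re.compile(CLASS * 9 + CLASS + "+")
# With Perl-style counted repetition: the same thing in one token.
counted = re.compile(CLASS + "{10,}")

samples = ["short", "ninechars", "exactly10c", "fdadfsafdsa", "lolololhahaha"]
for word in samples:
    # Both patterns should agree on every sample word.
    agree = bool(spelled_out.fullmatch(word)) == bool(counted.fullmatch(word))
    print(word, bool(counted.fullmatch(word)), agree)
```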

I knew running byte regexes over UTF-8 could be a problem, but I was careful to write them in such a way that it wouldn’t matter. Of course, the repeated-character one does matter, especially because I started out eating pairs. In general, something like s/(.)\1{2,}/$1/ can eat UTF-8, but only in very specific situations: UTF-8 continuation bytes (those after the first) always begin with the two bits “10”, so in any character of three or more bytes, continuation bytes whose remaining six bits (that’s eight minus the two just mentioned) match are identical bytes. (That means Unicode codepoints whose last 12 or 18 bits are repeats of some 6-bit string: 12 bits gives the pair-eater a three-byte character to chew on, and 18 bits gives the three-in-a-row version a four-byte one.)
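To make that concrete, here is a small Python sketch that runs both substitutions over raw bytes. The two codepoints are picked by hand to fit the bit pattern described above (U+4861 for the 12-bit case, U+20820 for the 18-bit case), and `is_valid_utf8` is a helper made up for the example. None of this is the analyzer's actual code.

```python
import re

# The post's substitution on raw bytes: collapse 3+ identical bytes.
TRIPLES = re.compile(rb"(.)\1{2,}", re.DOTALL)
# The earlier "eating pairs" variant: collapse 2+ identical bytes.
PAIRS = re.compile(rb"(.)\1+", re.DOTALL)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+4861: payload bits 0100 100001 100001 -> bytes E4 A1 A1
three_byte = "\u4861".encode("utf-8")
# U+20820: payload bits 000 100000 100000 100000 -> bytes F0 A0 A0 A0
four_byte = "\U00020820".encode("utf-8")

pairs_result = PAIRS.sub(rb"\1", three_byte)      # drops one A1
triples_result = TRIPLES.sub(rb"\1", four_byte)   # drops two A0s

print(three_byte.hex(), "->", pairs_result.hex(), is_valid_utf8(pairs_result))
print(four_byte.hex(), "->", triples_result.hex(), is_valid_utf8(triples_result))

# Matching on decoded text treats each character as one unit,
# so neither character is touched.
text = "\u4861\U00020820"
print(re.sub(r"(.)\1{2,}", r"\1", text, flags=re.DOTALL) == text)
```

Both byte-level results are truncated sequences that no longer decode, while the same substitution over decoded text leaves the characters alone.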

So then I switched everything over to using UTF-8 regexes (it’s a good thing I was already using PCRE because the OCaml ones don’t do UTF-8), and fired it up, and...
segfaults. Lame. Bug in PCRE or the OCaml bindings or something, but I didn’t really want to track it down.
It turned out that stuff like /\S+/ would do it (if the regex was compiled with the UTF8 flag), yet the equivalent /[^\s]+/ wouldn’t.

Anyway, it works now.

(Oh, and here’s my post of the output after half a day.)