February 26th, 2004

  • evan

damn i love statically-typed languages

I had been thinking of switching my language-analyzer to use trigrams: basically, slightly tweaking a data structure that's at the core of all of the code, the sort of change that would give me nightmares when working on (for example) LiveJournal. But all I had to do was (effectively) change type ngram = (int * int) to type ngram = (int * int * int) and step through the compiler errors. As far as I can tell, it worked perfectly the first time.
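For a rough idea of what that kind of change looks like, here is a minimal OCaml sketch. The function name ngrams_of_codes and the idea that each int is a character code are assumptions for illustration, not the analyzer's real code. Once the type becomes a triple, any pattern still written for a pair stops compiling, which is what makes "step through the errors" work.

```ocaml
(* Hypothetical sketch: the n-gram type plus one function that depends on it. *)
type ngram = int * int * int   (* was: int * int *)

(* Slide a window of three character codes over the input.
   With the old pair type, the pattern and the tuple below would each
   have had two components; the compiler flags every such spot. *)
let rec ngrams_of_codes (codes : int list) : ngram list =
  match codes with
  | a :: (b :: c :: _ as rest) -> (a, b, c) :: ngrams_of_codes rest
  | _ -> []
```

Inputs shorter than three codes simply produce no trigrams.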

Elsewhere: reading Knuth's TeXbook. Typesetting is hott.

  • evan

neat bug

Emboldened by how easy it was to add trigrams, I tried lowering the "skipped" threshold. (Before, the high threshold meant that I couldn't score a sixth of the posts.) I ran the test and found that some Russian-looking text came out identified as Chinese.

[skip forward a couple of hours poking around, thinking this was the karmic punishment for the crowing in the last post]

I finally discovered that the trigrams were getting scored with the "never-before-seen" probabilities of each language set; the way the math worked out, Chinese scored completely unknown text the highest.
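A sketch of how that failure can arise, under assumptions: the record type model, the field unseen_logprob, and the functions score and best are all made up for illustration, and scoring is assumed to be a sum of per-trigram log-probabilities with a per-language fallback for unseen trigrams. If every trigram in a text is unseen, the winner is decided purely by which language has the least-negative fallback.

```ocaml
type ngram = int * int * int

(* Hypothetical per-language model: log-probabilities for trigrams seen in
   training, plus a fallback used for any trigram never seen. *)
type model = {
  name : string;
  seen : (ngram, float) Hashtbl.t;
  unseen_logprob : float;
}

(* Sum the log-probability of each trigram, using the fallback when the
   trigram is missing from the table. *)
let score (m : model) (grams : ngram list) : float =
  List.fold_left
    (fun acc g ->
      match Hashtbl.find_opt m.seen g with
      | Some lp -> acc +. lp
      | None -> acc +. m.unseen_logprob)
    0.0 grams

(* Return the name of the highest-scoring model. *)
let best (models : model list) (grams : ngram list) : string =
  match models with
  | [] -> invalid_arg "best: no models"
  | first :: rest ->
    let pick (bm, bs) m =
      let s = score m grams in
      if s > bs then (m, s) else (bm, bs)
    in
    let (winner, _) = List.fold_left pick (first, score first grams) rest in
    winner.name
```

With two models whose tables contain none of the input trigrams, best just returns whichever has the larger unseen_logprob, regardless of what script the text is in.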

So why were the trigrams unrecognized? Even I could identify the text as Cyrillic. My hypothesis: someone with a Cyrillic keyboard just jammed on the keys randomly. Luckily, I'm in the computer science building, so I just poked my Russian friend across the room and asked him to take a look at it. He read it to me, and said it was an odd sentence (something about terrorism) but that it looked otherwise normal. I sat back down, dejected. And then, as he walked away, he mumbled, "Oh, and it was all in caps."

Capitals! I already downcase all English-looking text, but I completely forgot about dealing with capitals for other character sets. I hadn't trained anything with any all-caps Russian or Belarusian, so the trigrams of three Cyrillic capitals would of course never have been seen.

Now to find some tables mapping capitals to lowercase...
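For basic Cyrillic, a full table may not even be needed. Here is a minimal sketch, assuming the ints in each trigram are Unicode code points (the post doesn't say which encoding is in use, and the name downcase_code is made up). In Unicode, the capitals U+0410 to U+042F sit exactly 0x20 below their lowercase forms, and U+0400 to U+040F (which includes Ё, І, and Ў) sit 0x50 below theirs.

```ocaml
(* Hypothetical helper: lowercase one code point, assuming Unicode input.
   Covers ASCII and the basic Cyrillic range U+0400..U+042F only. *)
let downcase_code (c : int) : int =
  if c >= 0x41 && c <= 0x5A then c + 0x20            (* ASCII A-Z -> a-z *)
  else if c >= 0x0410 && c <= 0x042F then c + 0x20   (* А-Я -> а-я *)
  else if c >= 0x0400 && c <= 0x040F then c + 0x50   (* Ѐ-Џ -> ѐ-џ *)
  else c                                             (* everything else unchanged *)
```

Letters beyond U+045F (such as Ukrainian Ґ) pair up differently and are not handled here, so a real table is still the safer route for anything past Russian and Belarusian.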