06:52 pm, 26 Feb 04

neat bug

Emboldened by how easy it was to add trigrams, I tried lowering the "skipped" threshold. (Before, the high threshold meant that I couldn't score a sixth of the posts.) I ran the test and found that some Russian-looking text came out identified as Chinese.

[skip forward a couple of hours poking around, thinking this was the karmic punishment for the crowing in the last post]

I finally discovered that the trigrams were getting scored with the "never-before-seen" probabilities of each language set; the way the math worked out, Chinese scored completely unknown text the highest.
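A rough Python sketch of how this kind of failure can happen. Everything in it is an assumption made for illustration: the `models` table, the `score` and `best_language` helpers, and all of the probabilities are made up, and the real scorer may work differently. The only point is that when every trigram is unseen, the winner is decided entirely by each language's fallback probability.

```python
import math

def trigrams(text):
    """Return every run of three consecutive characters in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

# Hypothetical toy models: log-probabilities for trigrams seen in training,
# plus a per-language fallback for trigrams never seen. All numbers invented.
models = {
    "russian": {
        "seen": {
            "при": math.log(0.010),
            "рив": math.log(0.008),
            "иве": math.log(0.008),
            "вет": math.log(0.009),
        },
        "unseen": math.log(1e-7),
    },
    "chinese": {
        "seen": {},
        # Assumed to be a slightly larger fallback than Russian's.
        "unseen": math.log(1e-6),
    },
}

def score(text, model):
    """Sum the log-probabilities of the trigrams, using the fallback for unseen ones."""
    return sum(model["seen"].get(t, model["unseen"]) for t in trigrams(text))

def best_language(text):
    """Pick the language whose model gives the text the highest score."""
    return max(models, key=lambda lang: score(text, models[lang]))

print(best_language("привет"))  # lowercase trigrams are known to the Russian model
print(best_language("ПРИВЕТ"))  # all-caps trigrams are unseen in both models
```

With these made-up numbers the lowercase word goes to Russian, while the all-caps word falls through to the fallback in both models, and the language with the larger fallback takes it.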

So why were the trigrams unrecognized? Even I could identify it as Cyrillic. My hypothesis: someone with a Cyrillic keyboard just jammed on the keys randomly. Luckily, I'm in the computer science building, so I just poked my Russian friend across the room and asked him to take a look at it. He read it to me, and said it was an odd sentence (something about terrorism) but that it looked otherwise normal. I sat back down, dejected. And then, as he walked away, he mumbled, "Oh, and it was all in caps."

Capitals! I already downcase all English-looking text, but I completely forgot about dealing with capitals for other character sets. I hadn't trained anything with any all-caps Russian or Belarusian, so the trigrams of three Cyrillic capitals would of course never have been seen.

Now to find some tables mapping capitals to lowercase...
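For what it's worth, some languages ship those tables already. As a sketch, assuming Python 3 (which may not be what this project uses), `str.lower()` applies the Unicode case mappings, so it covers Cyrillic as well as Latin. The `normalize` and `trigrams` helpers here are hypothetical names for illustration.

```python
def normalize(text):
    """Lowercase text using Python's built-in Unicode case mapping."""
    return text.lower()

def trigrams(text):
    """Return every run of three consecutive characters in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

# After normalizing, an all-caps Cyrillic word yields the same trigrams
# as its lowercase form, so a model trained on lowercase text can match it.
print(trigrams(normalize("ТЕРРОР")) == trigrams("террор"))
```

Normalizing both the training text and the text being scored the same way keeps the two sets of trigrams comparable.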