Evan Martin (evan) wrote in evan_tech,

neat bug

Emboldened by how easy it was to add trigrams, I tried lowering the "skipped" threshold. (Before, the high threshold meant that I couldn't score a sixth of the posts.) I ran the test and found that some Russian-looking text came out identified as Chinese.

[skip forward a couple of hours poking around, thinking this was the karmic punishment for the crowing in the last post]

I finally discovered that the trigrams were getting scored with the "never-before-seen" probabilities of each language set; the way the math worked out, Chinese scored completely unknown text the highest.
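To make the failure mode concrete, here is a minimal sketch of how it can happen. It is not the actual classifier: the function names, the two toy models, and every probability below are made up for illustration, and it assumes scoring is a sum of log-probabilities with a per-language fallback for unseen trigrams.

```python
import math

# Hypothetical toy models: (trigram -> probability, fallback for unseen trigrams).
# All numbers are invented for illustration only.
MODELS = {
    "russian": ({"при": 0.01, "рив": 0.008}, 1e-6),
    "chinese": ({}, 1e-5),
}

def score(trigrams, table, unseen_prob):
    """Sum of log-probabilities, using unseen_prob for unknown trigrams."""
    return sum(math.log(table.get(t, unseen_prob)) for t in trigrams)

def classify(trigrams, models=MODELS):
    """Return the language whose model gives the highest score."""
    return max(models, key=lambda lang: score(trigrams, *models[lang]))

# Every trigram is unknown to both models, so only the fallbacks matter,
# and the language with the largest fallback wins.
print(classify(["ПРИ", "РИВ"]))
# Known trigrams let the right model win.
print(classify(["при", "рив"]))
```

When every trigram falls through to the fallback, the comparison has nothing to do with the text any more; it just ranks the languages by their fallback values.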

So why were the trigrams unrecognized? Even I could identify it as Cyrillic. My hypothesis: someone with a Cyrillic keyboard just jammed on the keys randomly. Luckily, I'm in the computer science building, so I just poked my Russian friend across the room and asked him to take a look at it. He read it to me, and said it was an odd sentence (something about terrorism) but that it looked otherwise normal. I sat back down, dejected. And then, as he walked away, he mumbled, "Oh, and it was all in caps."

Capitals! I already downcase all English-looking text, but I completely forgot about dealing with capitals for other character sets. I hadn't trained anything with any all-caps Russian or Belarusian, so the trigrams of three Cyrillic capitals would of course never have been seen.

Now to find some tables mapping capitals to lowercase...
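For what it's worth, here is a sketch of what the fix could look like in a language whose standard library already carries those tables. This is an assumption about the approach, not the code I ended up with: `extract_trigrams` is a hypothetical helper, and it leans on Python's `str.lower()`, which uses the Unicode case mappings and so handles Cyrillic as well as ASCII.

```python
def extract_trigrams(text):
    """Lowercase the text, then return its overlapping character trigrams.

    str.lower() applies Unicode case mappings, so Cyrillic capitals are
    folded the same way ASCII capitals are.
    """
    normalized = text.lower()
    return [normalized[i:i + 3] for i in range(len(normalized) - 2)]

# All-caps and lowercase input now produce the same trigrams.
print(extract_trigrams("ПРИВЕТ") == extract_trigrams("привет"))
```

The point of normalizing before extraction is that training and scoring see the same trigrams regardless of how the post was typed.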