I’ve been storing the letter counts as bigrams across bytes because there’s a nice data structure for that: a 256x256 array of OCaml floats is under a megabyte. The problem is that with UTF-8 data, Cyrillic characters are two (or is it three?) bytes each, so two Cyrillic languages look more or less similar to byte bigrams. There are two approaches to take:
- Use Unicode codepoints instead of bytes. Now my arrays would need to be 220x220. Too big.
- Use trigrams. Now I need 3d arrays. Too big.
I had breakfast and later went to a party with Christophe from Google last weekend. (He’s trying to recruit me, I guess, and had all sorts of great stories about the environment there, how he’s moving to New York for a few months just for fun, etc.) And it turns out he’s been working on a very similar problem.
He works in Google’s “quality” department, which concerns itself with returning better search results (as opposed to, for example, the hardware engineering departments, etc.), and because of that he’s not allowed to talk much about what they do. (He told me that they even request people working in quality don’t go to search-related conventions for fear of them socializing with people from other companies.) But he did tell me a few things, including that his current project is working on guessing the language for pages that don’t have very much text. And he had some super-vague technical details about what Google does... but I think their approach is keyed to pages where the character set is also unknown.
My LJ analysis is much easier because my data is guaranteed UTF-8, which makes me think that I ought to go for approach one and just use hash tables. Interestingly, the problems he dealt with were quite different than mine: he talked about two northern European languages and how similar they are (they all look like German with funny spelling to me) and how it’s nearly impossible to distinguish them. I’m still in the easy stages of distinguishing English from German; I don’t look forward to encountering the hard problems.