... In short, the goal of the software is the automatic development of large text corpora for minority languages.
The basic idea is to simultaneously bootstrap a text corpus and spell checking database (word list) for a given language. The key tool is the Google API. Words from an initial database (a few hundred words suffice) are fed to Google, which returns a list of sites containing these words. These are downloaded, processed into plain text, and formatted. A combination of statistical techniques based on the initial word list is used to determine which documents (or sections thereof) are written in the desired language. These documents are added to the corpus, and further statistical analyses (frequencies, character n-grams) are used to find reliable candidate words for addition to the spell checking database. Repeat. Little or no human intervention is required.
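To make the loop concrete, here is a minimal sketch of one bootstrap round in Python. It is not the project's actual code: the search and download steps are left out (the documents are assumed to be plain text already), and the function names, the equal weighting of the two signals, the `threshold`, and the `min_count` cutoff are all assumptions chosen for illustration. It scores each document by how many of its words are in the current word list and how many of its character trigrams have been seen in that list, keeps the documents that score high enough, and promotes words that recur in the kept documents.

```python
import re
from collections import Counter

# Runs of Unicode letters only (no digits, no underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercased word tokens of a plain-text document."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def char_ngrams(word, n=3):
    """Character n-grams of a word, padded so word edges count too."""
    padded = f"_{word}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(words, n=3):
    """Set of all character n-grams seen in the current word list."""
    return {g for w in words for g in char_ngrams(w, n)}


def language_score(text, word_list, profile, n=3):
    """Average of two signals: share of known words, share of familiar n-grams.

    The 50/50 weighting is an arbitrary assumption for this sketch.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(t in word_list for t in tokens) / len(tokens)
    grams = [g for t in tokens for g in char_ngrams(t, n)]
    familiar = sum(g in profile for g in grams) / len(grams)
    return 0.5 * known + 0.5 * familiar


def bootstrap_round(documents, word_list, threshold=0.5, min_count=2):
    """One pass: accept likely in-language documents, then grow the word list.

    threshold and min_count are hypothetical tuning knobs.
    """
    profile = build_profile(word_list)
    accepted = [
        d for d in documents
        if language_score(d, word_list, profile) >= threshold
    ]
    counts = Counter(t for d in accepted for t in tokenize(d))
    candidates = {
        w for w, c in counts.items()
        if c >= min_count and w not in word_list
    }
    return accepted, word_list | candidates


# Tiny illustration with an Irish seed list (sample data, not real output).
seed = {"agus", "an", "tá", "sé", "ar", "is", "bhí", "na"}
docs = [
    "Tá an madra ar an mbóthar agus tá an madra mór",
    "The quick brown fox jumps over the lazy dog",
]
accepted, words = bootstrap_round(docs, seed)
```

Calling `bootstrap_round` again with the enlarged `words` set and a fresh batch of documents is the "Repeat" step. A real system would need a much stricter candidate filter than a raw frequency cutoff, since one misclassified document can pollute every later round.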
corpus for minority languages