... In short, the goal of the software is the automatic development of large text corpora for minority languages.
The basic idea is to simultaneously bootstrap a text corpus and spell-checking database (word list) for a given language. The key tool is the Google API. Words from an initial database (a few hundred words suffice) are fed to Google, which returns a list of sites containing these words. These are downloaded, processed into plain text, and formatted. A combination of statistical techniques based on the initial word list is used to determine which documents (or sections thereof) are written in the desired language. These documents are added to the corpus, and further statistical analyses (frequencies, character n-grams) are used to find reliable candidate words for addition to the spell-checking database. The cycle then repeats with the enlarged word list. Little or no human intervention is required.
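To make the loop described above concrete, here is a minimal sketch in Python. It is not the tool's actual code. The search step is replaced by a hypothetical `fetch_documents` callable (here backed by a tiny in-memory `TOY_WEB` rather than any real search API). The language test is a simple known-word ratio, and candidate words are picked by raw frequency. Both are stand-ins for the unspecified "statistical techniques" in the description. The names `bootstrap`, `known_ratio`, `threshold`, `min_count` and `max_rounds` are all my own assumptions.

```python
import re
from collections import Counter

# Runs of letters only (no digits or underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercased word tokens of a plain-text document."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def known_ratio(tokens, word_list):
    """Fraction of tokens already in the word list (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(t in word_list for t in tokens) / len(tokens)


def bootstrap(seed_words, fetch_documents, threshold=0.5, min_count=2,
              max_rounds=10):
    """Grow a corpus and a word list together.

    fetch_documents is a hypothetical stand-in for the search-and-download
    step: it takes a list of words and returns plain-text documents.
    """
    word_list = set(seed_words)
    corpus = []
    accepted = set()
    for _ in range(max_rounds):
        new_docs = []
        for doc in fetch_documents(sorted(word_list)):
            if doc in accepted:
                continue
            # Assumed language test: enough tokens are already known words.
            if known_ratio(tokenize(doc), word_list) >= threshold:
                accepted.add(doc)
                new_docs.append(doc)
        if not new_docs:
            break  # nothing new was accepted, so the word list cannot grow
        corpus.extend(new_docs)
        # Assumed candidate rule: words seen at least min_count times
        # across the accepted corpus are added to the word list.
        counts = Counter(t for doc in corpus for t in tokenize(doc))
        word_list |= {w for w, c in counts.items() if c >= min_count}
    return corpus, word_list


# A toy "web" for illustration only; the sentences are made up.
TOY_WEB = [
    "an madra agus an cat",
    "an cat beag ar an madra",
    "an apple and an orange for the quick fox",
    "the quick brown fox",
]


def toy_fetch(words):
    """Return toy documents containing at least one of the query words."""
    wanted = set(words)
    return [doc for doc in TOY_WEB if wanted & set(tokenize(doc))]


if __name__ == "__main__":
    corpus, words = bootstrap({"an", "agus", "ar"}, toy_fetch)
    print(corpus)
    print(sorted(words))
```

With the three seed words, the first two toy documents pass the threshold and contribute "madra" and "cat" to the word list. The third is fetched (it contains "an") but rejected as mostly unknown, and the fourth is never fetched. A real system would need a far more careful language test than this ratio, which is presumably why the description mentions character n-grams.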
corpus for minority languages