Evan Martin (evan) wrote in evan_tech,

corpus for minority languages


... In short, the goal of the software is the automatic development of large text corpora for minority languages.

The basic idea is to simultaneously bootstrap a text corpus and spell checking database (word list) for a given language. The key tool is the Google API. Words from an initial database (a few hundred words suffice) are fed to Google, which returns a list of sites containing these words. These are downloaded, processed into plain text, and formatted. A combination of statistical techniques based on the initial word list is used to determine which documents (or sections thereof) are written in the desired language. These documents are added to the corpus, and further statistical analyses (frequencies, character n-grams) are used to find reliable candidate words for addition to the spell checking database. Repeat. Little or no human intervention is required.
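The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: `search` is a hypothetical callable standing in for the web search and download step, the language test is a plain known-word fraction rather than the combination of statistical techniques the description mentions, and the `accept` and `min_count` thresholds are invented for the example.

```python
import re
from collections import Counter

# Runs of letters only (no digits or underscores); Unicode-aware.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [w.lower() for w in WORD_RE.findall(text)]


def known_fraction(tokens, wordlist):
    """Fraction of tokens that already appear in the word list."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in wordlist) / len(tokens)


def bootstrap(seed_words, search, rounds=3, accept=0.5, min_count=2):
    """Grow a corpus and a word list together.

    `search(word)` is a hypothetical stand-in for the search engine: it
    must return an iterable of (doc_id, plain_text) pairs for that word.
    """
    wordlist = set(seed_words)
    corpus = {}
    for _ in range(rounds):
        counts = Counter()
        for word in sorted(wordlist):
            for doc_id, text in search(word):
                if doc_id in corpus:
                    continue  # already accepted in an earlier query
                tokens = tokenize(text)
                # Assumed language test: enough tokens are known words.
                if known_fraction(tokens, wordlist) >= accept:
                    corpus[doc_id] = text
                    counts.update(t for t in tokens if t not in wordlist)
        # Assumed candidate filter: unseen words that are frequent enough.
        new_words = {w for w, c in counts.items() if c >= min_count}
        if not new_words:
            break  # nothing left to learn, stop repeating
        wordlist |= new_words
    return corpus, wordlist


# Toy in-memory "web" so the sketch runs without any network access.
docs = {
    "a": "agus an bhean agus an fear",
    "b": "an the cat and the dog sat",
    "c": "bhean fear teach bhean fear teach",
}


def toy_search(word):
    return [(d, t) for d, t in docs.items() if word in tokenize(t)]


corpus, wordlist = bootstrap({"agus", "an"}, toy_search, min_count=1)
```

In the toy run, document "c" shares no words with the seed list and is only reached in the second round, after "bhean" and "fear" have been learned from document "a". That is the bootstrapping effect the description relies on. A real version would need a stricter language test, since a low `min_count` lets stray foreign words leak into the word list.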
