Evan Martin (evan) wrote in evan_tech,

corpus for minority languages


... In short, the goal of the software is the automatic development of large text corpora for minority languages.

The basic idea is to simultaneously bootstrap a text corpus and spell checking database (word list) for a given language. The key tool is the Google API. Words from an initial database (a few hundred words suffice) are fed to Google, which returns a list of sites containing these words. These are downloaded, processed into plain text, and formatted. A combination of statistical techniques based on the initial word list is used to determine which documents (or sections thereof) are written in the desired language. These documents are added to the corpus, and further statistical analyses (frequencies, character n-grams) are used to find reliable candidate words for addition to the spell checking database. Repeat. Little or no human intervention is required.
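The loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: `fetch_documents` is a hypothetical callable standing in for the search-and-download step (the Google API query plus conversion to plain text), the language test is a simple "fraction of known words" threshold rather than the combination of statistical techniques the description mentions, and candidate words are picked by raw frequency alone, with no character n-gram analysis.

```python
import re
from collections import Counter

# Runs of Unicode letters (no digits or underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split plain text into lowercase word tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def in_language(tokens, word_list, threshold=0.5):
    """Assumed heuristic: accept a document if enough tokens are known words."""
    if not tokens:
        return False
    known = sum(1 for t in tokens if t in word_list)
    return known / len(tokens) >= threshold


def bootstrap(seed_words, fetch_documents, rounds=3, threshold=0.5, min_count=2):
    """Grow a corpus and a word list together.

    fetch_documents(words) is a hypothetical stand-in for querying a search
    engine with the current word list and returning plain-text documents.
    """
    word_list = {w.lower() for w in seed_words}
    corpus = []
    seen = set()
    for _ in range(rounds):
        counts = Counter()
        for doc in fetch_documents(sorted(word_list)):
            if doc in seen:
                continue
            tokens = tokenize(doc)
            if in_language(tokens, word_list, threshold):
                seen.add(doc)
                corpus.append(doc)
                # Count unknown words only in documents judged to be in-language.
                counts.update(t for t in tokens if t not in word_list)
        # Words seen often enough become candidates for the word list.
        new_words = {w for w, c in counts.items() if c >= min_count}
        if not new_words:
            break  # nothing new learned; stop repeating
        word_list |= new_words
    return corpus, word_list
```

In practice the threshold and minimum count would need tuning per language, and a real run would probably score sections of a page separately, since the description notes that only parts of a document may be in the target language.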
