Evan Martin (evan) wrote in evan_tech,

corpus for minority languages


... In short, the goal of the software is the automatic development of large text corpora for minority languages.

The basic idea is to simultaneously bootstrap a text corpus and spell checking database (word list) for a given language. The key tool is the Google API. Words from an initial database (a few hundred words suffice) are fed to Google, which returns a list of sites containing these words. These are downloaded, processed into plain text, and formatted. A combination of statistical techniques based on the initial word list is used to determine which documents (or sections thereof) are written in the desired language. These documents are added to the corpus, and further statistical analyses (frequencies, character n-grams) are used to find reliable candidate words for addition to the spell checking database. Repeat. Little or no human intervention is required.
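The loop described above can be sketched roughly as follows. This is a toy illustration, not the tool's actual code: `fetch_documents` is a hypothetical callable standing in for the search-and-download step (no real search API is called here), the "statistical techniques" are reduced to a simple known-word ratio, and the candidate-word test is just a document-frequency cutoff. The names `threshold` and `min_docs` are assumptions made for the example.

```python
import re
from collections import Counter

# Runs of letters only (Unicode-aware), so accented words stay whole.
TOKEN = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [t.lower() for t in TOKEN.findall(text)]


def in_language(tokens, word_list, threshold=0.5):
    """Crude language check: the share of tokens already in the word list."""
    if not tokens:
        return False
    known = sum(1 for t in tokens if t in word_list)
    return known / len(tokens) >= threshold


def bootstrap(seed_words, fetch_documents, rounds=2, threshold=0.5, min_docs=2):
    """Grow a corpus and a word list together.

    fetch_documents is a hypothetical stand-in: it takes the current
    word list and returns an iterable of plain-text documents.
    """
    word_list = {w.lower() for w in seed_words}
    corpus = []
    seen = set()
    for _ in range(rounds):
        for text in fetch_documents(sorted(word_list)):
            if text in seen:
                continue
            if in_language(tokenize(text), word_list, threshold):
                seen.add(text)
                corpus.append(text)
        # Count how many accepted documents each word appears in, and
        # promote words seen in at least min_docs of them.
        doc_freq = Counter()
        for text in corpus:
            doc_freq.update(set(tokenize(text)))
        word_list |= {w for w, n in doc_freq.items() if n >= min_docs}
    return corpus, word_list
```

Calling `bootstrap(seed, fetch)` with a small seed list and any function that returns candidate texts yields the accepted documents and the expanded word list. A real system would presumably use stronger signals (such as the character n-grams mentioned above) before trusting a new word.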
