August 22nd, 2005

  • evan


Coding question:
Suppose you have a large (on-disk, too large for memory) hash table that you'd like to pull n random entries out of. You don't know the total number of entries, just that there are more than n. Assume you have an API like "for key, value in ..." How do you do it?

Update: I made the question clearer and added an answer here, after the cut and in white-on-white. The comments discuss the solution, so don't read 'em unless you want spoilers.
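Spoiler for anyone who wants one: a standard technique that fits these constraints (one pass, unknown total count, only n entries held in memory) is reservoir sampling. The sketch below is an illustration of that technique, not necessarily the answer hidden behind the cut. The name `sample_entries` and its parameters are made up for this example, and it assumes the table can be iterated as (key, value) pairs.

```python
import random

def sample_entries(items, n, rng=random):
    """Pick n entries uniformly at random in a single pass over items.

    items is any iterable of entries, e.g. (key, value) pairs; its length
    does not need to be known in advance. Hypothetical helper, for
    illustration only.
    """
    reservoir = []
    for seen, entry in enumerate(items, start=1):
        if len(reservoir) < n:
            # Fill the reservoir with the first n entries.
            reservoir.append(entry)
        else:
            # Keep the new entry with probability n / seen by drawing a
            # slot in [0, seen) and replacing only if it lands in the reservoir.
            j = rng.randrange(seen)
            if j < n:
                reservoir[j] = entry
    return reservoir
```

With a dict-like table this would be called as `sample_entries(table.items(), n)`. Only the n sampled entries are ever held in memory, which is the point when the table itself lives on disk.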
  • evan

NIST results

NIST 2005 Machine Translation Evaluation Official Results. As I mentioned before, Franz & co. (congrats again, hawk!) totally rocked it.

But it's worth noting that (at least according to Franz's papers from before Google; I don't know much about what they're actually doing here) part of his approach is to use the BLEU score as the objective function in training. This does make sense: the BLEU score was designed to correlate with human judgments of translation quality, so it's a reasonable function to optimize. And the sentences they were given to translate must have been entirely separate from all available training data. But still, it feels a little weird to me that you'd optimize on the metric used to judge; it means you can make "simple" translation mistakes (at least to a human observer) and still get a good score as long as the scoring function doesn't account for those sorts of mistakes.