evan_tech

Previous Entry Share Next Entry
01:44 pm, 22 Aug 05

subsetting

Coding question:
Suppose you have a large (on-disk, too large for memory) hash table that you'd like to pull n random entries out of. You don't know the total number of entries, just that there are more than n. Assume you have an API like "for key, value in table.stream(): ..." How do you do it?

Update: made it more clear, and added an answer here, after the cut and in white-on-white. The comments discuss the solution, so don't read 'em unless you want spoilers.
This is related to a common interview question, and one that I think is sorta lame. You could maybe expect an interviewee to solve it with some probability and some induction when n=1, but it feels to me like it's a trick question.
(It is instructive to solve it for n=1, and the hint is to use induction on big-N, the length of the data.)

But the answer that rjpower gave me recently sorta made me feel dumb (or, perhaps, too clever for my own good -- I was busily implmenting something more complicated): If your hashing function is good (which is sort of the point of a hash table), the keys are already in "random" order; just grab the first n and you're done.

(I suppose you could argue this depends on the organization of your hash table and what it means to be "random"...)