evan_tech

01:44 pm, 22 Aug 05

subsetting

Coding question:
Suppose you have a large (on-disk, too large for memory) hash table that you'd like to pull n random entries out of. You don't know the total number of entries, just that there are more than n. Assume you have an API like "`for key, value in table.stream(): ...`" How do you do it?

Update: made it more clear, and added an answer here, after the cut and in white-on-white. The comments discuss the solution, so don't read 'em unless you want spoilers.
This is related to a common interview question, and one that I think is sorta lame. You could maybe expect an interviewee to solve it with some probability and some induction when n=1, but it feels to me like it's a trick question.
(It is instructive to solve it for n=1, and the hint is to use induction on big-N, the length of the data.)
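For the curious, the n=1 case can be sketched in a few lines of Python (this assumes the stream yields one entry at a time; the 1/i replacement probability is where the induction hint comes in):

```python
import random

def sample_one(stream):
    """Keep the i-th item with probability 1/i. By induction on N,
    after N items each one has survived with probability exactly 1/N."""
    chosen = None
    for i, item in enumerate(stream, start=1):
        if random.randrange(i) == 0:  # true with probability 1/i
            chosen = item
    return chosen
```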

But the answer I was given recently sorta made me feel dumb (or, perhaps, too clever for my own good -- I was busily implementing something more complicated): If your hashing function is good (which is sort of the point of a hash table), the keys are already in "random" order; just grab the first n and you're done.

(I suppose you could argue this depends on the organization of your hash table and what it means to be "random"...)
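In code, the "grab the first n" answer is about one line; this assumes a `table` object with the `stream()` API described above:

```python
import itertools

def first_n(table, n):
    # If the hash function scatters keys well, iteration order is
    # effectively random, so the first n entries are a random sample.
    return list(itertools.islice(table.stream(), n))
```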

•  Exactly what I was going to ask. :) If so, it's easy... just iterate over all rows, picking ones where int(rand(remaining_rows)) == 0. But I imagine he doesn't know the size.
•  (scaling probability for n, and stopping early when n is 0....)
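If the total row count really is known, the approach in this thread is essentially Knuth's selection sampling (Algorithm S, TAOCP vol. 2); a rough Python sketch, with made-up names:

```python
import random

def select_n(stream, total, n):
    """Pick exactly n of `total` items, each subset equally likely.
    Requires knowing `total` up front, unlike the original problem."""
    remaining, needed = total, n
    out = []
    for item in stream:
        # Select this item with probability needed/remaining.
        if random.randrange(remaining) < needed:
            out.append(item)
            needed -= 1
            if needed == 0:
                break  # stop early once n items are chosen
        remaining -= 1
    return out
```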
•  i don't know a definitive answer to this, but the obvious solution seems to be something like: store the first N sequential entries into a table, then with some probability on each iteration, drop and replace a random entry in the table with the last entry read from the stream. This seems to favour entries near the beginning of the stream, though, unless there's a way to vary the replacement probability...
•  (My statistics/probability background sucks, so the following might be wrong; I welcome corrections.) It actually favors either the initial N or a window of the past M entries (for some M). If S is the size of your set, and P is your probability, then: P > N/S favors the last M entries (M >= N/P, I think); P = N/S is fair(?); P < N/S favors the initial N entries (unlikely to pull N fresh elements out of the whole set). If your P is too high, there is some large M such that, when you try M iters, you will eventually have more than N hits; thus for any dataset with S > M+N, the first N elements are less likely to be selected than the next M elements. That said, I'd be happy with this solution as a first approximation. It's a good idea (better than any of my ideas, for sure), and I think its drawbacks will be relatively universal.
•  Hmm, I completely mis-read your problem statement, so here's what I wrote: I solved a similar problem, but I was updating the hash table, not just reading it, and I had control over how many entries I was updating at a time. (Basically, the problem I solved was "how do you store 2^40 entries in a hash table without doing a disk seek for every entry?") The main trick was to generate a large list, then sort on key. If n is large enough that you average more than one entry per disk block, this saves you enormous latency. If n is small enough that you average less than one entry per disk block, the sorting technique should still be beneficial because near-track seeks are cheaper than cross-disk seeks. For your actual problem, I'll assume that you know N (the number of keys in your on-disk hash) and n (the number of random entries you'd like). Then you simply calculate the ratio r = (float)n/N and generate a uniformly distributed random floating point value in [0,1] for each entry in your loop: x = drand48();. Then if (x
•  hmm. i'm not convinced that's "random", since the result can be determined from the value of the input data... surely a random set should have some external factor, as well?
•  Hm, that's a fair point. But it is random the first time, as long as you don't really know the relationship between the data and the hash function involved.
•  Ever since I heard this question, I implement the probabilistic solution in code whenever I get the chance. :)
• Anonymous

Classic problem

For a classic solution to the slightly more general problem (how do you maintain a random sample of fixed size over a stream?), see Vitter's paper on reservoir sampling.

The basic premise is that if you want to maintain a sample of size s:
iterate over the hash table
for each key: if you have seen fewer than s values so far, add it to the sample.
: if you have seen more (say this is the i-th value), then with probability s/i add it to the sample and evict a random item; with probability (1 - s/i) ignore it.

-anon lurker
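A rough Python sketch of reservoir sampling (Vitter's Algorithm R), assuming the stream yields one entry at a time; note the replacement probability for the i-th item is s/i:

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of size s over a stream
    of unknown length (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)
        else:
            # With probability s/i, replace a random reservoir slot.
            j = random.randrange(i)
            if j < s:
                reservoir[j] = item
    return reservoir
```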
Selecting the first `n` has another (minor) good property. If the source hash table has `s` entries and is too large to fit in memory, and `n` may be any number up to `s-1`, the result may not fit into memory. Transferring the first `n` from the source to a target stream (i.e. disk) is still easy.