evan_tech

01:44 pm, 22 Aug 05

subsetting

Coding question:
Suppose you have a large (on-disk, too large for memory) hash table that you'd like to pull n random entries out of. You don't know the total number of entries, just that there are more than n. Assume you have an API like "for key, value in table.stream(): ..." How do you do it?

Update: made it clearer, and added an answer here, after the cut, in white-on-white. The comments discuss the solution, so don't read 'em unless you want spoilers.
This is related to a common interview question, and one that I think is sorta lame. You could maybe expect an interviewee to solve it with some probability and some induction when n=1, but it feels to me like it's a trick question.
(It is instructive to solve it for n=1, and the hint is to use induction on big-N, the length of the data.)
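The n=1 inductive solution the hint points at can be sketched like this (a minimal sketch; `sample_one` is my name, not anything from the post):

```python
import random

def sample_one(stream):
    """Return one uniformly random item from a stream of unknown length.

    Inductive argument: after seeing i items, each has survived with
    probability 1/i.  When item i+1 arrives we keep it with probability
    1/(i+1); any previous survivor remains with probability
    (1/i) * (i/(i+1)) = 1/(i+1).
    """
    chosen = None
    for i, item in enumerate(stream, start=1):
        if random.randrange(i) == 0:  # true with probability 1/i
            chosen = item
    return chosen
```

One pass, O(1) memory, and no need to know big-N up front.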

But the answer I got recently sorta made me feel dumb (or, perhaps, too clever for my own good -- I was busily implementing something more complicated): If your hashing function is good (which is sort of the point of a hash table), the keys are already in "random" order; just grab the first n and you're done.

(I suppose you could argue this depends on the organization of your hash table and what it means to be "random"...)
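In the stream API from the problem statement, that answer is nearly a one-liner (a sketch; `first_n` is a hypothetical helper, and the uniformity claim rests entirely on the hash function scattering keys well):

```python
from itertools import islice

def first_n(table_stream, n):
    # If the hash function scatters keys uniformly, iteration order is
    # already "random" with respect to the data, so the first n entries
    # are as good as any random n.
    return list(islice(table_stream, n))
```

The caveat above applies: if the table's on-disk organization correlates with the data (or the hash is weak), this is not random at all.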

• Anonymous
• Exactly what I was going to ask. :) If so, it's easy... just iterate over all rows, picking ones where int(rand(remaining_rows)) == 0. But I imagine he doesn't know the size.
• (scaling probability for n, and stopping early when n is 0....)
• i don't know a definitive answer to this, but the obvious solution seems to be something like: store the first N sequential entries into a table, then with some probability on each iteration, drop and replace a random entry in the table with the last entry read from the stream. This seems to favour entries near the beginning of the stream, though, unless there's a way to vary the replacement probability...
• (My statistics/probability background sucks, so the following might be wrong; I welcome corrections.) It actually favors either the initial N or a window of the past M entries (for some M). If S is the size of your set, and P is your probability, then: P > N/S favors the last M entries (M >= N/P, I think); P = N/S is fair(?); P < N/S favors the initial N entries (unlikely to pull N fresh elements out of the whole set). If your P is too high, there is some large M such that, when you try M iters, you will eventually have more than N hits; thus for any dataset S > M+N, the first N elements are less likely to be selected than the next M elements. That said, I'd be happy with this solution as a first approximation. It's a good idea (better than any of my ideas, for sure), and I think its drawbacks will be relatively universal.
• Hmm, I completely mis-read your problem statement, so here's what I wrote: I solved a similar problem, but I was updating the hash table, not just reading it, and I had control over how many entries I was updating at a time. (Basically, the problem I solved was "how do you store 2^40 entries in a hash table without doing a disk seek for every entry?") The main trick was to generate a large list, then sort on key. If n is large enough that you average more than one entry per disk block, this saves you enormous latency. If n is small enough that you average less than one entry per disk block, the sorting technique should still be beneficial, because near-track seeks are cheaper than cross-disk seeks. For your actual problem, I'll assume that you know N (the number of keys in your on-disk hash) and n (the number of random entries you'd like). Then you simply calculate the ratio r = (float)n/N and generate a uniformly distributed random floating point value in [0,1] for each entry in your loop: x = drand48(). Then if (x < r), select the entry.
• Hmm, I'm not convinced that's "random", since the result can be determined from the value of the input data... surely a random set should have some external factor, as well?
•  Hm, that's a fair point. But it is random the first time, as long as you don't really know the relationship between the data and the hash function involved.
•  Ever since I heard this question, I implement the probabilistic solution in code whenever I get the chance. :)
• Anonymous

Classic problem

For a classic solution to the slightly more general problem -- how do you maintain a random sample over a stream? -- see the paper on reservoir sampling by Vitter.

The basic premise is that if you want to maintain a sample of size s:
iterate over the hash table
for each key: if you have seen fewer than s values so far, add it to the sample.
: if this is the i-th value and i > s, with probability s/i add it to the sample and evict a random item; with probability (1 - s/i) ignore it.

-anon lurker
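A sketch of Vitter's Algorithm R (`reservoir_sample` is my name; note the replacement probability for the i-th item is s/i, which is what keeps every item equally likely to end up in the sample):

```python
import random

def reservoir_sample(stream, s):
    """Uniform sample of size s from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)
        else:
            # Keep the new item with probability s/i, evicting a
            # uniformly chosen resident; otherwise discard it.
            j = random.randrange(i)
            if j < s:
                reservoir[j] = item
    return reservoir
```

If the stream turns out to have s or fewer items, the whole stream is returned.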
• Well, hashes usually try to be random, so the first n might do just fine. If you want another n, the next n are probably OK too, though that's now no longer perfectly correct.

Reading the whole thing once to count the number of values, picking n random values out of that count, and then reading through _again_ is correct, but I'm sure not what you had in mind. (It doesn't matter if you get a different ordering every pass.)

If the hash table is suitably stored, you could just dive into n random locations on the disk and then scan through to the next start of a value. But you don't say I can do that.

I suppose my algorithm is: first time, pick the first n and remember the average size of the entries. mth time: use the average size of the m*n entries and the disk size to guess the number of total entries. Pick n values out of the guessed max. For values (j) <= m*n, go back and read the j'th entry; for values > m*n, just read a few more values from where you left off reading last.

You get more and more inefficient as you keep feeding your canny statistician n values, but there's nothing you can do, because he's going to notice if you start clustering values (i.e. you feed him one of the n and he starts to win bets about what the other values will be).

What a long rambly post. I wonder what's in the white text.
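The two-pass count-then-pick idea in the comment above is exact, at the cost of reading the table twice. A sketch, assuming (hypothetically) you can restart the stream; `two_pass_sample` and `make_stream` are my names:

```python
import random

def two_pass_sample(make_stream, n):
    """Exact uniform sample of n entries using two passes.

    `make_stream` is assumed to be a callable returning a fresh
    iterator over (key, value) pairs in the same order each time.
    """
    total = sum(1 for _ in make_stream())          # pass 1: count
    wanted = set(random.sample(range(total), n))   # n distinct positions
    return [kv for i, kv in enumerate(make_stream()) if i in wanted]
```

Since the n positions are chosen uniformly without replacement, the result is a uniform n-subset regardless of how the hash function orders the entries.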