01:44 pm, 22 Aug 05

### subsetting

Coding question:

Suppose you have a large (on-disk, too large for memory) hash table that you'd like to pull *n* random entries out of. You don't know the total number of entries, just that there are more than *n*. Assume you have an API like "`for key, value in table.stream(): ...`". How do you do it?

Update: made it more clear, and added an answer here, after the cut and in white-on-white. The comments discuss the solution, so don't read 'em unless you want spoilers.

This is related to a common interview question, and one that I think is sorta lame. You could maybe expect an interviewee to solve it with some probability and some induction when n=1, but it feels to me like it's a trick question.

(It is instructive to solve it for n=1, and the hint is to use induction on big-N, the length of the data.)

But the answer that **rjpower** gave me recently sorta made me feel dumb (or, perhaps, too clever for my own good -- I was busily implementing something more complicated): if your hashing function is good (which is sort of the point of a hash table), the keys are already in "random" order; just grab the first *n* and you're done.

(I suppose you could argue this depends on the organization of your hash table and what it means to be "random"...)
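rjpower's trick fits in a few lines. A minimal sketch, assuming any `(key, value)` iterator can stand in for `table.stream()` (the real on-disk table isn't specified, so `fake_stream` below is a hypothetical stand-in):

```python
import itertools

def first_n(stream, n):
    # rjpower's observation: a good hash function already scatters keys,
    # so the first n pairs of the stream are effectively a random sample.
    return list(itertools.islice(stream, n))

# Hypothetical stand-in for table.stream() -- any (key, value) iterator works.
fake_stream = (("key%d" % i, i) for i in range(1000))
sample = first_n(fake_stream, 5)
```

Note the stream is consumed lazily, so this touches only the first *n* entries on disk.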

**brad**: If so, it's easy... just iterate over all rows, picking ones where `int(rand(remaining_rows)) == 0`.

But I imagine he doesn't know the size.

**lithiana**: this seems to favour entries near the beginning of the stream, though, unless there's a way to vary the replacement probability...

**xaosenkosmos**: It actually favors either the initial N or a window of the past M entries (for some M). If S is the size of your set, and P is your probability, then:

- P > N/S: Favors last M entries (M >= N/P, I think)
- P = N/S: fair(?)
- P < N/S: Favors initial N entries (unlikely to pull N fresh elements out of the whole set)

If your P is too high, there is some large M such that, after you try M iterations, you will eventually have more than N hits; thus for any dataset with S > M+N, the first N elements are less likely to be selected than the next M elements.

That said, i'd be happy with this solution as a first approximation. It's a good idea (better than any of my ideas, for sure), and i think its drawbacks will be relatively universal.
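The bias being discussed is easy to see in a toy run. A rough sketch of the fixed-probability scheme under critique (function name and parameters are illustrative): with a fixed selection probability `p` much larger than n/S and an early stop once n items are taken, the quota is exhausted long before the tail of the stream is ever reached.

```python
import random

def fixed_prob_sample(stream, n, p):
    # Select each entry with fixed probability p, stopping as soon as
    # n items have been taken -- the early stop is what skews the sample
    # toward the head of the stream when p >> n/S.
    picked = []
    for item in stream:
        if random.random() < p:
            picked.append(item)
            if len(picked) == n:
                break
    return picked

random.seed(0)
out = fixed_prob_sample(range(1000), n=5, p=0.5)
# All five picks come from the very front of a 1000-entry stream.
```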

**eqe**: I solved a similar problem, but I was updating the hash table, not just reading it, and I had control over how many entries I was updating at a time. (Basically, the problem I solved was "how do you store 2^40 entries in a hash table without doing a disk seek for every entry?") The main trick was to generate a large list, then sort on key. If $n$ is large enough that you average more than one entry per disk block, this saves you enormous latency.

If $n$ is small enough that you average less than one entry per disk block, the sorting technique should still be beneficial because near-track seeks are cheaper than cross-disk seeks.

For your actual problem, I'll assume that you know $N$ (the number of keys in your on-disk hash) and $n$ (the number of random entries you'd like). Then you simply calculate the ratio

`r = (float)n/N` and generate a uniformly distributed random floating point value in [0,1] for each entry in your loop: `x = drand48();`. Then if `x < r`, select the current key. Quit once you've accumulated enough keys. (If the resulting deemphasis of the tail of the stream is a problem, then do the random selection all the way to the end of the stream and then randomly deselect enough entries to get down to $n$.) This leaves the possibility that you didn't select enough entries; in this case, fill in the blanks by choosing random indices and selecting those entries from the hash.

There's probably a better technique where you set up a "hash cache" (a hash table with just $n$ pointers, without linked lists off of the table heads, so a hash collision results in the previous entry being dropped from the table), but it has the problem of requiring a hash function with data independence, which most hash functions don't have. (Well, I guess my above solution has that problem as well.)
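eqe's per-entry coin flip can be sketched as follows. A rough sketch, assuming N is known; `random.random()` stands in for `drand48()`, over-selection is trimmed at the end as described, and under-selection is left to the index-based fill-in eqe mentions:

```python
import random

def ratio_sample(stream, n, N):
    # Select each entry independently with probability r = n/N.
    r = float(n) / N
    picked = [kv for kv in stream if random.random() < r]
    # Randomly deselect down to n if we overshot; undershooting would
    # need the random-index fill-in described above.
    if len(picked) > n:
        picked = random.sample(picked, n)
    return picked
```

Trimming at the end (rather than stopping at n hits) avoids the head-of-stream bias discussed in the earlier comments.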

## Classic problem

For a classic solution to the slightly more general problem -- how do you maintain a random sample over a stream -- see the paper on reservoir sampling by Vitter. The basic premise is that if you want to maintain a sample of size s:

iterate over the hashtable

for each key: if you have seen less than s values so far, add it to the table; if you have seen more, with probability 1/s add it to the table and evict a random item, with probability (1 - 1/s) ignore it.

-anon lurker
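The reservoir scheme can be sketched in a few lines. A minimal sketch, using the eviction probability s/i for the i-th item seen (Vitter's Algorithm R; note this matches the s/n correction in the reply below, not the 1/s stated above):

```python
import random

def reservoir_sample(stream, s):
    # Vitter's Algorithm R: keep the i-th item seen with probability s/i,
    # evicting a uniformly random resident when it is kept.
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)
        else:
            j = random.randrange(i)  # uniform in [0, i)
            if j < s:                # true with probability s/i
                sample[j] = item
    return sample
```

This needs neither the total count N nor a second pass, which is exactly what the original question asks for.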

**dan_erat**

## Re: Classic problem

ITYM "with probability s/n add it to the table and evict a random item", where you've now seen n items.

**smallishmagi**: Reading the whole thing once to count the number of values, picking n random values out of that count and then reading through _again_ is correct, but I'm sure not what you had in mind (it doesn't matter if you get a different ordering every pass).

If the hash table is suitably stored, you could just dive into n random locations on the disk and then scan through to the next start of a value. But you don't say I can do that.

I suppose my algorithm is:

first time:

pick the first n

remember the average size of the entries

mth time:

use the average size of the m*n entries and the disk size to guess the number of total entries.

pick n values out of guessed max.

for values (j) <= m*n, go back and read the j'th entry

for values >m*n, just read a few more values from where you left off reading last.

You get more and more inefficient as you keep feeding your canny statistician n values, but there's nothing you can do, because he's going to notice if you start clustering values (i.e. you feed him one of the n and he starts to win bets about what the other values will be).

What a long rambly post. I wonder what's in the white text.

**banana**

## Small point

Selecting the first `n` has another (minor) good property. If the source hash table has `s` entries and is too large to fit in memory, and `n` may be any number up to `s-1`, the result may not fit into memory. Transferring the first `n` from the source to a target stream (i.e. disk) is still easy.