Evan Martin (evan) wrote in evan_tech,
Evan Martin
evan
evan_tech

subsetting

Coding question:
Suppose you have a large (on-disk, too large for memory) hash table that you'd like to pull n random entries out of. You don't know the total number of entries, just that there are more than n. Assume you have an API like "for key, value in table.stream(): ..." How do you do it?

Update: made it more clear, and added an answer here, after the cut and in white-on-white. The comments discuss the solution, so don't read 'em unless you want spoilers.
This is related to a common interview question, and one that I think is sorta lame. You could maybe expect an interviewee to solve it with some probability and some induction when n=1, but it feels to me like it's a trick question.
(It is instructive to solve it for n=1, and the hint is to use induction on big-N, the length of the data.)

But the answer that rjpower gave me recently sorta made me feel dumb (or, perhaps, too clever for my own good -- I was busily implmenting something more complicated): If your hashing function is good (which is sort of the point of a hash table), the keys are already in "random" order; just grab the first n and you're done.

(I suppose you could argue this depends on the organization of your hash table and what it means to be "random"...)
Tags: fourth wall, friends
Subscribe

  • google ime

    Japanophiles might be interested to learn that Google released a Japanese IME. IME is the sort of NLP problem that Google is nearly uniquely…

  • megaupload captcha

    Someone make a Javascript-based captcha cracker for megaupload. It's strange to see those captchas again because I idly myself wrote a…

  • zombie ghosd

    I was tickled to discover another IBM developerworks article on one of my abandoned hacks and that both it and its predecessor have been translated…

  • Post a new comment

    Error

    default userpic
    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.
  • 12 comments

  • google ime

    Japanophiles might be interested to learn that Google released a Japanese IME. IME is the sort of NLP problem that Google is nearly uniquely…

  • megaupload captcha

    Someone make a Javascript-based captcha cracker for megaupload. It's strange to see those captchas again because I idly myself wrote a…

  • zombie ghosd

    I was tickled to discover another IBM developerworks article on one of my abandoned hacks and that both it and its predecessor have been translated…