January 24th, 2009

  • evan

megaupload captcha

Someone made a Javascript-based captcha cracker for megaupload. It's strange to see those captchas again, because I myself idly wrote a captcha cracker for the same site. I don't think I ever announced it here because there wasn't any real application for it; I was just playing around.

According to the repo history it was two years ago, and the notes indicate how far I went before losing interest. As I recall their captcha image was quite weak, but you also needed to be able to interpret Javascript and fake out the DOM to skip a timed countdown.
  • evan

adventures in search optimization

At work we heavily use a tool for rapidly searching the codebase with regexps. (I believe it's the same engine behind Google Code Search.) But it isn't as helpful when your index is stale, which happens more often when your code churns rapidly (as Chromium does -- ~50 commits a day by us and ~50 more commits a day from WebKit). I had been idly thinking about making some sort of full-text indexing system for working with my own code that was smart about watching for file modifications, building generations of indexes, etc. It sounded like a fun project.

Then I just chatted with Matt a bit about it and realized I was thinking about it wrong. His first thought was that there shouldn't be that much data involved here and I should be able to just do a brute-force scan.

But there's too much data, I protested. To start with, running find across my tree with a cold disk takes tens of seconds just to enumerate all the files (over 200,000). And Visual Studio's "find in files" is also super-slow, supporting my intuition. But on second glance that file count is suspicious: it's way too many, because it includes all sorts of files I don't care about. As far as source goes it's only around 10,000 files and under 50mb of data.
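Matt's estimate is easy to sanity-check before building anything. A rough sketch (assuming .cc and .h are the extensions that matter, which is my guess for a C++ tree; a real count would include a few more):

```shell
# Count every file under the tree -- the number find has to churn through:
find . -type f | wc -l
# Count only C++ sources and headers, and total their bytes.
# (xargs may split very large file lists into batches, so the "total"
# line is only exact for trees that fit in one batch -- fine for a
# back-of-the-envelope estimate.)
find . -type f \( -name '*.cc' -o -name '*.h' \) -print0 \
  | xargs -0 wc -c | tail -1
```

The gap between the two numbers is the whole point: most of what find enumerates is never worth searching.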

So what's faster? Git already knows which files are in the repo. (A great trick for finding files by name more quickly than find is git ls-files <pattern>.) Then I just want to limit the grep to source files. I had thought git grep didn't let you specify a filename pattern (which is why I was fooling around with find and grep in the first place), but after rereading the source I see I had overlooked it in the man page: something like git grep foo -- '*.cc' '*.h' does exactly what I wanted. (The quotes keep the shell from expanding the globs itself; git matches them against the whole tree.) It's easy enough to stuff into a one-liner shell script so that git gs foo quickly searches my code.
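As a sketch, the wrapper is about one line; saved as an executable git-gs somewhere on $PATH, git turns "git gs foo" into a call to it (the name gs and the extension list are just my choices here):

```shell
# git-gs: grep only C++ sources and headers tracked by git.
# The quoted globs are pathspecs for git, not shell globs, so they
# match *.cc and *.h files anywhere in the tree.
git_gs() {
  git grep "$@" -- '*.cc' '*.h'
}
```

Any extra flags ("git gs -n foo", etc.) pass straight through to git grep via "$@".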

On a cold disk (after flushing the cache), it takes ~11s on my laptop. But as soon as the disk is warm (and it's only 50mb of data to keep around anyway) it's 0.35s, which is plenty fast. I note that whoever wrote the grep support for git was clever enough to shell out to grep (unless your combination of passed-in flags prevents it), because you are unlikely to beat grep.
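The cold/warm comparison is easy to reproduce with a sketch like this; "foo" is a placeholder pattern, and the page-cache flush (the usual Linux drop_caches write) needs root, so it's left as a comment:

```shell
# Time one search; run it twice to compare cold vs. warm cache.
# For a genuinely cold first run (Linux-only, needs root):
#   sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
timed_search() {
  start=$(date +%s%N)   # GNU date: nanoseconds since the epoch
  git grep "$1" -- '*.cc' '*.h' >/dev/null
  echo "took $(( ($(date +%s%N) - start) / 1000000 )) ms"
}
```

The second call should land in the sub-second range once the tree is in cache.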

(PS/update: it turns out that M-x vc-git-grep hooks all this into the existing Emacs grep support, complete with shorthand for specifying "file extensions that look like C++ source or headers". I am humbled.)