I've had a first shot at doing this sitting around for months, but I got frustrated trying to make all the pieces fit together. (I've heard it estimated that machine learning is 90% massaging data into the right formats and 5% algorithms. And the algorithm is provided by a library anyway.) Today I finally made it work. I'd normally put the code online but it's particularly hideous.
I was trying to do the data munging with Ruby but it was super slow. (I'm not really sure what I was doing wrong, but a simple loop that inserted all the words from the entries into a hash table could only process 50 entries a second and got slower as it went on -- even though the hash only holds 10k elements?) But I actually had a lot of C++ code hanging around (see previous post), and C++ isn't too bad if you've got good libraries to support it.
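For reference, the kind of loop described above looks roughly like this. It's a minimal Python sketch, not the Ruby or C++ I actually ran; the tokenizing regex, the `count_words` name and the sample entries are all made up for illustration.

```python
import re
from collections import Counter

def count_words(entries):
    """Count how often each lowercased word appears across all entries."""
    counts = Counter()
    for text in entries:
        # Crude tokenizer (an assumption): runs of letters, digits and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts

# Made-up sample entries, just to exercise the loop.
entries = ["Haskell is fun", "More fun with C++ and Haskell"]
counts = count_words(entries)
```

Each word costs one hash insert or update, so the work should grow roughly linearly with the total number of words rather than slowing down as the table fills.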
Anyway, the classifier suggested that an old post from over two years ago should be tagged haskell. The post is a pretty good candidate, really: some of my initial impressions of the language, along with some reflections on human language. It's funny to see how little I've changed in these respects over the past two years. (But at the same time, that was before Google, which has altered my perspective on so many things!)
As always, this makes me wish I knew more about what was going on. This classifier (libsvm: it was even recommended by my officemate, who's behind TinySVM) outputs a yes/no value, and what I really want is some sort of relative score: bad suggestions are ok as long as they're ranked below good ones. I guess the difference is classification versus regression? The libsvm docs peter out right at the point where they'd start explaining these sorts of things.
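To make the "relative scores" idea concrete: a linear classifier computes a real-valued decision value and then thresholds it, so keeping the value instead of just its sign gives a ranking for free. The sketch below is plain Python, not libsvm's API. The weights, bias and post names are hypothetical, invented for illustration rather than taken from a trained model.

```python
def decision_value(weights, bias, features):
    """Signed score of a sparse feature dict under a linear model."""
    return bias + sum(weights.get(name, 0.0) * value
                      for name, value in features.items())

# Hypothetical weights for a "haskell" tag (not from any real model).
weights = {"monad": 1.5, "ghc": 1.0, "ruby": -0.8}
bias = -0.5

# Hypothetical posts as sparse word-count features.
posts = {
    "post_a": {"monad": 2, "ghc": 1},
    "post_b": {"ruby": 3},
    "post_c": {"ghc": 1},
}

scores = {name: decision_value(weights, bias, feats)
          for name, feats in posts.items()}
labels = {name: score > 0 for name, score in scores.items()}  # the yes/no answer
ranked = sorted(scores, key=scores.get, reverse=True)         # best suggestion first
```

Here `labels` throws away how confident each answer is, while `ranked` keeps it, which is all that matters if bad suggestions only need to land below good ones.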
Additionally (and this may just be Google's influence spoiling me) I'm surprised people can get useful results out of SVMs, given that they work on training sets numbering "only" in the thousands of examples. I remember a talk by a guy who claimed he could generally beat SVMs using logistic regression because he could throw much more data at it and try out new approaches so much faster. Does anyone know a good logistic regression library? (The first hit for [sparse logistic regression library] turns out to be software by Paul [previously-mentioned speaker] himself! Some poking around the site turned up the slides from the talk he gave.) Maybe it'd be more educational to implement it myself...
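As a starting point for the implement-it-myself route, here is a minimal sketch of logistic regression over sparse features, trained with plain stochastic gradient descent. It is a toy under stated assumptions, not the speaker's software: there is no regularization, the learning rate and epoch count are arbitrary, and the four training examples are made up. It does output a probability per example, which doubles as the kind of relative score I wanted above.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, features):
    """Probability that a sparse feature dict belongs to the positive class."""
    z = sum(weights.get(f, 0.0) * v for f, v in features.items())
    return sigmoid(z)

def train(examples, epochs=50, rate=0.1, seed=0):
    """Fit weights by stochastic gradient descent on the log loss."""
    rng = random.Random(seed)
    examples = list(examples)
    weights = {}
    for _ in range(epochs):
        rng.shuffle(examples)
        for features, label in examples:
            error = label - predict(weights, features)
            # Only the features present in this example get updated.
            for f, v in features.items():
                weights[f] = weights.get(f, 0.0) + rate * error * v
    return weights

# Made-up training data: 1 = should be tagged haskell, 0 = should not.
examples = [
    ({"bias": 1, "monad": 1, "ghc": 1}, 1),
    ({"bias": 1, "monad": 1}, 1),
    ({"bias": 1, "ruby": 1}, 0),
    ({"bias": 1, "ruby": 1, "rails": 1}, 0),
]
weights = train(examples)
p_haskell = predict(weights, {"bias": 1, "monad": 1})
p_ruby = predict(weights, {"bias": 1, "ruby": 1})
```

Each update touches only the features in the current example, so the cost per example depends on how many words it contains rather than on the vocabulary size. That is presumably part of why this approach can chew through so much more data.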