Additionally, my Ruby speed problem was, as I suspected, operator error. (I was forgetting to return the right thing from a function so it was returning a large data structure -- the perils of switching between static typing and not!) It's definitely slower than C++ but tolerably slower.
Here it is predicting the top 3 tags for some recent posts of mine:
% ./tag-predict -m model.db -d evan_tech.db predict 688
subject: packrat parsing: clever
#1: go read: 0.085
#2: haskell: 0.052
#3: free software: 0.033
% ./tag-predict -m model.db -d evan_tech.db predict 660
subject: gtk 2.10 overview
#1: free software: 0.080
#2: go read: 0.045
#3: project: 0.036
Using logistic regression for a task like this is pretty simple. Here's some background if you're not into this sort of thing:
Logistic regression takes a bunch of examples and produces a classifier that, given a new instance, outputs a probability ranging from 0 (no) to 1 (yes).
Examples are represented as a collection of features, which are binary-valued. Making this useful is a little weird to wrap your head around at first. Some example features for LJ posts look like "is the word 'haskell' in the subject?" or "does the post link to cnn.com?" One of the limitations of this approach is that you can't have features like "how long is the post?", as that is not a binary question; instead, you can hack around it with multiple separate features like "is the post longer than average?" and "is the post in the top quartile of post lengths?" On the other hand, one of the nice things about logistic regression is that it can work with a huge number of features -- from my examples above you'll have one per word you've ever used and one for every site you've ever linked to. (By contrast, SVMs work with real-valued inputs, but are more limited in their number of computable features.)
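That length hack looks roughly like this in code (just a sketch: the feature names and the precomputed thresholds here are made up for illustration, not what the real code uses):

# encode the real-valued post length as a few binary threshold features
# (average_length and top_quartile_length would be precomputed over all posts)
len = entry.body.length
features['len:above_average'] = 1 if len > average_length
features['len:top_quartile'] = 1 if len > top_quartile_length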
For this application, I then build one model per tag, so the output is the probability the tag applies: near 1 means the tag likely applies, near 0 means it likely doesn't, and 0.5 means no opinion. The math works out such that it's impossible for it to have a value of exactly 1 or 0, which is reassuring: it can never be completely positive of its answer.
The resulting model comes out to basically each feature having a score, and the score of an instance is just a function of the sum of the scores of all the instance's features. (By contrast, the SVM model is much more complicated, which means it can capture a lot more subtlety than a simple combination.)
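In code, the per-tag prediction is roughly this (a sketch only; the real thing pulls the weights out of the database rather than a hash, and the names here are just for illustration):

# probability that one tag applies: the logistic function of the
# constant term plus the weight of every feature the post turned on
def predict(weights, features)
  sum = weights.fetch('constant', 0.0)
  features.each_key { |f| sum += weights.fetch(f, 0.0) }
  1.0 / (1.0 + Math.exp(-sum))  # always strictly between 0 and 1
end

For instance, plugging in the "constant" and "b:clr" weights from the query below, a post whose only active feature is b:clr comes out around 0.03 -- almost certainly not a "programming languages" post.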
For my data, it takes about 3 seconds to run those predictions, but the current bottleneck is actually serializing everything in and out of a database and not the math. Yes, a database is not the right thing to be using for this, but while testing it allows exploratory queries like this:
"What the the 10 features that most affect whether the post should be tagged 'programming languages'?"
sqlite> select f.name, mf.weight from model m, features f, model_features mf where m.name='programming languages' and m.id=mf.model_id and f.id=mf.feature_id order by abs(mf.weight) desc limit 10;
constant|-3.66346016252252
b2:lj|-0.224808992868228
s:how|0.221666162107147
b:clr|0.215414279817695
b2:my|-0.183794881973486
b2:language|0.180736339915022
b2:functional|0.169162208567001
b:from|-0.167727240605835
b:aol|0.164069963918593
b:ism|0.161124210509422
Here you can see the names I've given my features. The ones starting with "b" are the workhorses: did the word occur in the body of the post? Explaining the others will also explain some other concepts.
- "b2:foo" means "did foo occur more than twice in the post?" As I mentioned before, this is the hack to get around the fact that features are binary. You'll also notice that "b2:language" is in the top ten, but that "b:language" didn't make it -- presumably, I use the word "language" all over the place. This is one reason it can be dangerous to read these models directly: even if "language" generally occurs in programming languages posts, it could work out such that the b2 feature covers it and that the b feature could even end up negative. As I understand it, it's pretty much impossible to read the models from SVMs so being able to at least get this much out of an LR model is something to be grateful for.
- "constant" is a coefficient that should always be included in the sum. You can see that it is large and negative, which implies that lacking any other evidence, "programming languages" doesn't apply. I think this is because I trained on all of my posts, and most of them are not tagged.
Part of my goal in making this was to make it relatively easy to add new features, with the thought that maybe some of y'all could contribute ideas. I mentioned cnn.com before: I'd imagine it'd be very strong for whether your post is about news. You can do all sorts of other wacky things, like "does this post contain a <pre> tag?" or "did this post get any comments?" Anyway, actually achieving that ought to be as simple as checking out the code and modifying the featurize_entry function. Here's how it adds features for all the words in the subject of the entry:

if entry.subject
  entry.subject.downcase.scan(/\w+/) do |w|
    features[intern.intern('s:' + w)] = 1
  end
end

(That "intern" bit probably isn't actually necessary anymore, but that's how I wrote it...)