April 30th, 2006

  • evan

c++ type overloading pattern + some haskell

I saw this pattern in some work code, but I find it applies in a bunch of places. (I'm not much of a C++ wizard, so maybe this is old news to you, but it impressed me.)

This comes up in situations where you have a function that wants to take a bunch of parameters of varying types: imagine printf, or something that wants to return a variable number of values. Here's an example of the latter: you've just retrieved a row from a database, and you know that the first column is an int, the second a string, etc. The typical thing to do (which I do in LogJam, for example) is write it out:
// imagine "SELECT id, user FROM ..."
int id = sqlite3_column_int(stmt, 0);
const char *user = sqlite3_column_text(stmt, 1);

Tedious, and you have to update the numbers if you rearrange the select.

Here's the pattern. First you make a class that can be constructed with all the different types you'd like to accept:
class OutParam {
  // my kingdom for a discriminated union
  enum { UNUSED, INT, STRING /* etc */ } type_;
  union {
    int *int_out_;
    string *string_out_;
  };
 public:
  OutParam() : type_(UNUSED) {}
  OutParam(int *i) : type_(INT), int_out_(i) {}
  OutParam(string *s) : type_(STRING), string_out_(s) {}
  void get(const Row &row, int col) {
    switch (type_) {
    case UNUSED: break;
    case INT: *int_out_ = row.get_int(col); break;
    case STRING: row.get_string(col, string_out_); break;
    }
  }
};
static const OutParam UNUSED;
Then you can provide a function that accepts some of these objects and fills them out accordingly:
class Row {
  // get_int and friends wrapping the sqlite API
  void get(OutParam p0, OutParam p1=UNUSED,
           OutParam p2=UNUSED, OutParam p3=UNUSED) {
    p0.get(*this, 0);
    p1.get(*this, 1);
    p2.get(*this, 2);
    p3.get(*this, 3);
  }
};
This now allows the user to retrieve all the fields in one go while retaining static typing:
int id; string user;
row = /* something that runs the select, as above */;
row->get(&id, &user);
If you wanna see the full implementation, browse my repository.

(Note that this is more powerful than the tuple-returns you see in today's high-level languages, because it knows which fields are ints and which are strings.

I feel like I've written code like this a million times:
foo, bar, baz = ARGV[0].to_i, ARGV[1], ARGV[2]
where
foo, bar, baz = ARGV
would almost do except for the to_i.)

I suppose I really ought to read through the boost code before I get too self-congratulatory about this (I just found this in someone else's code, so my only real pride in it is taking the effort to type it out). And now that I think about it, monotone is written by C++ hax0rs and uses sqlite so they probably do something clever there, too...


I recently saw a vaguely related idea that's even fancier in Haskell. Haskell lets you overload functions by return type, which is even weirder with type inference because it means that the behavior of a function call can effectively depend on how you use the variables in the code afterwards. So their library lets you write code like:
let (verbose, filename) = getOptions ('v', 'f')
when verbose $ putStrLn "Starting up..."
open filename    {- or whatever -}
Since verbose is used in a when test, the compiler knows it's a Bool, which means getOptions knows that the v flag takes no argument. Similar reasoning lets it conclude that the f flag requires a string argument, and so getOptions can nearly magically know which of the flags you give it needs which sort of argument, and do the appropriate error-checking and reporting.

It's a pretty sweet and sick hack. (They do some similarly sick stuff for flag documentation.) Unfortunately the implementation is a bit on the sick side too...
  • evan

autotagging lj posts with an svm

I always forget to put tags on my LJ posts, but I figure the computer could do it (or at least suggest them) for me. It's a pretty standard machine learning situation: old posts with tags = labelled data, new post = classification problem. In fact, this application is partially what I had in mind in making those LiveJournal Ruby bindings with exporting.

I've had a first shot at doing this sitting around for months, but I got frustrated trying to make all the pieces fit together. (I've heard it estimated that machine learning is 90% massaging data into the right formats and 5% algorithms. And the algorithm is provided by a library anyway.) Today I finally made it work. I'd normally put the code online but it's particularly hideous.

I was trying to do the data munging with Ruby but it was super slow. (I'm not really sure what I was doing wrong, but a simple loop that inserts all the words from entries into a hash table could only process 50 entries a second, and it got slower as it went on -- even though the hash only holds 10k elements?) But I actually had a lot of code hanging around in C++ (see previous post), and C++ isn't too bad if you've got good libraries to support it.

Anyway, it suggested an old post from over two years ago should be tagged haskell. The post is a pretty good candidate, really: some of my initial impressions of the language, along with some reflections on human language. It's funny to see how little I've changed in these respects over the past two years. (But at the same time, that was before Google, which has altered my perspective on so many things!)


As always this makes me wish I knew more about what was going on. This classifier (libsvm: it was even recommended by my officemate, who's behind TinySVM) outputs a yes/no value, and what I really want is some sort of relative score: bad suggestions are ok as long as they're ranked below good ones. I guess the difference is classification versus regression? The libsvm docs peter out right at the point where they'd start explaining these sorts of things.

Additionally (and this may just be Google's influence spoiling me), I'm surprised people can get useful results out of SVMs, given that they work on input data volumes of "only" thousands of examples. I remember a talk by a guy who claimed he could generally beat SVMs using logistic regression because he could throw much more data at it and crunch new approaches so much faster. Does anyone know a good logistic regression library? (The first hit for [sparse logistic regression library] turns out to be software by Paul [previously-mentioned speaker] himself! Some poking around the site turned up the slides from the talk he gave.) Maybe it'd be more educational to implement it myself...