September 9th, 2008

  • evan

gat, a git clone in haskell

I've been pretty busy with work lately, so I may as well dump this on the internet before it gets too dusty. Though I think I understand Git decently enough now, I've been curious about the internals. So I sat down to start reading and documenting, for example, the on-disk pack file format. (Git includes some technical docs but they're pretty thin.) But then I started writing some Haskell code for fun and to verify my understanding was correct and before I knew it I had 5% of a Git clone written in Haskell. That is, not a "porcelain" (a frontend on top of their utilities) but a separate implementation that works with the on-disk format.

So, gat. It can read the index, loose objects, and a bit of pack files; it has an implementation of the delta decoder so it can read delta-compressed objects out of the pack format; it understands how to convert git-svn^^ into reading refs/remotes/git-svn, recursively parsing out its parent pointers, and decoding that; it can do Git-style color diffs between various trees and commits. And finally and most importantly: it does some mmap()ing but is not at all fast, it is incorrect in the pieces it has implemented, and it does no writing of any data structures whatsoever -- that is to say, nothing useful.

But it's been fun to play with, at least. Maybe I'll clean up those docs and post them too at some point.

A few notes from reading Git code:
  • Git is a jumble of random nearly-commentless code, full of globals and strange state and not at all clear control flows. On the other hand, it's also much more Unixy than the code I'm used to reading, doing all sorts of tricks like using mmap() instead of read() (because the latter just involves an extra copy, y'know?) and forking. I am simultaneously impressed and terrified of what's likely going on in my kernel.
  • To align some disk-based data structure Git uses this code:
    (offsetof(struct cache_entry,name) + (len) + 8) & ~7
    That's an off-by-one error (gets eight bytes of padding where zero would do) that I'm a little puzzled by -- maybe it's intentional? Maybe it's remained for backwards compatibility?
  • This goto also makes no sense to me:
        ret = ent->data;
        if (ret && ent->p == p && ent->base_offset == base_offset)
          goto found_cache_entry;
        return unpack_entry(p, base_offset, type, base_size);
    (Those are the only two instances of found_cache_entry in the file...)
  • Git contains at least three incompatible bit-packed integer encodings, found in pack file offsets, object headers, and the delta format.
  • It's crazy to me that in 2008 people are still writing code that is manually passing around buffers and lengths and verifying by hand that the space works out.
  • less can take all sorts of useful flags, causing it to e.g. only page if necessary, allow color but not other control codes through, and more! When Git spawns less as a pager, it actually execs less in the parent side of the fork, making itself the child of the less process; dear Unix people, explain to me why you'd do that? (I have my thoughts, but I'm curious about yours.)