evan_tech

10:13 am, 21 Oct 07

What you need to know about git

This is a braindump about git. I don't describe commands, like how to add and remove files, because the git project provides tutorials as well as translation guides for users of other tools. Instead, I'm trying to describe what distinguishes git from other VCSes in the same space (sort of a hobby of mine, see my other posts) and what makes it interesting. This is the sort of thing I wish I could've read when I first started looking into git.

The place to start is at the philosophy, because that makes a lot of other decisions more clear. Imagine you're Linus. You're managing a huge code base with many contributors, but you don't really care much about version control as it currently exists -- you've been content in the past managing the whole thing by just patches and email. What do you need to get more work done with less pain?

With that in mind, git to me feels much more like a "content tracker" (their term) than a "version control system". It starts with a content-addressable file system as its primitive and then adds the minimal layer of glue on top of it to support a workflow, but above all the focus is speed and simplicity. For example, "git clone" is a shell script that calls curl and rsync, among others. For another example, git more or less doesn't handle renames in any sort of principled way, nor do its developers care much about merge algorithms -- I recall reading a thread where Linus argued that it's better to produce a conflict than to do something fancy, because in cases where the merge starts getting fancy you want a human to look it over anyway. The whole thing still feels super weird to me: the similarity to monotone is obvious, but the end result is really different.


object store
Let's start with the repository representation. (You can skip this section if you get monotone or mercurial, because as I understand it this is the bit of monotone that everyone borrowed.) Setting aside the underlying representation, which you don't need to care much about, imagine you can store and retrieve blobs of bytes keyed by their SHA-1. You can build pointers between blobs by storing collections of SHA-1s, which themselves get new SHA-1s representing the collection. Suppose you've stored two blobs with SHA-1s A and B; you can then store a blob whose contents is literally just the string "A\nB" and hash that, getting SHA-1 C. When you pull C out again, its content is just the pointers you need to the original two blobs via their SHA-1s. The important property of SHA-1s is that they are (hopefully) unforgeable, which means that the SHA-1 C is all you need to uniquely identify the original blobs.
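Here's the scheme sketched as a few lines of Python -- a toy model of the idea, nothing like git's actual on-disk format:

```python
import hashlib

# A toy content-addressable store: blobs of bytes keyed by their SHA-1.
store = {}

def put(data):
    """Store a blob, returning its SHA-1 key."""
    key = hashlib.sha1(data).hexdigest()
    store[key] = data
    return key

def get(key):
    return store[key]

# Store two blobs A and B...
a = put(b"first file\n")
b = put(b"second file\n")
# ...then a blob whose contents is literally just "A\nB"; its hash is C.
c = put(f"{a}\n{b}".encode())

# Pulling C back out yields the pointers to the original two blobs.
pointers = get(c).decode().split("\n")
assert [get(p) for p in pointers] == [b"first file\n", b"second file\n"]
```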

Git stores four sorts of objects. The important ones are (glossing over some details here):
  1. Files, whose contents are stored directly (without the filename or other metadata).
  2. Trees, which represent directory structures. Imagine a text file that contains a sequence of lines "<path/for/file> <file SHA1>".
  3. Commits, which represent history. Imagine a text file that contains a commit message followed by the SHA-1s of a tree and zero or more parent commits (the initial commit has no parent; merges have more than one).
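One detail I glossed over: if I have the format right, git doesn't hash a file's contents alone; it prepends a small header naming the object's type and size. A few lines of Python reproduce the SHA-1 that `git hash-object` reports for a file containing "hello\n":

```python
import hashlib

def git_object_id(obj_type, content):
    # git hashes "<type> <size>\0<content>", not the raw content alone.
    header = f"{obj_type} {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

print(git_object_id("blob", b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a
```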

The important thing to note is that a single SHA-1 of a commit, because of the chained SHA-1 pointers, uniquely identifies not only the entire state of a tree of files, but also all previous states of the tree. One SHA-1 can identify an entire history. (But also note there are no provisions for tracking renames; git actually tries to identify renames by looking for similar file content across different versions(!).)
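A toy illustration of that chaining (the structure here is invented for illustration, not git's real formats): because each level hashes the hashes of the level below it, touching any file anywhere changes the top-level commit SHA-1.

```python
import hashlib

def sha(data):
    return hashlib.sha1(data).hexdigest()

def commit_id(message, file_contents):
    # tree = hash of the file hashes; commit = hash of message + tree.
    tree = sha("\n".join(sha(f) for f in file_contents).encode())
    return sha((message + "\n" + tree).encode())

c1 = commit_id("initial", [b"main.c v1", b"README"])
c2 = commit_id("initial", [b"main.c v1, one byte changed", b"README"])
assert c1 != c2  # any change anywhere changes the commit's SHA-1
```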

Git's underlying store of these objects has some nice properties, like how files are only added and never modified. (This means it can make hardlinks between copies of local repositories without needing to implement copy on write.)


branches and tags
Here's where git starts to diverge from monotone. You have this pool of files and SHA-1s, and you need to know which SHA-1 to start at to do a checkout. Git uses text files, stored outside of the SHA-1 database in .git/refs/, that each contain an initial commit SHA-1. A new repository has one such file: heads/master, aka the "master" branch. If you create a branch, all that does is make a new refs file. Committing "on" a branch adds all the objects described in the previous section and then updates that branch's ref to point at the new commit, leaving all the other refs alone.
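A throwaway Python model of the refs directory (not real git, just the shape of the idea, with made-up SHA-1s):

```python
import os, tempfile

# Toy model of .git/refs: a branch is just a text file containing a SHA-1.
repo = tempfile.mkdtemp()
os.makedirs(os.path.join(repo, "refs", "heads"))

def write_ref(name, sha):
    with open(os.path.join(repo, "refs", name), "w") as f:
        f.write(sha + "\n")

def read_ref(name):
    with open(os.path.join(repo, "refs", name)) as f:
        return f.read().strip()

write_ref("heads/master", "d3adbeef" * 5)          # 40 hex chars
# Creating a branch is nothing more than writing another ref file:
write_ref("heads/topic", read_ref("heads/master"))
# A commit "on" topic rewrites only that one file:
write_ref("heads/topic", "cafef00d" * 5)
assert read_ref("heads/master") == "d3adbeef" * 5  # master untouched
```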

If you've cloned from another source, its head is also represented with a branch at remotes/origin/master. This allows you to pull upstream (with "git fetch") and look at it before you attempt a merge. Git is the first DVCS I've seen that has a good story for looking at a remote person's work. You can pull their tree into your repository as a branch alongside your other branches, and can use the normal git tools for diffing and merging between branches. If you don't like their code, you can throw away the branch without it ever affecting your own code.

Tags are, to quote Linus, "100% the same thing" as branches: files that contain the SHA-1 of a particular commit. In practice they're handled differently by the git tools -- "git branch" only displays branches and not tags, and commits update branch pointers but not tags -- but the underlying representation and effects are identical. (There's a separate "tag object" concept used for GPG-signing golden releases, but I won't get into that.) You can even create your own tag-like files under the refs directory and git will pick them up and use them like tags.


offline processing
All of these structures involve forward pointers: branches point at commits point at trees point at files. When you delete a branch (suppose you decide to stop tracking some upstream source), how do you tell whether the data it pointed at is still useful? Git's solution is just to require a separate occasional garbage collection step that I imagine does the natural mark and sweep.
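A minimal sketch of what such a collector might do, assuming a plain mark-and-sweep (the object graph and ref names here are made up):

```python
# Toy mark-and-sweep over git's forward-pointer graph:
# refs point at commits, commits at trees and parents, trees at blobs.
objects = {
    "c2": ["t2", "c1"],   # commit -> tree + parent commit
    "c1": ["t1"],
    "t2": ["b1", "b2"],
    "t1": ["b1"],
    "b1": [], "b2": [],
    # Only reachable from a branch we just deleted:
    "c9": ["t9"], "t9": ["b9"], "b9": [],
}
refs = {"heads/master": "c2"}   # the deleted branch pointed at c9

def gc(objects, refs):
    live = set()
    stack = list(refs.values())
    while stack:                 # mark: walk every forward pointer
        obj = stack.pop()
        if obj not in live:
            live.add(obj)
            stack.extend(objects[obj])
    for obj in list(objects):    # sweep: drop whatever was never marked
        if obj not in live:
            del objects[obj]

gc(objects, refs)
assert "c9" not in objects and "b1" in objects
```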

When I first learned of this it seemed a bit ridiculous, but on further reflection it's actually sorta sensible; there are other useful processes (such as "repacking", which restructures the database to make it more space-efficient as well as sync-efficient) that also take enough time to run that you wouldn't want to do them "online" (in response to a user action). As I think Graydon observed, it's hard to beat git's sync speed when its "clone" operation literally involves shoveling compressed bytes directly off of disk over the socket. Contrast this with, for example, SVN's FSFS backend, which writes each commit as deltas into separate files, so that checking out even just the most recent version of a single file involves ferreting around in multiple files.


history rewriting
The other weird aspect of git that makes me think "content tracker" and not "version control" is that its developers are surprisingly cavalier about rewriting history. For example, consider the "git rebase" command. As I understand it, this command rolls your branch back to the point where it diverged, jumps forward to the upstream head, then reapplies your branch's changes on top. (If that's not clear, there's a nice ASCII art diagram in the man page.) Instead of having history represent the flow of development (where there's a fork and then the forks meet again), it re-linearizes two branches into a serial path. This makes your history "cleaner" -- after all, if the rebase worked without problems, the changes were independent anyway -- but it took me a while to grok because it seemed so strange to want to do this. (For example, it's not safe to do on any sort of shared repository.)
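Here's a toy model of that behavior (commit ids and structure invented for illustration). The key point it demonstrates: rebase doesn't move commits, it re-creates them with new parents -- hence new ids -- which is exactly why it's unsafe on shared history.

```python
# Commits are (parent, change) pairs with made-up ids.
commits = {
    "A": (None, "base"),
    "B": ("A", "upstream work"),   # upstream advanced to B...
    "X": ("A", "my change 1"),     # ...while my branch forked at A
    "Y": ("X", "my change 2"),
}

def history(commits, tip):
    out = []
    while tip is not None:
        out.append(tip)
        tip = commits[tip][0]
    return list(reversed(out))

def rebase(commits, branch_tip, onto):
    # Re-create each of the branch's own commits, in order, on top of
    # `onto`. The old commits are untouched; new ones get new ids.
    base = onto
    for old in history(commits, branch_tip):
        if old not in history(commits, onto):
            new = old + "'"
            commits[new] = (base, commits[old][1])
            base = new
    return base

tip = rebase(commits, "Y", "B")
assert history(commits, tip) == ["A", "B", "X'", "Y'"]
```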

But again, consider the "Linus's global army" basis of the system: when you're sending code upstream, it's your responsibility to provide a patch series that is as clean as possible. The rebase man page also discusses how to take an existing series of commits and re-construct them in a new branch, allowing you to clean up each commit and drop the unuseful experiments, so that the new branch can be what you submit.

(Personal aside: it doesn't seem that important to me to linearize history; in a sufficiently churning project you're going to have a branchy history and what you really need are tools to make that clearer. I always think of this picture from monotone-viz, showing a complicated project. On the other hand, this sort of behavior has prompted interesting-looking experiments, and in my mind that's always a good thing -- who knows what they'll discover.)


ecosystem and users
Git sorta gets a free pass in the n+1 space because it came with its killer app: Linux. Because of Linux, there is a surprising quantity of software built around git, such as repository browsers, GUIs, and importers from other systems. It also seems plausible to me (though I haven't thought it through) that its simple design makes it easier to write an importer for git -- or maybe it's just that there are more people contributing code. In any case, the svn bridge for git is the best I've yet seen, supporting both tracking an upstream svn repository and pushing commits back out.


windows support
Git pretty much requires Cygwin; the tools are written in a mixture of C, shell, and even Perl. There is a MinGW port (which doesn't quite work for me, though I think my computer may be broken) as well as efforts to make a more "native" port by rewriting the scripts in C -- even the MinGW port's installer includes stuff like bash.

Git, being written by Linus, is likely biased in its performance characteristics towards Linux. It's noticeably slower on Windows, but mostly because it's so fast on Linux.


using git
The main thing people rave about with git is its speed, and I can see why. At work we deal with a few agonizingly slow version control systems, with all the sorts of negative effects you'd expect: people adjust their workflow to avoid touching the VCS, which harms productivity as well as the processes we have for maintaining code quality, and people don't test all changes against clean checkouts because checkouts take too long. The difference between five seconds of latency and instantaneous can change your workflow entirely. In particular, git's branching is so lightweight that it's painless to create and switch between branches; the normal workflow for doing anything starts with creating a branch. If someone interrupts you with a quick fix, you can instantly flip your checkout to the state before the branch, apply the fix, and even rebase your branch on top of the fix so it's as if you inserted the fix into your history graph.


some negatives to be aware of
Git's "revert" command actually creates an undo patch. The "revert" seen in every other system is called "reset" here. (darcs still wins the prize for most gratuitously renamed commands.)

Git adds a middle layer between your code and committing it called "the index" or "staging area". I imagine this makes some aspect of the system easier, but it can make status messages confusing. If you look at the "reset" docs you'll see what I mean.

Git is low-level and tends to get pretty ugly when things go wrong. People are improving this rapidly, though (I guess 1.5 changed a bunch of the commands around) so I have hope this will change.


looking forward
The important things for me to realize about git were that (1) it'll never go away completely unless someone makes something significantly better for Linus, which is especially unlikely because git was made by him specifically for his workflow; (2) it's been adopted by other big projects like x.org and wine; and (3) there are a lot of people hacking in its space -- it moves quickly. Even if something like mercurial is "better" it's not significantly¹ so. To me it makes git sort of inevitable as the system of choice, despite its flaws and ugly corners.


¹ Googlers who are familiar with the Yegge/Ruby debacle will recall Sanjay's comment regarding Ruby: "a language that is not significantly different than Python". At first blush that seems almost inflammatory, but taken honestly it has some truth to it: roughly the same constructs, performance characteristics, tools, etc.