10:13 am, 21 Oct 07
What you need to know about git
This is a braindump about git. I don't describe commands, like how to add and remove files, because the git documentation already provides tutorials as well as translation guides for users of other tools. Instead, I'm trying to describe what distinguishes git from other VCSes in the same space (sort of a hobby of mine, see more posts) and what makes it interesting. This is the sort of thing I wish I could've read when I first started looking into this.
The place to start is at the philosophy, because that makes a lot of other decisions more clear. Imagine you're Linus. You're managing a huge code base with many contributors, but you don't really care much about version control as it currently exists -- you've been content in the past managing the whole thing by just patches and email. What do you need to get more work done with less pain?
With that in mind, git to me feels much more like a "content tracker" (their term) than a "version control system". It starts with a content-addressable file system as its primitive and then adds the minimal layer of glue on top of it to support some workflow, but above all the focus is speed and simplicity. For example, "git clone" is a shell script that calls curl and rsync, among others. For another example, git more or less doesn't handle renames in any sort of principled way, nor do its developers care much about merge algorithms -- I recall reading some thread where Linus argued that it's better to report a conflict than to do something clever, because once a merge starts getting fancy you want a human to look it over anyway. The whole thing still feels super weird to me because the similarity to monotone is obvious but the end result is really different.
object store
Let's start with the repository representation. (You can skip this section if you already get monotone or mercurial, because as I understand it this is the bit of monotone that everyone borrowed.) Glossing over the underlying on-disk representation, which you don't need to care much about, imagine you can store and retrieve blobs of bytes keyed by the SHA-1 of their contents. You can build pointers between blobs by storing collections of SHA-1s, which themselves get new SHA-1s representing the collection. Suppose you've stored two blobs with SHA-1s A and B; you can then store a blob whose contents is literally just the string "A\nB" and hash that, getting SHA-1 C. When you pull C out again, its contents are exactly the pointers you need to retrieve the original two blobs via their SHA-1s. The important property of SHA-1s is that they are (hopefully) unforgeable, which means that the SHA-1 C is all you need to uniquely identify the original blobs.
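To make this concrete, here's a rough sketch using git's low-level plumbing commands (run it inside a scratch repository made with "git init"; the file names hello.txt and world.txt are made up):
A=$(echo hello | git hash-object -w --stdin) # store a blob, get back its SHA-1
B=$(echo world | git hash-object -w --stdin)
git cat-file -p $A # the SHA-1 alone is enough to get "hello" back out
# A tree is itself just another stored object whose contents are pointers to A and B.
T=$(printf '100644 blob %s\thello.txt\n100644 blob %s\tworld.txt\n' $A $B | git mktree)
git cat-file -p $T # prints the two entries, each carrying its own SHA-1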
Git stores four sorts of objects. The important ones are (glossing over some details here):
- Files (git calls these "blobs"), whose contents are stored directly (without the filename or other metadata).
- Trees, which represent directory structures. Imagine a text file that contains a sequence of lines "<path/for/file> <file SHA-1>".
- Commits, which represent history. Imagine a text file that contains a commit message followed by the SHA-1s of a tree and of any parent commits.
The important thing to note is that a single SHA-1 of a commit, because of the chained SHA-1 pointers, uniquely identifies not only the entire state of a tree of files, but also all previous states of the tree. One SHA-1 can identify an entire history. (But also note there are no provisions for tracking renames; git actually tries to identify renames by looking for similar file content across different versions(!).)
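You can follow the chain yourself in any repository with at least one commit; "git cat-file -p" pretty-prints whatever kind of object you hand it (a sketch, output abridged):
git rev-parse HEAD # the one SHA-1 that names the current commit -- and therefore the whole history
git cat-file -p HEAD # a commit: "tree <sha>", "parent <sha>", author, committer, message
git cat-file -p 'HEAD^{tree}' # that tree: one line per entry, each with its own SHA-1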
Git's underlying store of these objects has some nice properties, like how object files are only ever added and never modified. (This means it can make hardlinks between copies of local repositories without needing to implement copy-on-write.)
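For instance, cloning a repository that lives on the same disk can share object files outright (a sketch; git does this by default for local paths):
git clone --local original copy
# The object files under copy/.git/objects are hardlinks into original's store; that's safe
# because neither repository will ever modify an existing object file in place.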
branches and tags
Here's where git starts to diverge from monotone. You have this pool of objects and SHA-1s, and you need to know which SHA-1 to start at to do a checkout. Git uses text files, stored outside of the SHA-1 database in .git/refs/, that each contain the SHA-1 of a commit to start from. A new repository has one such file: heads/master, aka the "master" branch. If you create a branch, all that does is make a new refs file. Commits "on" a branch add all the objects described in the previous section and then change that branch's ref to point at the new commit, leaving the other refs alone.
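You can watch this happen directly (a sketch; in some setups refs get bundled into a single packed-refs file instead of loose files, but the idea is the same):
git branch topic # all this does is create the file .git/refs/heads/topic
cat .git/refs/heads/topic # one line: the SHA-1 of the commit it points at
cat .git/refs/heads/master # the same SHA-1, until one of the two branches advances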
If you've cloned from another source, its head is also represented with a branch, at remotes/origin/master. This lets you pull from upstream (with "git fetch") and look at the result before you attempt a merge. Git is the first DVCS I've seen that has a good story for looking at a remote person's work. You can pull their tree into your repository as a branch alongside your other branches, and use the normal git tools for diffing and merging between branches. If you don't like their code, you can throw away the branch without affecting your own code at all.
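The fetch-then-inspect flow looks roughly like this (a sketch):
git fetch origin # update remotes/origin/*; none of your own branches move
git log master..origin/master # commits they have that you don't
git diff master origin/master # the actual changes
git merge origin/master # take it -- or don't, and your code is untouched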
Tags are, to quote Linus, "100% the same thing" as branches: files that contain the SHA-1 of a particular commit. In practice they're handled differently by the git tools -- "git branch" only displays branches and not tags, and commits update branch pointers but not tags -- but the underlying representation and effects are identical. (There's a separate "tag object" concept used for GPG-signing golden releases, but I won't get into that.) You can even create your own tag-like files under the refs directory and git will pick them up and use them like tags.
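A sketch of both flavors (the signed variant assumes you have a GPG key configured):
git tag v1.0 # writes .git/refs/tags/v1.0: just the commit's SHA-1
cat .git/refs/tags/v1.0
git tag -s v1.0-signed # creates a separate "tag object" carrying a GPG signature;
# the ref under refs/tags/ then points at that object instead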
offline processing
All of these structures involve forward pointers: branches point at commits point at trees point at files. When you delete a branch (suppose you decide to stop tracking some upstream source), how do you tell whether the data it pointed at is still useful? Git's solution is just to require a separate occasional garbage collection step that I imagine does the natural mark and sweep.
When I first learned of this it seemed a bit ridiculous, but on further reflection it's actually sorta sensible; there are other useful processes (such as "repacking", which restructures the database to make it more space-efficient as well as sync-efficient) that also take enough time to run that you wouldn't want to do them "online" (in response to a user action). As I think Graydon observed, it's hard to beat git's sync speed when its "clone" operation literally involves shoveling compressed bytes directly off of disk over the socket. Contrast this with, for example, SVN's FSFS backend, which writes each commit as deltas into separate files and makes checkout of even just the most recent version of a single file involve ferreting around in multiple files.
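For reference, both maintenance jobs are a command away (a sketch):
git count-objects -v # how many loose objects and packs the store currently holds
git gc # repack everything into packfiles and prune old unreachable objects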
history rewriting
The other weird aspect of git that makes me think "content tracker" and not "version control" is that its developers are surprisingly cavalier about rewriting history. For example, consider the "git rebase" command. As I understand it, this command rolls your branch back to the point where it diverged, jumps forward to the upstream head, then reapplies your branch's changes on top. (If that's not clear, there's a nice ASCII art diagram in the man page.) Instead of having history represent the flow of development (where there's a fork and then the forks meet again), it re-linearizes the two branches into a single serial path. This makes your history "cleaner" -- after all, if the rebase worked without problems, the changes were independent anyway -- but it took me a while to grok because it seemed so strange to want to do this. (For example, it's not safe to do on any sort of shared repository.)
But again, consider the "Linus's global army" basis of the system: when you're sending code upstream, it's your responsibility to provide a patch series that is as clean as possible. The rebase man page also discusses how to take an existing series of commits and reconstruct them on a new branch, allowing you to clean up each commit and remove dead-end experiments, so that the new branch is what you submit.
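A sketch of both uses ("topic" is a made-up branch name):
git checkout topic
git rebase master # replay topic's commits on top of master's current head
git rebase -i master # the same, but first opens an editor where you can reorder,
# squash, or drop individual commits to clean up the series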
(Personal aside: it doesn't seem that important to me to linearize history; in a sufficiently churning project you're going to have a branchy history and what you really need are tools to make that clearer. I always think of this picture from monotone-viz, showing a complicated project. On the other hand, this sort of behavior has prompted interesting-looking experiments, and in my mind that's always a good thing -- who knows what they'll discover.)
ecosystem and users
Git sorta gets a free pass in the n+1 space because it came with its killer app: Linux. Because of Linux, there is a surprising quantity of software built around git, such as repository browsers, GUIs, and importers from other systems. It also seems plausible to me (though I haven't thought it through) that git's simple design makes it easier to write importers for it -- or maybe it's just that more people are contributing code. In any case, the svn bridge for git is the best I've yet seen: it supports tracking an upstream svn repository and pushing commits back out.
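The basic round trip through the svn bridge looks roughly like this (a sketch; the URL is made up):
git svn clone http://svn.example.com/project # import the svn history into a git repository
git svn rebase # fetch new upstream revisions, rebasing your local work on top
git svn dcommit # push your local commits back out as svn commits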
windows support
Git pretty much requires Cygwin; the tools are written in a mixture of C, shell, and even Perl. There is a MinGW port (which doesn't quite work for me, though I think my computer may be broken) as well as efforts to make a more "native" port by rewriting the scripts in C -- even the MinGW port's installer includes stuff like bash.
Git, being written by Linus, is likely biased in its performance characteristics towards Linux. It's noticeably slower on Windows, but mostly because it's so fast on Linux.
using git
The main thing people rave about with git is its speed, and I can see why. At work we deal with a few agonizingly slow version control systems, and that has all the negative effects you'd expect: people adjust their workflow to avoid touching the VCS, which harms productivity as well as the processes we have for maintaining code quality; people don't test all changes against clean checkouts because checkouts take too long. The difference between five seconds of latency and instantaneous can change your workflow entirely. In particular, git's branching is so lightweight that it's painless to create and switch between branches, and the normal workflow for doing anything is to start by creating a branch. If someone interrupts you with a quick fix, you can instantly flip your checkout to the state before the branch, apply the fix, and even rebase your branch on top of the fix so it's as if you inserted the fix into your history graph.
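A sketch of that interruption workflow (branch names made up; it assumes the feature work is committed, or stashed, before you switch away):
git checkout -b feature # start the new work on its own branch
# ...someone needs a quick fix...
git checkout master
# ...make the fix...
git commit -a -m "quick fix"
git checkout feature
git rebase master # the branch now sits on top of the fix, as if it had always been there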
some negatives to be aware of
Git's "revert" command actually creates a new commit that undoes an earlier one. The "revert" seen in every other system is called "reset". (darcs still wins the prize for most gratuitously renamed commands.)
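Concretely (a sketch; file.c stands in for whatever you touched):
git revert HEAD # creates a *new* commit that undoes the last one
git reset --hard HEAD # what most systems call "revert": throw away your local modifications
git checkout -- file.c # the same for a single file (restores it from the index)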
Git adds a middle layer between your code and committing it, called "the index" or "staging area". I imagine this makes some aspects of the system easier to implement, but it can make status messages confusing. If you look at the "reset" docs you'll be confronted with this.
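A sketch of where the index sits in the flow (foo.c is a made-up file name):
git add foo.c # stage foo.c's current contents into the index
git diff # working tree vs. index: what you have not staged yet
git diff --cached # index vs. last commit: exactly what the next commit will contain
git commit # records only what's in the index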
Git is low-level and tends to get pretty ugly when things go wrong. People are improving this rapidly, though (I guess 1.5 changed a bunch of the commands around), so I have hope this will change.
looking forward
The important things for me to realize about git were that (1) it'll never go away completely unless someone makes something significantly better for Linus, which is especially unlikely because git was made by him specifically for his workflow; (2) it's been adopted by other big projects like x.org and wine; and (3) there are a lot of people hacking in its space -- it moves quickly. Even if something like mercurial is "better" it's not significantly¹ so. To me it makes git sort of inevitable as the system of choice, despite its flaws and ugly corners.
1 Googlers who are familiar with the Yegge/Ruby debacle will recall Sanjay's comment regarding Ruby: "a language that is not significantly different than Python". At first glance that seems almost inflammatory, but taken honestly it has some real truth to it: roughly the same constructs, performance characteristics, tools, etc.
Well both python and ruby share one great advantage over perl - they are a lot more human readable ;-)
People continually claim that this is no issue at all.
I enjoy this attitude while I watch how ruby and python grow whereas perl shrinks as far as new users go. Remember, people grow older, fresh blood needs newcomers ;-)
We purchased a set of school books from a visiting European student. One of the pictures featured in a few of them looks strikingly like your "buddy icon" or whatever it's called.
</OT>
darcs commands
Evan, thanks for a nice overview.
Darcs now has some aliases and stubs for commands. For example, move is an alias for mv, rm is a stub that tells you to just delete the file, and commit explains the difference between record and push/pull. Any other suggestions for aliases and stubs?
Re: darcs commands
It always seemed to me it'd be pretty easy to make commands unified: just do whatever CVS does. I like darcs -- I even used it for a lot of projects -- but giving commands names like "whatsnew" feels almost irresponsible given that I have to deal with a bunch of similar-but-different systems already.
Re: darcs commands
I can understand the annoyance that this provokes, but maybe we should give David Roundy the benefit of the doubt. For example, CVS commit conflates the notions of saving your work locally and remotely. In that light, it may be wise to scrupulously avoid using the word "commit" and go with something different, like "record", because the two are really not the same thing. Not to say that it was the right choice, just that it may have been more well-intentioned than it was whimsical :-)
Likewise, "whatsnew" doesn't directly correspond to anything in CVS:
* cvs diff is the same as darcs diff
* cvs update -n does more or less the same thing as whatsnew -s
* but nothing really shows local changes in the darcs-specific manner
So given that darcs commands occasionally 'split up' certain CVS commands and amalgamate parts of other ones, renaming them probably seemed like the responsible thing to do at the time. On the other hand, he could also have gone the Mercurial route wrt "commit", and just accepted that the same commands in different systems have subtly different meanings... which is sort of inevitable anyway. I guess accepting that kind of semantic trade-off in return for added learnability and comparability makes sense too. Anyway, don't get me wrong, I'm not trying to drag you into an angels-dancing-on-pinheads debate.
(*) That said, "log" sounds like it would be a nice alias for "changes".
Re: darcs commands
I agree, both in that I should give him the benefit of the doubt and in that, despite the semantics being subtly different, it's implied by the fact you're running "foo diff" instead of "bar diff".
I suppose I could make the same argument in git's defense -- that "git reset" is different than "revert" in other VCSes (which it is; it has a bunch of weird options for different kinds of reverts) -- but in practice when I'm looking for revert-like behavior the first place I look is the "revert" man page. Which is why darcs made me grumpy in the first place.
(I feel like when I learned darcs, "diff" wasn't available, or behaved differently (?), because I seem to recall getting frustrated whenever I typed the wrong command. But maybe I just learned it wrong.)
- When you switch between branches there's no simple way to save your current un-checked-in work.
- When you switch between branches you have to rebuild a bunch of objects.
Because of this, I think the right way to do parallel development is still to use multiple directories. ISTR you can make them share some of the backing repo with a "references" section, but I don't think that means they share branches, which is what I'd really want.
git checkout branch1 # switch to the branch with work in progress
git add -u # add everything modified to the index
git checkout master # uncommitted changes follow you across the switch
git status # shows the same modifications as before
git stash apply # re-applies changes previously saved with "git stash"
oh, and
How and why do you use git-rebase? I still don't really get it.
Re: oh, and
Ah, I see. So it's not so much the rebasing as such as it is the continual merging from upstream. (You could imagine, instead of running "rebase" each time, running "merge" each time and then only rebasing when it's time to commit. Not saying that'd be easier than what you're doing, but that's roughly how other DVCSes do it when contributors are mailing in patches.)
Re: oh, and
I also rebase often to clean up the history of development so that changes are correct and grouped properly. A made-up example: I rename function 'foo' to 'bar' and commit it with a comment that says, "renamed foo to bar". Then I move on to the next feature or fix, and commit that, and move on to the next. Halfway through that, I find that I missed an instance of 'foo'! I commit my work in progress (or use git-stash), fix the missing 'foo' and commit that, then use 'git-rebase --interactive' to merge all the 'foo' fixes together into one clean commit. If I didn't use git-stash, then I'll use 'git-commit --amend' (another form of history rewriting) when I finally finish the feature that was in progress.
When my patches are pushed for review, all the pieces are correct and tell a coherent story. Sometimes I use git-rebase --interactive just to place temporally-separated changes which affect the same bits adjacent to each other, so that changes are in context.
Like you said, it's a lie, but the lie is much easier for others to understand than the truth.
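In command form, that cleanup looks roughly like this (a sketch; the commit count is whatever reaches back far enough):
git rebase -i HEAD~5 # opens an editor listing the last five commits; mark the stray
# "missed a foo" fix as "squash" to fold it into the original rename, and reorder
# lines to group related changes together
git commit --amend # or: fold currently staged changes into the previous commit instead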