evan_tech

11:04 am, 9 Mar 08

the git index

Here's an explanation of the git "index" or "cache" (two names for the same thing), which confused me in the past. I understood it technically, but I couldn't see a reason for it. I'm still not saying it's better or worse than anything else, but I do think it's interesting to think about.

When doing a commit, a version control system typically:
  1. Examines which files have changed in your working copy.
  2. Constructs data structures representing how these changes would be stored in your repo.
  3. Flushes those data structures to the repo.
The git "index" or "cache" is just the on-disk state of step #2, which git treats as a separate step. Most (all?) other systems make 2 and 3 a single step. git add does steps 1 and 2 for the files you name; git commit only does step 3; git commit -a does all three at once for the files git already tracks.
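
To make the split concrete, here is a toy sketch in Python. It is only a model: plain dicts stand in for git's real on-disk structures, and the function names add, commit, and commit_all are made up for illustration rather than taken from git's internals.

```python
# Toy model only: plain dicts stand in for git's real on-disk structures.
working_copy = {"a.txt": "one\n", "b.txt": "two\n"}
index = {}   # the state of step 2: what the next commit will contain
repo = []    # committed snapshots, oldest first


def add(path):
    """Steps 1 and 2 for one file: copy its current content into the index."""
    index[path] = working_copy[path]


def commit():
    """Step 3: flush the index to the repo without reading the working copy."""
    repo.append(dict(index))


def commit_all():
    """Roughly `git commit -a`: restage every already-staged file, then flush."""
    for path in list(index):
        add(path)
    commit()


add("a.txt")
working_copy["a.txt"] = "one, edited after staging\n"
commit()      # records the staged version, not the later edit
commit_all()  # picks up the edit; b.txt was never added, so it stays out
```

After this runs, the first snapshot holds the content that was staged and the second holds the later edit, which is the whole point of keeping step 2 on disk.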

Contrast these two concepts. #1: Most (all?) version control systems let you mark files "for add", which sets a flag somewhere in the repo's internals saying that the file should be included in the next commit. The for-add state is written to disk, so you can later run other commands that examine and change it. #2: Systems like darcs let you "cherry-pick" changes when committing: darcs record walks through the diff between your working tree and the repo and asks you, for each change, whether you want it to be part of this commit. This set of cherry-picked changes is only kept in memory, for the duration of the interactive prompting.

Git went to the other extreme: all state* that becomes part of a commit is first put into the index, and then the commit command just flushes that state without even examining your working copy. It's an interesting design decision, because it means that the darcs-style cherry-picking of changes can be added as an afterthought feature (which they have done with git add -i, though it's way clunkier than darcs).
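
A rough sketch of why cherry-picking can be bolted on afterwards, again as a toy Python model and not how git add -i is actually implemented. A "hunk" here is simplified to a (line number, new text) pair, and the helper names hunks and stage are invented for this example.

```python
# Toy model: a "hunk" is just (line_number, new_text); real hunks are richer.
committed = ["alpha", "beta", "gamma"]  # the file as of the last commit
working = ["ALPHA", "beta", "GAMMA"]    # the file in the working copy


def hunks(old, new):
    """List the lines that differ (assumes equal length, for simplicity)."""
    return [(i, n) for i, (o, n) in enumerate(zip(old, new)) if o != n]


def stage(old, chosen):
    """Build index content by applying only the chosen hunks to the old file."""
    staged = list(old)
    for i, text in chosen:
        staged[i] = text
    return staged


# Stand-in for the interactive prompt: accept only the first hunk.
all_hunks = hunks(committed, working)
index = stage(committed, all_hunks[:1])

# The commit step is unchanged: it copies the index and never reads `working`.
new_commit = list(index)
```

The selection logic lives entirely in how the index gets filled; the commit step needs no knowledge of it.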

Merging is also a bit interesting. Software that does three-way merges often (always?) also has the #2-style interactive behavior, where you must resolve each set of conflicts during the merge command itself. Git, again, keeps this state in the index. When a merge conflicts, it dumps you back to the shell immediately, but the index stores all three versions of the file (in the case of a three-way merge). git mergetool then lets you run your merge tool "offline" on a given conflicting file by reading the three versions back out of the index. Until you've resolved the conflicts and updated the index to reflect that, you can't commit (this is like the "C" state in subversion).
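
Here is a toy Python model of a conflicted index, to show how the "offline" resolution can work. The stage numbers follow git's convention (1 = common ancestor, 2 = ours, 3 = theirs, 0 = resolved); everything else, including the file names and the unmerged, resolve, and commit helpers, is invented for illustration.

```python
# Toy model of a conflicted index, keyed by (path, stage).
# Stage numbers follow git's convention; the rest is made up.
index = {
    ("notes.txt", 1): "hello\n",            # common ancestor
    ("notes.txt", 2): "hello from us\n",    # our side
    ("notes.txt", 3): "hello from them\n",  # their side
    ("README", 0): "unchanged\n",           # merged cleanly
}


def unmerged(index):
    """Paths that still have conflict stages in the index."""
    return sorted({path for (path, stage) in index if stage != 0})


def resolve(index, path, content):
    """Replace the three conflict stages with a single resolved entry."""
    for stage in (1, 2, 3):
        index.pop((path, stage), None)
    index[(path, 0)] = content


def commit(index):
    """Refuse to flush the index while any path is unmerged."""
    if unmerged(index):
        raise RuntimeError("unresolved conflicts: " + ", ".join(unmerged(index)))
    return {path: content for (path, _), content in index.items()}


try:
    commit(index)
    refused = False
except RuntimeError:
    refused = True  # the conflicted index cannot be committed

# A merge tool can read all three versions back out at any later time.
base, ours, theirs = (index[("notes.txt", s)] for s in (1, 2, 3))
resolve(index, "notes.txt", ours.replace("us", "both of us"))
snapshot = commit(index)  # succeeds now that nothing is unmerged
```

Because the three versions sit in the index, nothing about the resolution has to happen while the merge command is still running.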



* Except for the commit message, I think?