evan_tech

I've been playing with monotone recently, bugging graydon for tech support (thanks!), and at one point in my struggle to understand it he observed that it felt "obvious" to him. It really struck me, because I can remember struggling (well, a bit) to understand CVS and then struggling (more here) to understand arch, but in retrospect both seem pretty obvious too. I think it's the sort of thing where I'm just not used to thinking about these sorts of problems, or about what behavior I'd like, because I'm accustomed to my current workflow and haven't critically examined it. So here I'll write down how I think about these things and maybe it'll make understanding easier for you.

With CVS (branches aside, because people don't generally use them) there is one repository and you think of it as the repository: your checkout is a snapshot of what's going on at the source, and all work is done relative to that. Diffs, updates, conflicts, and everything else are easy to understand in terms of your divergence from the "real" work, and the purpose of the repository is so you can share your changes with others.

Distributed systems diverge from that in a few interesting ways, beyond just the "make forking easy" aspect that they're normally described with. Arch's whole deal is that branches, independent of permission from the original source, are really painless to create, modify, and merge back in. So now the mental model is multiple separate repositories that can all push and pull changes from each other. (If you can really understand that, you've understood almost all there is to arch aside from its gratuitously weird naming conventions.)

What free branches mean in practice, though, is that commits are no longer about completed features. When I was committing to LiveJournal's CVS, for example, I wouldn't commit a change that added a new link to some page until I also had ready the resulting page, along with that page's implementation, schema changes, whatever. Until I'd completed all of that work, the repository went completely unused. By contrast, with arch, a new feature would be a new branch, and each separate working piece of the feature is committed to the branch. This difference means that your repository is actually used for tracking changes and revisions of files, instead of completed features. The arch tutorial really emphasized this, even encouraging you to follow its author's workflow of logging each change in a file that becomes the commit log.

(Right here it's arguable that svn's lightweight branches accomplish the same goal, but I don't think I agree. A previous post links to a more thorough discussion of it, but one practical way to think about it is again in terms of LiveJournal's development. An svn branch still lives in the central repository, so creating one takes commit access. There have been plenty of competent(?) people who've done good work for LJ who aren't necessarily the sort to ask for or be granted permission to commit: people like timwi, mart, or nikolasco. As I understand it, svk uses the arch model.)



What Graydon had argued before is that the implementation determines the workflow. That's true for CVS and arch, but their implementations are pretty simple: there's a natural mapping of repositories to directories holding your work. (One of arch's design points is that you can get all your code out using standard unix utilities like "tar" and "patch".) monotone's model is actually pretty crazy in comparison to all this: through the magic of SHA-1, every version of every file can be uniquely identified by the hash of its contents, and a given version of a branch is identified by the SHA-1 of a collection of file ids. Concepts like branches or even filenames (I'm unsure, here -- still new at this) are just metadata on a repository.
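Here's a toy sketch of that idea in Python. To be clear, this is not monotone's actual on-disk or manifest format; the function names and the listing layout are made up for illustration. It only shows how hashing file contents, and then hashing a listing of those hashes, gives every file version and every tree version an id.

```python
import hashlib


def file_id(content):
    """Identify a file version by the SHA-1 of its bytes."""
    return hashlib.sha1(content).hexdigest()


def manifest_id(files):
    """Identify a tree version by hashing a sorted listing of file ids.

    The "id  name" line layout is an assumption made for this sketch,
    not the format monotone really uses.
    """
    listing = "".join(
        "{}  {}\n".format(file_id(data), name)
        for name, data in sorted(files.items())
    )
    return hashlib.sha1(listing.encode("utf-8")).hexdigest()


tree_a = {"README": b"hello\n", "main.c": b"int main(void) { return 0; }\n"}
tree_b = dict(tree_a, README=b"hello, world\n")  # one file changed

# Identical bytes always get the same id, wherever they came from.
same_content = file_id(b"hello\n") == file_id(tree_a["README"])

# Changing any file changes the id of the whole tree.
ids_differ = manifest_id(tree_a) != manifest_id(tree_b)
```

Nothing in the ids depends on where the files live or who committed them, which is why things like branch names can sit on top as plain metadata.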

This means, among other things, that it's perfectly legal to commit a conflicting change to a branch. If you and I check out version A of a file, then you commit version B and I commit version C, we've simply created two heads. Merging these into a single head can be a secondary step, and this is arguably a good thing: if a version control system really is for tracking versions of files and I feel like I'm done with my version C, I want to be able to save C somewhere before I worry about fixing the conflicts. So now the mental model really is that the system just tracks the versions of files I'm working on in this huge pool and helps me merge, track ancestry, and exchange them with others.
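The A, B, C example can be sketched as a tiny ancestry graph. This is an illustration under my own assumptions (a plain dict mapping each revision to its parents, and a hypothetical heads() helper), not how monotone stores anything: a head is just a revision that nobody else lists as a parent.

```python
def heads(parents):
    """Return the revisions that no other revision lists as a parent."""
    all_parents = {p for ps in parents.values() for p in ps}
    return sorted(rev for rev in parents if rev not in all_parents)


graph = {"A": []}        # the version we both checked out
graph["B"] = ["A"]       # you commit B on top of A
graph["C"] = ["A"]       # I commit C on top of A; nothing is rejected

two_heads = heads(graph)  # B and C are both heads now

graph["D"] = ["B", "C"]  # the merge is a later, separate commit
one_head = heads(graph)   # only D is left as a head
```

The conflict doesn't block either commit; it just shows up as a branch with more than one head until somebody merges.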

I can also sync your repository, disconnect from the network, commit to my local copy, and later sync back all of my changes. This feature alone has pretty much converted me, 'cause I do a lot of my programming on my laptop without network availability.
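As a rough mental model (and only that: the real netsync protocol is a lot smarter about working out what each side is missing), a sync is each side copying over the revisions it doesn't have yet. The sync() function and the dicts below are hypothetical stand-ins for two repositories.

```python
def sync(mine, theirs):
    """Copy every revision that one side lacks over to the other side."""
    for rev_id, rev in list(theirs.items()):
        mine.setdefault(rev_id, rev)
    for rev_id, rev in list(mine.items()):
        theirs.setdefault(rev_id, rev)


yours = {"A": "initial import"}
laptop = {}

sync(laptop, yours)          # online: pull what you have
laptop["B"] = "offline fix"  # offline: commit to my local copy only
yours["C"] = "your change"   # meanwhile you keep working
sync(laptop, yours)          # back online: exchange in both directions
```

Because revisions are identified by content rather than by a central counter, neither side has to renumber anything when the two pools are combined.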

monotone's got a lot of other crazy things going on. An efficient netsync protocol is often near the top of the feature list, and there's some PKI stuff that I don't get yet -- I can pull your commits to my code into my repository but I think they don't count unless I accept your keys(?). (And if you wanna know true fear, look at the screenshot of the ancestry graph on this page.)

monotone definitely warrants further investigation. If you do play with it, you can pull my crossword code from neugierig.org, branch name org.neugierig.crossword.