The Risks of Distributed Version Control
It’s funny how times change. When we started writing Subversion five years ago, CVS was the big evil beast that we were aiming to “subvert”. These days, while Subversion still has a long way to go in performance and features, it has reached critical mass in the open source world. The users@ list has thousands of subscribers and has become self-supporting. Every major free operating system ships with a relatively recent version of Subversion. There are several books in the bookstore about Subversion. Major projects like KDE, Apache, and GCC have switched to it, along with dozens of others. When you run across an open source project using Subversion these days, it’s no longer a novelty. It’s become the default safe choice for most new projects.
And now, lo and behold, a whole new generation of version control systems has appeared on the horizon: arch, codeville, monotone, bazaar-ng, svk, git, mercurial. These are the new kids in town — the Distributed Version Control systems — and they aim to unseat the establishment. Yes, you heard right: Subversion is now The Man. I wonder when that happened? 🙂
What makes this new generation of systems fundamentally different is that they take the idea of “disconnected operations” to the extreme. Every user has an entire copy of the repository — 100% of a project’s history — stored on the local computer. Each person is effectively an island unto themselves. Users connect their private repositories together in any way they wish and trade changes like baseball cards; the system automatically tracks which changes you have and which ones you don’t.
There’s something fresh and self-empowering about this model, because it’s a superset of CVS and Subversion’s traditional single-repository model. An open source project can decide that exactly one repository is the Master, and expect all participants to push and pull changes from that master repository as needed. Of course, a project can also organize itself into more interesting shapes: a tree-like hierarchy of repositories, a ring of repositories, or even just a randomly connected graph. It’s tremendously flexible.
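For the unfamiliar, a day in the life under this model looks roughly like the following Mercurial session (the URLs and repository names here are invented purely for illustration):

    hg clone http://project.example.com/repo     # copy 100% of the project’s history locally
    cd repo
    hg commit -m "my change"                     # commits land only in my private repository
    hg pull http://alice.example.org/repo        # trade changes with another developer
    hg push http://project.example.com/repo      # or push back to whatever repository the project treats as the master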
Proponents of these systems tend to be a bit fanatical about their “superiority” over today’s centralized systems. Over and over, I hear testimonials like this:
“It’s great! If I want to implement a new feature, I don’t need to have commit access at all. I have my own private copy of the repository, so I can write the whole thing by myself. I commit changes to my private repository as often as I want until it’s all finished. Then I can present the whole thing to everyone else.”
This user is describing a great convenience, but I view it in a slightly darker light. Notice what this user is now able to do: he wants to crawl off into a cave, work for weeks on a complex feature by himself, then present it as a polished result to the main codebase. And this is exactly the sort of behavior that I think is bad for open source communities. Open source communities need to work together. They need to agree on common goals, discuss designs, and constantly review each other’s work.
In the Subversion community, we call the behavior above “dropping a bomb”. It’s considered anti-social and anti-cooperative. Usually the new feature is so big and complex, it’s nearly impossible to review. If it’s hard to review, then it’s hard to accept into the main codebase, hard to maintain code quality, and hard for anyone but the original author to maintain the feature. When this happens, we typically scold the person(s) for not working out in the open.
Good Behavior, on the other hand, involves coming to the community with a design proposal. After some discussion, we either (1) ask the developer(s) to submit a series of patches as work progresses, or (2) give them a private branch to work on. They needn’t have commit access to the core code — a branch is all that’s needed. That way the larger community can review the smaller commits as they come in, discuss, give feedback, and keep the developers in the loop. The main goal here is never to be surprised by some huge code change. It keeps the community focused on common goals and aware of each other’s progress.
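In Subversion terms, that usually amounts to something like the following (the repository URL and branch name are made up for the example):

    svn copy http://svn.example.com/repos/trunk \
             http://svn.example.com/repos/branches/new-feature \
             -m "Create a branch for the new feature work"
    svn checkout http://svn.example.com/repos/branches/new-feature
    cd new-feature
    # ...hack in small steps...
    svn commit -m "First small, reviewable piece of the feature"

Each of those small branch commits is visible to the whole community as it happens, which is exactly what makes continuous review possible.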
So while most people say, “Isn’t it great that I can fork the whole project without anyone knowing!”, my reaction is, “Yikes, why aren’t you working with everyone else? Why aren’t you asking for commit access?” This is a problem that is solved socially: projects should actively encourage side-work by small teams and grant access to private branches early and often.
I probably sound like Mr. Anti-Distributed-Version-Control, but I’m really not. It’s definitely cool, and definitely convenient. I just think it’s something that needs to be used very carefully, because the very conveniences it provides also promote fragmentary social behaviors that aren’t healthy for open source communities.
For more on this subject, see this essay by Greg Hudson — it’s the piece that originally got my head nodding on this topic. Also relevant is Karl Fogel’s excellent new book, Producing Open Source Software. It’s all about managing and promoting healthy open source developer communities.
Blush. When I implemented the preprocessor macro support in GDB, that was definitely “dropping a bomb”, in all the ways you cite: I didn’t tell anyone I was working on it, I didn’t solicit any input on the design, and I just posted a big honkin’ patch and said, “There — I did it.”
I was really afraid when I started that if I did it in a more open way, people would expand the problem and declare limited solutions inadequate. I was right, in that people practically strung me up after I posted the patch (I now see that they aren’t the only group who would have reacted that way) because I hadn’t used libcpp (GCC’s new preprocessor, packaged as a library) to do my expansion — I’d written my own quick and dirty (and incorrect, as I’d known it would be) expander. I pointed out that I’d isolated the expander behind a very simple, small interface to make it easy to replace, but that didn’t really calm people down much.
I did the work by myself because I felt like the time required to argue and persuade people would have pushed me over the limit of how much initiative and energy and time I actually had. The whole thing was written in a few weekends and part of a spring break. (And it is not sloppy; the expander is hairy, but carefully commented.) And libcpp, at the time, wasn’t actually its own library; it was a bunch of source files that could theoretically be pulled out to be stand-alone, and there was an interface there, but they shared GCC’s build machinery. So I would have had to undertake that project as well. I was afraid the unfun stuff would soak up my gumption, and preprocessor macro expansion would remain undone for another ten years. After all, it was a huge gaping hole in GDB’s C support, and had been there since the beginning, and nobody had ever really tackled it.
So, I dunno. I’m a libertarian with a fascist heart.
I agree with pretty much everything you said, except that I’d stress that it isn’t just Open Source projects that need this kind of communication, it’s all projects. It’s just as bad to crawl off into a hole and produce a 20K line patch bomb in a proprietary project as it is in Open Source.
A distributed version control tool doesn’t force you to hide your work before the merge happens. You can publish your tree (and most open source developers are doing exactly that!). The nice thing about distributed systems is that the tools allow you to work disconnected, commit things into your own tree, and track your own work. When you want to merge, you merge the whole history, which is very nice.
“Isn’t it great that I can fork the whole project without anyone knowing!” Yeah, you can do that even with Subversion — does that make Subversion bad?
This blog entry should be about _developers_ who think that way, not their tools.
I have reservations about reservations about decentralized versioning. (By the way, Ben, thanks for the book link from your blog!)
While it’s easy for us to count the number of times someone shows up with an unexpected nuclear power plant, it’s much harder to count the times someone *didn’t* start working on some experimental feature because they didn’t have a convenient way to work silently in parallel with the mainline project. I don’t think it’s true that features always benefit from early design input, any more than features always benefit from a sheltered and private upbringing. The truth is much more complex: at different stages and with different people, different kinds of community engagement (or lack of it) are called for.
Also, I think Greg Hudson’s objections to BitKeeper may be different from Ben’s. Greg’s objecting to the Single Integrator aspect of Linux kernel development, not to the Skunkworks Power Plant aspect of decentralized version control systems. And actually, he’s not really objecting to anything, he’s just saying that Linux’s development model is highly unusual and not appropriate for most projects.
Regarding doing stuff in private:
Each developer knows her own gumption, knows what it takes to keep herself going. She has to take that into account when deciding how to implement something, the same way even the most altruistic politician has to keep electability in mind. Her own gumption is a technical consideration, because if she doesn’t keep coding, she can’t implement anything (“if you’re not in power, you can’t do good”).
One could argue that she should instead learn not to have her gumption sucked by long and fractious design discussions… But honestly, whose gumption isn’t sucked by long and fractious design discussions? In real life, we only subject ourselves to those *after* we’ve satisfied ourselves that the feature is worthwhile and feasible. Depending on the feature and how it fits in, that sometimes means staying in the skunkworks for a long time.
There’s also other people’s gumption to consider: sometimes you want to show up with a nuclear power plant just so that people who might otherwise have objected by default will instead only object where they feel something very important needs correcting (in other words, doing a lot of work in advance can be a way of raising the quality of the criticism the work faces later).
When working in private is a problem, it is a social problem, as Ben says. Therefore, I think it should have social solutions, not tool-enforced solutions. I’d rather have a tool that lets me do what I judge I need to do, and have humans who help me hone that judgement so that I use the tool well.
In other words, +1 on distributed version control. Not that I’ve ever used it, of course. 🙂
Replying to arekm:
It’s not about what tools force people to do or not do. It’s about the behaviors that the tools naturally encourage: they make certain things easy and other things harder.
In a distributed system, every user starts with a fully legitimate fork of the project. The ‘official’ tree is purely a matter of social convention. And by default, all the work you do is hidden: it goes only into your private repository. You have to work a little harder to push your work to the ‘official’ tree or to get others to pull from your own.
In a centralized system, there’s only one repository. Forking is technically hard to do; it’s often hard to get a copy of all of the project’s history. By default, all the work you do is public — it’s published to the one shared repository. It’s awkward to work privately, because it means building up a huge patch without saving any checkpoints.
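To make the contrast in defaults concrete (commands simplified, with the distributed side shown in Mercurial purely as an example):

    # Centralized: committing *is* publishing.
    svn commit -m "fix off-by-one"   # lands immediately in the one shared repository

    # Distributed: committing is private by default.
    hg commit -m "fix off-by-one"    # recorded only in my local repository
    hg push                          # the extra, deliberate step that makes it public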
Do people using centralized systems sometimes develop code bombs in private anyway? Do they sometimes fork? Sure, but it’s rare. Do people using distributed systems work in unified, cohesive teams? Hard to tell, since there aren’t many such systems in widespread use. I predict it will be harder to keep a project socially centralized when the tool makes it easier to be fragmented. But the core MySQL developers (whom I’ve met) use BitKeeper and are incredibly organized as a team. My instinct, though, is that they’re a rare exception. I guess time will tell.
This article does have a point; but I feel like it badly misrepresents what DVCSes are trying to do. Nobody I know who works on DVCS is doing it because they want to make it easier for people to drop bombs; on the contrary, we worry about these things too. But we don’t let it consume us — there are, after all, other ways to respond, like simply saying “no, this sucks” and not accepting the patch until they fix it. (You know, like we do when people do the same thing now.) Yes, DVCSes change the social landscape there, they make certain things easier, etc., but there are lots of other mechanisms that can be used to make collaboration run smoothly. Even if the practices we happen to be used to now _don’t_ carry over perfectly, well… maybe there are other ones that can pick up the slack. We should be talking about both sides of this question, not just the first part.
So, yeah, it’s still a lot of work — why bother? Well, because DVCSes are bringing a lot to the party besides ease of bombing. Like I said, ease of bombing is completely not the point. The DVCS community is incredibly lively these days, and making huge strides in the state of the art in all kinds of basic ways. User-friendly branching models. Real merge tracking. Sophisticated new merge algorithms. End-to-end data integrity checking and digital signatures. Simpler designs that do more — the “monotone style” design, later picked up by git and mercurial, and arguably bzr-ng, is simple, elegant, powerful, and hey, it turns out you’d have to work hard to make it _not_ distributed. A nice thing about distributedness is knowing that you have no single point of failure. Monotone, the system I have the most familiarity with, puts a huge priority on robustness — a VCS has higher reliability requirements than any other software I can think of outside of, like, aerospace or medical contexts. (And they need VCS too!) So when the tool is used in the natural way, it constantly produces (trustable!) backups of everything related to a project, scattered across every developer’s hard drive, and failing over to one of those backups is, in practice, just as natural as anything else you do with the tool. We think that alone more than pays for any pain that might come from distributedness.
And, hey, being able to commit on airplanes is sometimes handy too.
Finally, it needs to be said… this whole debate is somewhat irrelevant. The cat is out of the bag, and thanks to, well, “subversive” technologies like svk (or tailor, or several similar tools I know of in the pipeline), a project’s core team simply no longer has the _ability_ to prevent their users from using powerful tools in the privacy of their own home.
And, really, maybe that’s a good thing. I have every right to maintain a fork if that’s how I want to spend my time; the right to fork is, like, freedom number 0. And if I do choose to maintain a fork for a while, I have every right to use good tools while doing so.
I don’t mean to trivialize your worries; I hope I haven’t come off as doing so. Like I said at the beginning, how DVCS will affect the whole lifecycle of development is something that worries me too. But we need to look at the whole picture; I’m optimistic.
Right, back to debugging this merge algorithm…
I think you have the matter backwards. People do not develop “bombs” in isolation for fun. Bombs are a symptom of social and technical breakdown.
The breakdown is caused by the requirement that everyone agree on “the” set of legitimate branches. To achieve this requires quite a bit of technical and occasionally political argument. The mailing list is plagued by arguments about the official direction and the set of new work — often very tedious and futile arguments, concerned with strawmen and vapourware, dominated by back-seat programmers who won’t be doing the work anyway. Not always productive.
The gumption-sucking nature of such argument is, as Karl suggests, the reason people develop “bombs” in private. They can’t be bothered to go through the work it would take to convince people to ignore them while they do it in public.
Branching in distributed systems is zero cost — both technically and politically — so it fixes this breakdown. Think of it like an environment in which starting a business is very cheap and easy, and morally encouraged. What’s the result? Lots of competing, public branches. Not hidden skunkworks projects. Ongoing, healthy, actively-maintained ecological diversity, with very cheap operations for stealing one another’s ideas and developers. A marketplace of branches, rather than a planned set. The result is greatly increased development and greatly reduced argument. Less talk, more code.
Monotone, for example, has some 40 branches. This is a relatively small project: 40kloc, a dozen active developers. Many of the branches are one-man affairs, doing things we didn’t all initially think possible, or think to be good ideas, or want to commit fully to. Several are still unclear. Others turned out to be good ideas.
But while they are in development, they are all public. No argument has to happen: the developers start working, and other developers (and users) start observing and playing with their work. If it’s a bad idea, it’ll die on its own. We can all see the branches maturing, even those which looked to be a lot of work. We can comment on them, commit fixes, check them out and see how the work is coming along, see how well the new feature works. We can collaborate on making the new features robust.
If they turn out to be good ideas, it’s usually obvious. Nothing neutralizes potential arguments like working code with new features, improved performance, or higher testsuite scores.
Moreover, branches are easily kept up-to-date against the trunk (or one another), with lots of repeated merging. This takes very little work with the right merging technology. The “final” merges of the completed branches have invariably been so smooth that it’s almost un-noticed, totally un-bomb-like: everyone’s seen it coming for quite a while, and made all the changes they feel are necessary.
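The mechanics are roughly as follows, sketched here with git commands for familiarity (the branch name is invented; the same pattern applies in the other tools):

    git checkout some-feature-branch
    git merge master                  # fold in whatever has landed on the trunk lately
    # ...repeat every few days, so that the eventual merge back...
    git checkout master
    git merge some-feature-branch     # ...is small and unsurprising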
graydon: thanks for the excellent response. Your description is really eye-opening for me, and even makes me think about recanting my whole post. It sounds like you’ve described a world that is just as “socially cohesive” as communities that use a centralized system.
Perhaps my experience with subversion is really unusual: it’s similar to what you describe. Branches are cheap, developers create them all the time, people watch and review them, and so on. I’ve been making this assumption that it’s subversion’s centralized nature that keeps our community focused and generally agreeable, and prevents random splintering. But you’ve described the same sort of community within Monotone, and are claiming that the *decentralized* system is responsible for the exact same set of nice community behaviors.
I’m starting to wonder if the SCM tool is simply irrelevant to a community’s nature!
To Ben:
To quote the documentation of the granddaddy of them all: “CVS is not a substitute for developer communication”. Replace CVS with Subversion or Monotone or TLA or whatever-is-the-flavor-of-choice, and it’s still true.
And “there are no technical solutions to social problems”.
So, the tool is actually irrelevant (mostly), the community will either work or not. The fact that a tool makes some (mis)behavior possible or downright easy (maybe even enticing) has nothing to do with it, since, well, when a community doesn’t work, people will misbehave, no matter the tool.
These are very good points. The problem with most DVCSs is that they actively encourage branching. It’s their main mode of operation. Obviously, with some discipline, users can avoid branching and the ensuing painful merges.
However, this is not the problem of distribution–it’s the problem of policy.
My company, http://www.relisoft.com, makes a distributed VCS (Code Co-op. Windows only, sorry!), which takes a different approach. Instead of making it the responsibility of every developer to synchronize, it broadcasts changes to all project members. This enforces strict ordering of changes and discourages branching. Merges, if necessary, are done before check-ins, not afterwards. The person who makes the check-in has full, undiluted responsibility for its consistency.
This might seem like a tough policy, but it works very well in collaborative projects, where people actually want to work together in a concerted effort, e.g., for a company. On the other hand, such a model is probably unsuitable for open-source efforts where each developer is a free agent.
Although your arguments are probably valid, the title is wrong. One needs to read half of your article to find out that the dangers presented apply only to open source development projects.
I’ve used version control for 15 years, starting on an IBM mainframe. I used Visual Source Safe from Microsoft for two years (ouch). For open source work in the last 10 years I’ve used CVS, SVN, Arch, and git. Axiom, an open source project, uses all four in parallel. I also use SVN at work. Since I maintain Axiom in all four systems in parallel I know what the challenges are for each system on identical workloads.
On the purely technical aspects I have a few comments:
I realize that SVN is your pet project but I have to say I cannot recommend it to anyone. SVN loses work. Period. It gets into a “locked” state and the “cleanup” command never works. The only solution (used by myself and all of my co-workers at work) is to move the local copy, re-checkout the trunk, diff the trees and re-apply the patches. It doesn’t happen often but it costs me hours when it does. Source code control systems shouldn’t fail like this.
SVN is also glacially slow, as are CVS and Arch. It can take minutes to do reasonably minor checkins (less than a megabyte). I’ve had some checkins for work take 1/2 hour or more. I have no idea what it could be doing but it is doing it very slowly. And considering I’m working on a 3.4GHz machine with a direct network connection I’m fascinated by what could take so long.
SVN, like CVS, doesn’t seem to want to let me get rid of directories. I can’t imagine why. I did a major reorganization of the tree structure and now I have random zombie directories. Yes, I know I can tell SVN to ignore them on checkout but that misses the point.
SVN write-protects the .svn subdirectories. This broke our build when it tried to do updates by recopying the directories. The source code control system should not have any visible effects.
SVN has “properties” that took us a long time to get sorted. The line-ending problem for Windows and the execute properties are just painful “metadata” that I can handle in my build system. I don’t need this information in the source code control system. Try to stick to the main task of saving and managing my information, not my metadata.
SVN and Arch take up at least twice the space required. I can check out Axiom sources and when I remove the hidden directories I gain at least half the space back. I’m not sure why these two systems need local backup copies. I think Arch uses it for “offline” diffs. For large systems like Axiom this is expensive in disk space, bandwidth, and time.
SVN and CVS are fine for small branches but any major change that takes more than a week to create will eventually “get lost in the weeds”. While the major changes are happening in the branch the trunk will also get a lot of changes in parallel. Eventually there is no way I’ve found, except by hand, to try to get the branch work back into the main line. So for small and quick branches they work fine. For any large change they collapse. Arch is even worse about this but this is an SVN blog so I’ll keep to the point.
Now on to the social implications of SVN vs other systems:
A consequence of the last point is that SVN actually encourages forking. Once a branch gets too complex the effort required to merge it into the trunk exceeds the effort required to maintain the branch and a fork has implicitly happened.
SVN tracks “files” whereas git tracks changes. Thus if you move a function from one file to another git knows that but SVN does not. As a consequence, SVN is not well behaved when it comes time to refactor the codebase. In any large project there are times when the whole codebase gets refactored, usually more than once.
Revision numbers are global in SVN so activity on the branch will change the revision number, which makes it rather useless. And the branch does not seem to know where it branched from so I have to track this myself. The situation got even worse when we had a branch off a branch. I got lost in the weeds trying to untangle it all.
To its credit, SVN uses “changesets” so that updates can be grouped by semantic content. Unfortunately the branch/merge issues seem to make this less useful than it could be. In git I can create a branch to work on a complete “changeset” and merge it back smoothly. When I try to do that in SVN, for larger changesets that take weeks to develop, I find I end up doing the merge by hand. So SVN’s branch/merge weakness causes me to only make small changes and merge them back quickly. This means that I tend to change files rather than create “changesets”. As a social consequence it is harder to explain that this patch is really part of a larger whole. In git I tend to merge full changesets that represent a single idea developed over a longer period of time.
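To make that concrete, the day-to-day mechanics differ roughly like this (branch names and revision numbers invented for the example):

    # git: the topic branch *is* the changeset
    git checkout -b refactor-io       # start a branch for one idea
    # ...commit on it freely, for weeks if need be...
    git checkout master
    git merge refactor-io             # one merge; the whole history comes along

    # svn: I have to remember the branch's revision range myself
    svn merge -r 1234:1300 http://svn.example.com/repos/branches/refactor-io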
I am unaware of any SVN project (though this is clearly due to my limited experience) that has dozens of branches per developer. If each branch represents a changeset for a particular idea or bug fix and developers each have many changesets, I’d expect to see many hundreds of branches at any one time. Is there any project that has hundreds of branches in SVN?
Git, on the other hand, allows me to have many branches in parallel locally (I have about 30 open branches in my tree at the moment). Thus I can explore multiple problems and bug fixes in parallel. Merging works and I have not ever had a “lockup”. Git does not take up twice the space (in fact, it won’t keep two copies of the same file if the hash codes are the same). Git is blindingly fast. git-commit is local and git-push, which does network traffic, is so fast I often think it must have failed since it usually takes seconds. I don’t recall ever seeing a push that took a full minute. All git information lives in the root directory so I don’t have the read-only issue.
I must admit that having spent years in CVS and some further time in SVN I was well steeped in the central-repository mindset. It took months to make the mental switch to using a distributed system. It’s not a technical issue, it’s just that the “central” ideas were pretty firmly implanted. However, having climbed the mental hill I find that git has completely and deeply changed the way I work. I can’t often say that about any tool since I’m very resistant to change. Only Emacs and Lisp have changed my work habits as deeply as git has. I have the choice to work in any of the four systems and push the changes to the other three. SVN lost the head-to-head comparison against git.
I’m aware that you probably think Linus is a “git” and I saw him trash SVN in the Google video. It doesn’t seem fair but he actually does have an important point. I’d encourage you to try git for a while and steal every idea you can. It can only make SVN better. (Oh, and Darcs, which I studied for a while, has a nice calculus of patches that’s worth a look.)
Sorry for the negative views but I’m an experienced, continuous user of all four systems and I’d like them all to improve.
Tim Daly
About merging “bombs”: the cloning nature of a DVCS keeps a complete history of each bomb as it develops (provided the devs commit with a sensible frequency). That would help the review much more than a huge patch with only comments and emails to explain it.
I’ve been fascinated by version-control systems since SCCS days and found myself an early adopter on each wave of new technology — RCS, CVS, Subversion, and now DVCSes. I’ve been doing collaborative open development since before I popularized the “open source” label — in fact, since before RMS first uttered the phrase “free software”.
My thirty years of experience as a programmer and amateur ethnologist among the hackers says graydon has absolutely nailed this issue. Project coherence is a social property, not a technical one. Code bombs are a symptom of communication failure and social problems in the project group, not of the VCS’s mode of operation.
I’m planning to move one of the projects I lead from Subversion to Mercurial in the relatively near future, not because I dislike Subversion (in years of heavy use I’ve never seen the lockup and data-loss problems Tim Daly describes) but because I want disconnected operation. (The project is a GPS-monitoring daemon; a fair number of our devs are in exotic places where Internet access to the SVN repo is spotty and expensive.)
I have no worry at all that Mercurial is going to encourage code bombs and/or forking, because I know what the social dynamics of my dev group are. If those dynamics were to go bad, there is no way a centralized VCS would prevent code bombs or forking — nor even inhibit them much.
If one single tool, the DVCS, makes coding in isolation more natural, then use the OTHER tools to make coding coherently more natural: IRC, iirc, is one such (contrast IRC with e-mail). TikiWiki might be another.
While companies like Accurev already solved the problems of the old RCS style of version control years ago, I’m curious: is there a place for DVCS for teams where collaboration isn’t a de facto necessity?
Thanks.
Both points were valid and presented in a manner that helps people understand both sides. I am quite hesitant about the idea of decentralized versioning.
7 years later, all this worry seems … quaint. =)