The Risks of Distributed Version Control
It’s funny how times change. When we started writing Subversion five years ago, CVS was the big evil beast that we were aiming to “subvert”. These days, while Subversion still has a long way to go in performance and features, it has reached critical mass in the open source world. The users@ list has thousands of subscribers and has become self-supporting. Every major free operating system ships with a relatively recent version of Subversion. There are several books in the bookstore about Subversion. Major projects like KDE, Apache, and GCC have switched to it, along with dozens of others. When you run across an open source project using Subversion these days, it’s no longer a novelty. It’s become the default safe choice for most new projects.
And now, lo and behold, a whole new generation of version control systems has appeared on the horizon: arch, codeville, monotone, bazaar-ng, svk, git, mercurial. These are the new kids in town — the Distributed Version Control systems — and they aim to unseat the establishment. Yes, you heard right: Subversion is now The Man. I wonder when that happened?
What makes this new generation of systems fundamentally different is that they take the idea of “disconnected operations” to the extreme. Every user has an entire copy of the repository (100% of a project’s history) stored on the local computer. Each person is effectively an island unto themselves. Users connect their private repositories together in any way they wish and trade changes like baseball cards; the system automatically tracks which changes you have and which ones you don’t.
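To make the “baseball cards” idea a little more concrete, here is a toy sketch in Python. It is not how any of the systems named above actually work; the Repo class, its methods, and the change ids are all invented for illustration. Each repository is modeled as nothing more than a set of named changes, and a pull copies over whichever changes the other side has that you don’t. Real systems also record ancestry and have to merge conflicting changes, which this sketch ignores entirely.

```python
class Repo:
    """Hypothetical toy repository: just a mapping of change id -> description."""

    def __init__(self, name):
        self.name = name
        self.changes = {}

    def commit(self, change_id, description):
        # Record a change locally; no other repository is contacted.
        self.changes[change_id] = description

    def missing_from(self, other):
        # Ids of changes that `other` has and this repository lacks.
        return sorted(set(other.changes) - set(self.changes))

    def pull(self, other):
        # Copy over only the changes we do not already have,
        # and report which ones were new.
        new_ids = self.missing_from(other)
        for change_id in new_ids:
            self.changes[change_id] = other.changes[change_id]
        return new_ids


alice = Repo("alice")
bob = Repo("bob")
alice.commit("a1", "fix typo in README")
bob.commit("b1", "add new parser")

print(alice.pull(bob))  # alice receives bob's change
print(alice.pull(bob))  # nothing new the second time
```

Because a pull only ever looks at the difference between two sets, it doesn’t matter who you pull from or in what order, and that is what lets the repositories be wired together in any shape.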
There’s something fresh and self-empowering about this model, because it’s a superset of CVS and Subversion’s traditional single-repository model. An open source project can decide that exactly one repository is the Master, and expect all participants to push and pull changes from that master repository as needed. Of course, a project can also organize itself into more interesting shapes: a tree-like hierarchy of repositories, a ring of repositories, or even just a randomly connected graph. It’s tremendously flexible.
Proponents of these systems tend to be a bit fanatical about their “superiority” over today’s centralized systems. Over and over, I hear testimonials like this:
“It’s great! If I want to implement a new feature, I don’t need to have commit access at all. I have my own private copy of the repository, so I can write the whole thing by myself. I commit changes to my private repository as often as I want until it’s all finished. Then I can present the whole thing to everyone else.”
This user is describing a great convenience, but I view it in a slightly darker light. Notice what this user is now able to do: he can crawl off into a cave, work for weeks on a complex feature by himself, then present it as a polished result to the main codebase. And this is exactly the sort of behavior that I think is bad for open source communities. Open source communities need to work together. They need to agree on common goals, discuss designs, and constantly review each other’s work.
In the Subversion community, we call the behavior above “dropping a bomb”. It’s considered anti-social and anti-cooperative. Usually the new feature is so big and complex that it’s nearly impossible to review. If it’s hard to review, then it’s hard to accept into the main codebase, hard to maintain code quality, and hard for anyone but the original author to maintain the feature. When this happens, we typically scold the person(s) for not working out in the open.
Good Behavior, on the other hand, involves coming to the community with a design proposal. After some discussion, we either (1) ask the developer(s) to submit a series of patches as work progresses, or (2) give them a private branch to work on. They needn’t have commit access to the core code; a branch is all that’s needed. That way the larger community can review the smaller commits as they come in, discuss, give feedback, and keep the developers in the loop. The main goal here is never to be surprised by some huge code change. It keeps the community focused on common goals and aware of each other’s progress.
So while most people say, “Isn’t it great that I can fork the whole project without anyone knowing!”, my reaction is, “Yikes, why aren’t you working with everyone else? Why aren’t you asking for commit access?” This is a problem that is solved socially: projects should actively encourage side-work by small teams and grant access to private branches early and often.
I probably sound like Mr. Anti-Distributed-Version-Control, but I’m really not. It’s definitely cool, and definitely convenient. I just think it’s something that needs to be used very carefully, because the very conveniences it provides also promote fragmentary social behaviors that aren’t healthy for open source communities.
For more on this subject, see this essay by Greg Hudson — it was the writing which originally had my head nodding on this topic. Also relevant is Karl Fogel’s excellent new book, Producing Open Source Software. It’s all about managing and promoting healthy open source developer communities.