Mercurial on Google Code

Friday, 24 April, 2009

You’re not actually surprised are you? 🙂

Read the official blog post for details.

But yes, this is the project I’ve been leading for the last 9 months. I haven’t written any code, but instead it’s been my first chance to really be a ‘tech lead’ (translation: manager) for some truly brilliant programmers on my team. The mercurial-on-bigtable implementation is top-notch.

Note that the feature isn’t finished yet — lots of missing things, lots of bugs to fix still. We’ve not yet fully launched to the public. But you can sign up to be an ‘invited tester’ (if you’re willing to give us feedback), and meanwhile we’ll continue to finish the feature in the public view.

a Mercurial “super client”

Tuesday, 14 October, 2008

One of the cool trends I’ve seen is the use of distributed version control systems as “super clients” against Subversion. You suck down the entire history of a Subversion repository into a local, private repository, do all of your commits locally, make branches, experiment all you want, then “push” back up to Subversion again. On the internet, nobody knows you’ve been using DVCS (or that you’re a dog.) What’s particularly cool about these bridging tools is that they allow users to try out DVCS before deciding to officially convert a whole project over. Or, if a project happens to be using Subversion but you still want most of the power of a DVCS for local work, it’s a perfect solution.

For all the blabbing I’ve done about distributed version control systems, I’m still a big fan of Mercurial. Of all the DVCSes, I think it’s the easiest to learn for svn users. It has a small, tight set of commands, and the community which runs the project is polite and sane.

In any case, there have been a collection of Mercurial-Subversion bridges available for the last couple of years, but they’ve all been deficient in various ways: either not capturing svn history entirely, or being unable to push back to svn correctly (or only very awkwardly). So I’ve pretty much stayed away. But today I want to plug a new bridge written by a friend of mine (Augie Fackler) who finally did it Right: he wrote a bridge called hgsubversion which (1) uses the actual Subversion API to pull history down (which is faster, more accurate, and long-term sustainable), and (2) actually knows how to push changes back to Subversion correctly. I want the world to be aware of this tool, because I think it’s the first Mercurial-Subversion bridge which deserves to be promoted into the popular ranks with tools like git-svn.

The tool is still young and not generally installable by the public (i.e. you’re not going to find any magic .rpm, .deb, .zip or .dmg for it yet)… but here are my cliff notes if you want to start playing with it.

Requirements

  • The latest (unreleased) Mercurial
  • Local Subversion libraries, at least 1.5, with swig-python bindings built
  • A Subversion server that is 1.4 or later

To get the latest Mercurial:

$ hg clone http://selenic.com/repo/hg hg-latest
$ cd hg-latest
$ make
$ sudo make install

To get the latest Subversion python bindings:

$ # if you don't have a binary package for svn-1.5-python-bindings already,
$ # this is a summary of subversion/bindings/swig/INSTALL instructions:
$ svn checkout http://svn.collab.net/repos/svn/tags/1.5.3 svn
$ cd svn
$ ./autogen.sh && ./configure
$ make
$ sudo make install
$ make swig-py # make sure you have swig 1.3 installed already
$ make check-swig-py
$ sudo make install-swig-py

To get hgsubversion:

$ hg clone http://bitbucket.org/durin42/hgsubversion/ ~/hgsubversion
$ cat >> ~/.hgrc
[extensions]
rebase=
svn=/home/user/hgsubversion
^D
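
A quick way to confirm that Mercurial actually loaded the extension is to ask for help on one of the commands hgsubversion adds (just a suggestion; any command the extension provides will do):

$ hg help svnclone

If the path in ~/.hgrc is right, this prints the command’s usage; an “unknown command” error means Mercurial couldn’t load the extension.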

To make sure you’re ready to go, do a final sanity check:


$ python -c "import svn.core; print svn.core.SVN_VER_MINOR"
5
$ # if you get something less than 5, you may have conflicting
$ # versions installed, and may need to set PYTHONPATH
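
If you do end up needing PYTHONPATH, point it at wherever 'make install-swig-py' placed the bindings. The path below is only an example (check your own install), then re-run the sanity check:

$ export PYTHONPATH=/usr/local/lib/svn-python:$PYTHONPATH
$ python -c "import svn.core; print svn.core.SVN_VER_MINOR"
5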

Now you can clone your favorite svn repository, and use it locally:


$ hg svnclone http://svn.example.com/repos hg-repos
converting r1 by joe
A trunk/
committed as 24dfb7b51d606a921333e2b8f19a9a6aa5661a69 on branch default
[...]
converting r100 by jane
M trunk/goo
M trunk/moo
committed as 54dfb7b51d6d6a931333e2b8f19a9a6005661a62 on branch default
$

The tool currently assumes a ‘standard’ svn layout of /trunk, /branches, /tags, and then tries to pull them into sane mercurial equivalents. After you’ve made a bunch of local commits, you can push the changes back to subversion:


$ # First merge the latest public svn changes into your repository:
$ hg svn pull
converting r101 by pinky
A trunk/awesomefile
committed as e85afd44dc83d5df2599157096a95b0868de6955 on branch default
$ hg up -C
3 files updated, 0 files merged, 1 files removed, 0 files unresolved
$ # We now have two hg HEADs, because the public svn changes are
$ # considered a different line of development from my own.
$ # For now, rebasing is preferred to merging:
$ hg up -C 9985f017b3ab
3 files updated, 0 files merged, 1 files removed, 0 files unresolved
$ hg svn rebase
saving bundle to /home/user/project/.hg/strip-backup/c4b9dfce6b09-temp
adding branch
adding changesets
adding manifests
adding file changes
added 4 changesets with 4 changes to 4 files
rebase completed
$ # At the moment, each changeset is pushed separately;
$ # changeset flattening not yet implemented
$ hg svn push
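
Once you’re comfortable with the cycle, the whole round trip chains together nicely. This assumes your working copy is already updated to your own head (otherwise do the 'hg up -C' dance shown above first), and any failure stops the chain:

$ hg svn pull && hg svn rebase && hg svn push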

Subversion moving to the Apache Software Foundation

Thursday, 5 November, 2009

It’s no longer a secret, but now a public press release.

Not that this should shock anybody, but in case you didn’t know, now you do. The overlap between the Apache and Subversion communities has been huge since day one — with essentially identical cultures. We’ve talked about doing this for years. It means we can finally dissolve the ‘Subversion corporation’ and let the ASF handle all our finances and legal needs.

“Why didn’t this happen sooner? Why now?”, you may ask. There are several answers.

First, the intellectual property was scattered. CollabNet owned a huge chunk of it, but so did other corporations and a large handful of other random volunteers from the internet. The ASF requires software grants to join, and we didn’t yet have all our eggs in one basket.

Second, when the Subversion project first developed legal needs a few years ago — and also started receiving money from Google’s Summer of Code — it was relatively easy to set up our own non-profit. It gave us a place for money to live, and an entity to defend the Subversion trademark from a number of abusive third parties.

But over time, running our own non-profit turned out to be an awkward time suck. So about a year ago I started focusing on collecting Contributor License Agreements (CLAs) from both individuals and corporations, including CollabNet itself. Once the IP was all concentrated in the Subversion Corporation, it freed us up to move to the ASF and dump all of the bureaucracy on them. 🙂

So this announcement is also a bit of a point of pride for myself. I’ve long stopped working on Subversion code, but I wanted to make sure the project was parked in a good place before I could really walk away guilt-free. I now feel like my “work is done”, and that the ASF will be an excellent long-term home for the project. This is exactly what the ASF specializes in: being a financial and legal umbrella for a host of communities over the long haul. The project is in excellent hands now.

Of course, CollabNet has always been the main supplier of “human capital” for the project in terms of full-time programmers writing code, and that’s not going to change as far as I can see. CollabNet deserves huge kudos for the massive financial investment (and risk) in funding this project for nearly 10 years, and it seems clear they’re going to continue to be the “center” of project direction and corporate support for years to come. And this pattern isn’t uncommon either: the Apache HTTPD Server itself is mostly made up of committers working on behalf of interested corporations.

What’s interesting to me, however, are all the comments on the net about how this is a “death knell” for Subversion — as though the ASF were some sort of graveyard. That seems like a very typical viewpoint from the open source universe — mistaking mature software like Apache or Subversion (or anything not new and shiny) for “old and crappy”. In my opinion, the open source world seems to ignore the other 90% of programmers working in tiny software shops that utterly rely on these technologies as foundational. Even though I’ve become a Mercurial user myself, I can assure you that these other products aren’t going away anytime soon!

Hm. I smell another talk here.

Git.

Tuesday, 1 September, 2009

(Apologies to the original poster. I noticed the ‘name’ line of the git manpage today and got inspired.)


From: sussman@red-bean.com (Ben Collins-Sussman)
Sender: cooks@red-bean.com
Subject: The True Path (long)
Date: 01 Sep 09 03:17:31 GMT
Newsgroups: alt.religion.version-control

When I log into my SunOS 4.2 system with my 28.8kbps modem, both svn
*and* hg are just too damn slow. They print useless messages like,
"Type 'svn help' for usage" and "abort: There is no Mercurial
repository here". So I use the version control system that doesn't
waste my VALUABLE time.

git, man! !man git

GIT(7) Git Manual GIT(7)

NAME
git - the stupid content tracker

SYNOPSIS
git [--version] [--exec-path[=GIT_EXEC_PATH]]
[-p|--paginate|--no-pager]
[--bare] [--git-dir=GIT_DIR] [--work-tree=GIT_WORK_TREE]
[--help] COMMAND [ARGS]

DESCRIPTION
Git is a fast, scalable, distributed revision control
system with an unusually rich command set that provides both
high-level operations and full access to internals.

---

Computer Scientists love git, not just because it comes first
alphabetically, but because it's stupid. Everyone else loves git
because it's GIT!

"Git is the stupid content tracker."

And git doesn't waste space on my Newton MessagePad. Just look:

-rwxr-xr-x 1 root 24 Oct 29 2009 /bin/git
-rwxr-xr-t 4 root 1310720 Jan 1 2005 /usr/bin/hg
-rwxr-xr-x 1 root 5.89824e37 Oct 22 2001 /usr/local/subversion/bin/svn

Of course, on the system *I* administrate, hg is symlinked to git.
svn has been replaced by a shell script which 1) Generates a syslog
message at level LOG_EMERG; 2) reduces the user's disk quota by 10GB;
and 3) RUNS GIT!!!!!!

"Git is the stupid content tracker."

Let's look at a typical novice's session with the mighty git:

$ git add *
fatal: Not a git repository

$ git checkout
fatal: Not a git repository
Failed to find a valid git directory.

$ git git
git: 'git' is not a git-command. See 'git --help'.

$ git --help

$ git over here
git: 'over' is not a git-command. See 'git --help'.

$ git "eat flaming death"

---
Note the consistent user interface and error reportage. Git is
generous enough to flag errors and pack repositories as dense as
neutron stars, yet prudent enough not to overwhelm the novice with
useless details. If users REALLY want to know what git commands are
available, a simple 'man git' will reveal them all, sheer genius
in its simplicity:

git-add(1)
git-am(1)
git-archive(1)
git-bisect(1)
git-branch(1)
git-bundle(1)
git-checkout(1)
git-cherry-pick(1)
git-citool(1)
git-clean(1)
git-clone(1)
git-commit(1)
git-describe(1)
git-diff(1)
git-fetch(1)
git-format-patch(1)
git-gc(1)
git-grep(1)
git-gui(1)
git-init(1)
git-log(1)
git-merge(1)
git-mv(1)
git-pull(1)
git-push(1)
git-rebase(1)
git-reset(1)
git-revert(1)
git-rm(1)
git-shortlog(1)
git-show(1)
git-stash(1)
git-status(1)
git-submodule(1)
git-tag(1)
gitk(1)
git-config(1)
git-fast-export(1)
git-fast-import(1)
git-filter-branch(1)
git-lost-found(1)
git-mergetool(1)
git-pack-refs(1)
git-prune(1)
git-reflog(1)
git-relink(1)
git-remote(1)
git-repack(1)
git-repo-config(1)
git-annotate(1)
git-blame(1)
git-cherry(1)
git-count-objects(1)
git-fsck(1)
git-get-tar-commit-id(1)
git-help(1)
git-instaweb(1)
git-merge-tree(1)
git-rerere(1)
git-rev-parse(1)
git-show-branch(1)
git-verify-tag(1)
git-whatchanged(1)
git-archimport(1)
git-cvsexportcommit(1)
git-cvsimport(1)
git-cvsserver(1)
git-imap-send(1)
git-quiltimport(1)
git-request-pull(1)
git-send-email(1)
git-svn(1)
git-apply(1)
git-checkout-index(1)
git-commit-tree(1)
git-hash-object(1)
git-index-pack(1)
git-merge-file(1)
git-merge-index(1)
git-mktag(1)
git-mktree(1)
git-pack-objects(1)
git-prune-packed(1)
git-read-tree(1)
git-symbolic-ref(1)
git-unpack-objects(1)
git-update-index(1)
git-update-ref(1)
git-write-tree(1)
git-cat-file(1)
git-diff-files(1)
git-diff-index(1)
git-diff-tree(1)
git-for-each-ref(1)
git-ls-files(1)
git-ls-remote(1)
git-ls-tree(1)
git-merge-base(1)
git-name-rev(1)
git-pack-redundant(1)
git-rev-list(1)
git-show-index(1)
git-show-ref(1)
git-tar-tree(1)
git-unpack-file(1)
git-var(1)
git-verify-pack(1)
git-daemon(1)
git-fetch-pack(1)
git-send-pack(1)
git-update-server-info(1)
git-http-fetch(1)
git-http-push(1)
git-parse-remote(1)
git-receive-pack(1)
git-shell(1)
git-upload-archive(1)
git-upload-pack(1)
git-check-attr(1)
git-check-ref-format(1)
git-fmt-merge-msg(1)
git-mailinfo(1)
git-mailsplit(1)
git-merge-one-file(1)
git-patch-id(1)
git-peek-remote(1)
git-sh-setup(1)
git-stripspace(1)

"Git is the stupid content tracker."

Git, the greatest WYGIWYG revision control system of all.

GIT IS THE TRUE PATH TO NIRVANA! GIT HAS BEEN THE CHOICE OF EDUCATED
AND IGNORANT ALIKE FOR CENTURIES! GIT WILL NOT CORRUPT YOUR PRECIOUS
BODILY FLUIDS!! GIT IS THE STUPID CONTENT TRACKER! GIT MAKES THE SUN
SHINE AND THE BIRDS SING AND THE GRASS GREEN!! GIT WAS HANDED DOWN TO
US FROM LINUS UPON THE MOUNTAIN, AND LINUX USERS SHALL NOT WORSHIP ANY
OTHER TRACKER!

When I use a version control system, I don't want eight extra
MEGABYTES of worthless HTTP protocol support. I just want to GIT on
with my coding! I don't want to subvert away or mercurialize!
Those aren't even WORDS!!! GIT! GIT! GIT IS THE STUPID!!!

CONTENT TRACKER.

When Linus, in his ever-present omnipotence, needed to base his patch
juggling habits on existing tools, did he mimic svn? No. Hg? Surely
you jest. He created the most karmic version tracker of all. The
stupid one.

Git is for those who can *remember* what project they are working on.
If you are an idiot, you should use subversion. If you are
subversive, you should not be mercurial. If you use GIT, you are on
THE PATH TO REDEMPTION. THE SO-CALLED "FRIENDLY" SCM SYSTEMS HAVE
BEEN PLACED HERE BY GIT TO TEMPT THE FAITHLESS. DO NOT GIVE IN!!! THE
MIGHTY LINUS HAS SPOKEN!!!

Programmer Insecurity

Thursday, 12 June, 2008

I’ve got a lot to say today!

I want to chat about something that I’ve never noticed before, but probably should have. There’s always been a stereotype out there of programmers being nerdy, anti-social people (Q: How do you know when an engineer is outgoing? A: He looks at your shoes!). But my revelation of the week is that most programmers seem to be really insecure about their work. I mean: really, really insecure.

My buddy Fitz and I have long preached about best practices in open source software development — how one should be open and transparent with one’s work, accept code reviews, give constructive criticism, and generally communicate as actively as possible with peers. One of the main community “anti-patterns” we’ve talked about is people writing “code bombs”. That is, what do you do when somebody shows up to an open source project with a gigantic new feature that took months to write? Who has the time to review thousands of lines of code? What if there was a bad design decision made early in the process — does it even make sense to point it out? Dropping code-bombs on communities is rarely good for the project: the team is either forced to reject it outright, or accept it and deal with a giant opaque blob that is hard to understand, change, or maintain. It moves the project decidedly in one direction without much discussion or consensus.

And yet over and over, I’m gathering stories that point to the fact that programmers do not want to write code out in the open. Programmers don’t want their peers to see mistakes or failures. They want to work privately, in a cave, then spring “perfect” code on their community, as if no mistakes had ever been made. I don’t think it’s hubris so much as fear of embarrassment. Rather than think of programming as an inherently social activity, most coders seem to treat it as an arena for personal heroics, and will do anything to protect that myth. They’re fine with sharing code, as long as they present themselves as infallible, it seems. Maybe it’s just human nature.

Check out some of these stories I’ve collected:

  • Requests at the Google I/O booth: A couple of weeks ago when my team was at the Google I/O conference, we ran a booth demonstrating our Open Source Project Hosting service. Over and over, we kept getting requests like this:

    “Can you guys please give subversion on Google Code the ability to hide specific branches?”

    “Can you guys make it possible to create open source projects that start out hidden to the world, then get ‘revealed’ when they’re ready?”

    Translation: “I don’t want people to see my work-in-progress until it’s perfect.”

  • Requests on the Google Code mailing list: Sometimes users need their googlecode.com svn repositories wiped clean. Legitimate reasons include the accidental commit of sensitive data, or the need to load code history in from a different svn repository. But most of the time we get (invalid) requests like this:

    “Hi, I want to rewrite all my code from scratch, can you please wipe all the history?”

    Translation: “I don’t want people to be able to find my old code, it’s too embarrassing.” Call it vanity, call it insecurity… the bottom line is that coders want prior mistakes or failures to be erased from history.

  • Code-reviews taken as personal attacks. Fitz tells a funny anecdote about a friend of his who went from the open source world to a corporate software job. Vastly paraphrased:

    During his first week, he started emailing friendly code reviews to each of his coworkers, receiving strange stares in turn. Eventually his boss called him into his office:

    “You know, you really need to stop with the negative energy. Your peers say that you’re constantly criticizing everything they do.”

    Moral: not only is code review not the norm in corporate environments, most programmers are unable to separate their fragile egos from the code they write. Repeat after me: you are not your code!

  • Distributed version control — in a cave. A friend of mine works on several projects that use git or mercurial. He gave me this story recently. Basically, he was working with two groups on a project. One group published changes frequently…

    “…and as a result, I was able to review consistently throughout the semester, offering design tweaks and code reviews regularly. And as a result of that, [their work] is now in the mainline, and mostly functional. The other group […] I haven’t heard a peep out of for 5 months. Despite many emails and IRC conversations inviting them to discuss their design and publish changes regularly, there is not a single line of code anywhere that I can see it. […] Last weekend, one of them walked up to me with a bug […] and I finally got to see the code to help them debug. I failed, because there are about 5000 lines of crappy code, and just reading through a single file I pointed out two or three major design flaws and a dozen wonky implementation issues. I had admonished them many times during these 5 months to publish their changes, so that we (the others) could take a look and offer feedback… but each time met with stony silence. I don’t know if they were afraid to publish it, or just don’t care. But either way, given the code I’ve seen, the net result is 5 wasted months.”

    Before you scream: yes, yes, I know that the potential for cave-hiding and writing code bombs exists with a centralized version control system like Subversion, too. But my friend has an interesting point:

    “I think this failure is at least partially due to the fact that [DVCS] makes it so damn easy to wall yourself into a cave. Had we been using svn, I think the barrier to caving would have been too high, and I’d have seen the code.”

    In other words, yes, this was fundamentally a social problem. A team was embarrassed to share code. But because they were using distributed version control, it gave them a sense of false security. “See, we’re committing changes to our repository every day… making progress!” If they had been using Subversion, it’s much less likely they would have sat on a 5000 line patch in their working copy for 5 months; they would have had to share the work much earlier. Moral: even though one shouldn’t depend on technical solutions to social problems, default tool behaviors matter a lot. This was my main theme way back when I wrote about the risks of distributed version control.

OK, so what’s the conclusion here? People are scared of sharing their unfinished work, plain and simple. I know this isn’t headline news to most people, but I really think I’ve been in deep denial about this. I’m so used to throwing my creative output up for constant criticism, that I simply expect everyone else to do it as well. I think of it as the norm, and I can’t comprehend why someone wouldn’t want to do that… and yet clearly, the growing popularity of distributed version control shows just how thrilled people are to hide their work from each other. It’s the classic “testimonial” for systems like git (taken from a blog comment):

“Don’t tell me I should cooperate with other people at the beginning and publish my modification as early as possible. I do cooperate with other people but I do want to do some work alone sometimes.”

Hm, okay. Please just don’t work alone for too long!

Sidetracking a wee bit, I think this is why I prefer Mercurial over Git, having done a bit of research and reading on both systems. Git leans much more heavily towards cave-hiding, and I don’t like that. For example, the ‘git rebase’ command is a way of effectively destroying an entire line of history: very powerful, sure, but it’s also a way of erasing your tracks. Rather than being forced to merge your branch into a parent line, just pretend that your branch was always based on the latest parent line! Another example: when it comes to pushing and pulling changesets, Mercurial’s default behavior is to exchange all history with the remote repository, while git’s default behavior is to only push or pull a single branch — presumably one that the user has deemed fit for sharing with the public. In other words, git defaults to all work being private cave-work, and is happy to destroy history. Mercurial shares everything by default, and cannot erase history.
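
To put the contrast in concrete terms, here are the two behaviors side by side (illustrative commands only, not a transcript from either tool’s documentation):

$ hg push                  # exchanges all outstanding history with the remote
$ git push origin master   # publishes only the branch you name
$ git rebase master        # replays your commits onto the tip of master,
                           # abandoning the original line of history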

I know this post has been long, but let me stand on my soapbox for a moment.

Be transparent. Share your work constantly. Solicit feedback. Appreciate critiques. Let other people point out your mistakes. You are not your code. Do not be afraid of day-to-day failures — learn from them. (As they say at Google, “don’t run from failure — fail often, fail quickly, and learn.”) Cherish your history, both the successes and mistakes. All of these behaviors are the way to get better at programming. If you don’t follow them, you’re cheating your own personal development.

Phew, I feel better now.

(This page has been translated into Spanish language by Maria Ramos.)

Subversion 1.5 merge-tracking in a nutshell

Saturday, 10 May, 2008

As I’ve mentioned in other posts, the Subversion project is on the verge of releasing version 1.5, a culmination of nearly two years of work. The release is jam-packed with some huge new features, but the one everyone’s excited about is “merge tracking”.

Merge-tracking is when your version control system keeps track of how lines of development (branches) diverge and re-form together. Historically, open source tools such as CVS and Subversion haven’t done this at all; they’ve relied on “advanced” users carefully examining history and typing arcane commands with just the right arguments. Branching and merging is possible, but it sure ain’t easy. Of course, distributed version control systems have now started to remove the fear and paranoia around branching and merging—they’re actually designed around merging as a core competency. While Subversion 1.5 doesn’t make merging as easy as a system like Git or Mercurial does, it certainly solves common points of pain. As a famous quote goes, “it makes easy things easy, and hard things possible.” Subversion is now beginning to match features in larger, commercial tools such as ClearCase and Perforce.

My collaborators and I are gearing up to release a 2nd Edition of the free online Subversion book soon (and you should be able to buy it from O’Reilly in hardcopy this summer.) If you want gritty details about how merging works, you can glance over Chapter 4 right now, but I thought a “nutshell” summary would make a great short blog post, just to show people how easy the common case now is.

  1. Make a branch for your experimental work:

    $ svn cp trunkURL branchURL
    $ svn switch branchURL

  2. Work on the branch for a while:

    # ...edit files
    $ svn commit
    # ...edit files
    $ svn commit

  3. Sync your branch with the trunk, so it doesn’t fall behind:

    $ svn merge trunkURL
    --- Merging r3452 through r3580 into '.':
    U button.c
    U integer.c
    ...

    $ svn commit

  4. Repeat the prior two steps until you’re done coding.
  5. Merge your branch back into the trunk:

    $ svn switch trunkURL
    $ svn merge --reintegrate branchURL
    --- Merging differences between repository URLs into '.':
    U button.c
    U integer.c
    ...

    $ svn commit

  6. Go have a beer, and live in fear of feature branches no more.

Notice how I never had to type a single revision number in my example: Subversion 1.5 knows when the branch was created, which changes need to be synced from trunk to branch, and which changes need to be merged back into the trunk when I’m done. It’s all magic now. This is how it should have been in the first place. 🙂
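
The “magic” is really just a versioned property: 1.5 records already-merged revisions in svn:mergeinfo on the merge target, which is how it computes what’s left to merge. You can inspect it yourself on the branch; the range below matches the sync from step 3:

$ svn propget svn:mergeinfo .
/trunk:3452-3580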

Subversion 1.5 isn’t officially released yet, but we’re looking for people to test one of our final release candidate source tarballs. CollabNet has also created some nice binary packages for testing, as part of their early adopter program. Try it out and report any bugs!

Subversion’s Future?

Tuesday, 29 April, 2008

According to Google Analytics, one of the most heavily trafficked posts on my blog is the one I wrote years ago, the Risks of Distributed Version Control. It’s full of a lot of semi-angry comments about how wrong I am. I thought I would follow up to that post with some newer thoughts and news.

I have to say, after using Mercurial for a bit, I think distributed version control is pretty neat stuff. As Subversion tests a final release candidate for 1.5 (which features limited merge-tracking abilities), there’s a bit of angst going on in the Subversion developer community about what exactly the future of Subversion is. Mercurial and Git are everywhere, getting more popular all the time (certainly among the 20% trailblazers). What role does Subversion — a “best of breed” centralized version control system — have in a world where everyone is slowly moving to decentralized systems? Subversion has clearly accomplished the mission we established back in 2000 (“to replace CVS”). But you can’t hold still. If Subversion doesn’t have a clear mission going into the future, it will be replaced by something shinier. It might be Mercurial or Git, or maybe something else. Ideally, Subversion would replace itself. 🙂 If we were to design Subversion 2.0, how would we do it?

Last week one of our developers wrote an elegant email that summarizes a potential new mission statement very well. You should really read the whole thing here. Here’s a nice excerpt:

I'm pretty confident that, for a new open source project of non-huge
size, I would not choose Subversion to host it [...]
 
So does that mean Subversion is dead? That we should all jump ship
and just write a new front-end for git and make sure it runs on
windows?

Nah. Centralized version control is still good for some things:

* Working on huge projects where putting all of the *current* source
  code on everyone's machine is infeasible, let alone complete
  history (but where atomic commits across arbitrary pieces of the
  project are required).
* Read authorization! A client/server model is pretty key if you
  just plain aren't allowed to give everyone all the data. (Sure,
  there are theoretical ways to do read authorization in distributed
  systems, but they aren't that easy.)

My opinion? The Subversion project shouldn't spend any more time
trying to make Subversion a better version control tool for non-huge
open source projects. Subversion is already decent for that task, and
other tools have greater potential than it. We need to focus on
making Subversion the best tool for organizations whose users need to
interact with repositories in complex ways[...]

I’ve chatted with other developers, and we’ve all come to some similar private conclusions about Subversion’s future. First, we think that this will probably be the “final” centralized system that gets written in the open source world — it represents the end-of-the-line for this model of code collaboration. It will continue to be used for many years, but specifically it will gain huge mindshare in the corporate world, while (eventually) losing mindshare to distributed systems in the open-source arena. Those of us living in the open source universe really have a skewed view of reality. From where we stand, it may seem like “everyone’s switching to git”, but then when you look at a graph like the one below (which shows all public (not private!) Apache Subversion servers discoverable on the internet), you can see that Subversion isn’t anywhere near “fading away”. Quite the opposite: its adoption is still growing quadratically in the corporate world, with no sign of slowing down. This is happening independently of open source trailblazers losing interest in it. It may end up becoming a mainly “corporate” open source project (that is, all development funded by corporations that depend on it), but that’s a fine way for a piece of mature software to settle down. 🙂

Version Control and the… Long Gradated Scale

Tuesday, 27 November, 2007

My previous post about version control and the 80% deserves a follow-up post, mainly because it caused such an uproar, and because I don’t want people to think I’m an ignorant narcissist. Some people agreed with my post, but a huge number of people took offense at my gross generalizations. I’ve seen endless comments on my post (as well as the supporting post by Jeff Atwood) where people are either trying to decide if they’re in the “80%” or in the “20%”, or are calling foul on the pompous assertion that everyone fits into those two categories.

So let me begin by apologizing. It’s all too easy to read the post and think that my thesis is “80% of programmers are stupid mouth-breathing followers, and 20% are cool smart people like me.” Obviously, I don’t believe that. 🙂 Despite the disclaimer at the top of the post (stating that I was deliberately making “oversimplified stereotypes” to illustrate a point), the writing device wasn’t worth it; I simply offended too many people. The world is grey, of course, and every programmer is different. Particular interests don’t make you more or less “20%”, and it’s impossible to point to a team of coders within an organization and make ridiculous statements like “this team is clearly a bunch of dumb 80% people”. Nothing is ever so clear cut as that.

And yet, despite the fact that we’re all unique and beautiful snowflakes, we all have some sort of vague platonic notion of the “alpha geek”. Over time, I’ve come to my own sort of intuition about identifying the degree to which someone is an alpha-geek. I read a lot of resumes and interview a huge number of engineering candidates at work, and the main question I ask myself after the interview is: “if this person were independently wealthy and didn’t need a job at all, would they still be writing software for fun?” In other words, does the person have an inherent passion for programming as an art? That’s the sort of thing that leads to {open-source participation, writing lisp compilers, [insert geeky activity here]}. This is the basis for my super-exaggerated 80/20 metaphor in my prior post, and hopefully a less offensive way of describing it.

That said, my experience with the software industry is that the majority of people who write software for a living do not have a deep passion for the craft of programming, and don’t do it for fun. They consume and use tools written by other people, and the tools need to be really user-friendly before they get adopted. As others have pointed out, they need to just work out of the box. The main point I was trying to make was that distributed version control systems (DVCS) haven’t reached that friendliness point yet, and Subversion is only just starting to reach that level (thanks to clients like TortoiseSVN). I subscribe to a custom Google Alert about my corner of the software world, meaning that anytime Google finds a new web page that mentions Subversion or version control, I get notified about it. You would be simply astounded at the number of new blog posts I see every day that essentially say “Hey, maybe our team should start using version control! Subversion seems pretty usable, have you tried it yet?” I see close to zero penetration of DVCS into this world: that’s the next big challenge for DVCS as it matures.

Others have pointed out that while I scream for DVCS evangelists not to thoughtlessly trash centralized systems like Subversion, I’m busy thoughtlessly trashing DVCS! I certainly hope this isn’t the case; I’ve used Mercurial a bit here and there, and perhaps my former assertions are simply based on old information. I had previously complained that most DVCS systems don’t run on Windows, don’t have easy access control, and don’t have nice GUI clients. Looking at wikipedia, I sure seem to be wrong. 🙂

Version Control and “the 80%”

Tuesday, 16 October, 2007

11/17/07: Before posting an angry comment about this post, please see the follow-up post!

Disclaimer: I’m going to make some crazy sweeping generalizations — ones which are based on my 12 years of observing the software development industry. I’m aware that I’m drawing some oversimplified stereotypes, but I think most of my peers who work in this industry will nod their head at some point, able to see the grains of truth in my characterizations.

Two Types of Programmers

There are two “classes” of programmers in the world of software development: I’m going to call them the 20% and the 80%.

The 20% folks are what many would call “alpha” programmers — the leaders, trailblazers, trendsetters, the kind of folks that places like Google and Fog Creek software are obsessed with hiring. These folks were the first ones to install Linux at home in the 90’s; the people who write lisp compilers and learn Haskell on weekends “just for fun”; they actively participate in open source projects; they’re always aware of the latest, coolest new trends in programming and tools.

The 80% folks make up the bulk of the software development industry. They’re not stupid; they’re merely vocational. They went to school, learned just enough Java/C#/C++, then got a job writing internal apps for banks, governments, travel firms, law firms, etc. The world usually never sees their software. They use whatever tools Microsoft hands down to them — usually VS.NET if they’re doing C++, or maybe a GUI IDE like Eclipse or IntelliJ for Java development. They’ve never used Linux, and aren’t very interested in it anyway. Many have never even used version control. If they have, it’s only whatever tool shipped in the Microsoft box (like SourceSafe), or some ancient thing handed down to them. They know exactly enough to get their job done, then go home on the weekend and forget about computers.

Shocking statement #1: Most of the software industry is made up of 80% programmers. Yes, most of the world is small Windows development shops, or small firms hiring internal programmers. Most companies have a few 20% folks, and they’re usually the ones lobbying against pointy-haired bosses to change policies, or upgrade tools, or to use a sane version-control system.

Shocking statement #2: Most alpha-geeks forget about shocking statement #1. People who work on open source software, participate in passionate cryptography arguments on Slashdot, and download the latest GIT releases are extremely likely to lose sight of the fact that “the 80%” exists at all. They get all excited about the latest Linux distro or AJAX toolkit or distributed SCM system, spend all weekend on it, blog about it… and then are confounded about why they can’t get their office to start using it.

I will be the first to admit that I completely lost sight of the 80% as well. When I was first hired by Collabnet to “design a replacement for CVS” back in 2000, my two collaborators and I were really excited. All the 20% folks were using CVS, especially for open source projects. We viewed this as an opportunity to win the hearts and minds of the open source world, and to especially attract the attention of all those alpha-geeks. But things turned out differently. When we finally released Subversion 1.0 in early 2004, guess what happened? Did we have flocks of 20% people converting open source projects to Subversion? No, actually, just a few small projects did that. Instead, we were overwhelmed with dozens of small companies tossing out Microsoft SourceSafe, and hundreds of 80% people flocking to our user lists for tech support.

Today, Subversion has gone from “cool subversive product” to “the default safe choice” for both 80% and 20% audiences. The 80% companies who were once using crappy version control (or no version control at all) are now blogging to one another — web developers giving “hot tips” to each other about using version control (and Subversion in particular) to manage their web sites at their small web-development shops. What was once new and hot to 20% people has finally trickled down to everyday-tool status among the 80%.

The great irony here (as Karl Fogel points out in one of his recent OSCON slides) is that Subversion was originally intended to subvert the open source world. It’s done that to a reasonable degree, but it’s proven far more subversive in the corporate world!

Enter Distributed Version Control

In 2007, Distributed Version Control Systems (DVCS) are all the rage among the alpha-geeks. They’re thrilled with tools like git, mercurial, bazaar-ng, darcs, monotone… and they view Subversion as a dinosaur. Bleeding-edge open source projects are switching to DVCS. Many of these early adopters come off as either incredibly pretentious and self-righteous (like Linus Torvalds!) or as obnoxious fanboys who love DVCS because it’s new and shiny.

And what’s not to love about DVCS? It is really cool. It liberates users, empowers them to work in disconnected situations, and makes branching and merging into trivial operations.

Shocking statement #3: No matter how cool DVCS is, anyone who tells you that DVCS is perfect for everyone is completely out of touch with reality.

Why? Because (1) DVCS has tradeoffs that are not appropriate for all teams, and (2) DVCS goes right over the heads of the 80%.

Let’s talk about tradeoffs first. While DVCS dramatically lowers the bar for participation in a project (just clone the repository and start making local commits!), it also encourages anti-social behavior. I already wrote a long essay about this (see The Risks of Distributed Version Control). In a nutshell: with a centralized system, people are forced to collaborate and review each other’s work; in a decentralized system, the default behavior is for each developer to privately fork the project. They have to put in some extra effort to share code and organize themselves into some sort of collaborative structure. Yes, I’m aware that a DVCS is able to emulate a centralized system; but defaults matter. The default action is to fork, not to collaborate! This encourages people to crawl into caves and write huge new features, then “dump” these code-bombs on their peers, at which point the code is unreviewable. Yes, best practices are possible with DVCS, but they’re not encouraged. It makes me nervous about the future of open source development. (Maybe the great liberation is worth it; time will tell.)

Second, how about all those 80% folks working in small Windows development shops? How would we go about deploying DVCS to them?

  • Most DVCS systems don’t run on Windows at all.
  • Most DVCS have no shell or GUI tool integrations; they’re command-line only.
  • Most 80% coders find TortoiseSVN full of new, challenging concepts like “update” and “commit”. They often struggle to use version control at all; are you now going to teach them the difference between “pull” and “update”, between “commit” and “push”? Look me in the eyes and say that with a straight face.
  • Corporations are inherently centralized entities. Not only is their power-structure centralized, but their shared resources are centralized as well.
    • Managers don’t want 20 different private forks of a codebase; they want one codebase that they can monitor all activity on.
    • Cloning a repository is bad for corporate security. Most corporations have an absolute need for access control on their code; sensitive intellectual property in specific parts of the repository is only readable/writeable by certain teams. No DVCS is able to provide fine-grained access control; the entire code history is sitting on local disk.
    • Cloning is often unscalable for corporations. Many companies have huge codebases — repositories which are dozens or even hundreds of gigabytes in size. When a new developer starts out, it’s simply a waste of time (and disk space) to clone a repository that big.
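To make the “pull vs. update, commit vs. push” point above concrete, here is the same everyday task in both models, as a side-by-side command sketch (URLs are hypothetical; this is an illustration, not a script):

```
# Centralized (Subversion): two concepts to learn
svn checkout http://example.com/repos/trunk    # get a working copy
# ...edit files...
svn update                                     # fetch others' changes
svn commit -m "fix bug"                        # publish your change, one step

# Distributed (Mercurial): each concept splits in two
hg clone http://example.com/hg/project         # copy the whole repository
# ...edit files...
hg commit -m "fix bug"                         # recorded locally only!
hg pull                                        # fetch others' changesets
hg update                                      # apply them to your working copy
hg push                                        # NOW your commits are published
```

An 80% user who has just barely internalized “update, edit, commit” now has to learn that “commit” no longer shares anything, and that fetching is itself two steps. That’s the conceptual tax the bullet above is pointing at.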

Again, I repeat the irony: Subversion was designed for open source geeks, but the reality is that it’s become much more of a “home run” for corporate development. Subversion is centralized. Subversion runs on Windows, both client and server. Subversion has fine-grained access control. It has an absolutely killer GUI (TortoiseSVN) that makes version control accessible to people who barely know what it is. It integrates with all the GUI IDEs like VS.NET and Eclipse. In short, it’s an absolutely perfect fit for the 80%, and it’s why Collabnet is doing so well in supporting this audience.

DVCS and Subversion’s Future

Most Subversion developers are well aware of the cool new ground being broken by DVCS, and there’s already a lot of discussion out there to “evolve” Subversion 2.0 in those directions. However, as Karl Fogel pointed out in a long email, the challenge before us is to keep Subversion simple, while still co-opting many of the features of DVCS. We will not forget about the 80%!

Subversion 1.5 is getting very close to a release candidate, and this fixes the long-standing DVCS criticism that “Subversion merging is awful”. Branching is still a constant-time operation, but you can now repeatedly merge one branch to another without searching history for the exact arguments you need. Subversion automatically keeps track of which changes you’ve merged already, and which still need merging. We even allow cherry-picking of changes. We’ve also got nice interactive conflict resolution now, so you can plug in your favorite Mercurial merging tool and away you go. A portable patch format is also coming soon.
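As a rough sketch of what that 1.5 merge-tracking workflow looks like from the command line (the repository URL, branch name, and revision number here are hypothetical, and the merge commands are run from inside a branch working copy):

```
# Create a feature branch (constant-time copy)
svn copy http://example.com/repos/trunk \
         http://example.com/repos/branches/feature \
         -m "create feature branch"

# Later, sync the branch with trunk. No hunting for revision ranges:
# the svn:mergeinfo property records what has already been merged.
svn merge http://example.com/repos/trunk

# Cherry-pick a single trunk change into the branch
svn merge -c 1234 http://example.com/repos/trunk

# See which trunk revisions are still eligible for merging
svn mergeinfo --show-revs eligible http://example.com/repos/trunk
```

The point of the sketch is the second merge: running the same plain `svn merge` command again later pulls in only the not-yet-merged changes, which is exactly the bookkeeping that pre-1.5 users had to do by hand.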

For Subversion 2.0, a few of us are imagining a centralized system, but with certain decentralized features. We’d like to allow working copies to store “offline commits” and manage “local branches”, which can then be pushed to the central repository when you’re online again. Our prime directive is to keep the UI simple, and avoid the curse of DVCS UIs (which often have 40, 50, or even 100 different commands!)

We also plan to centralize our working copy metadata into one place, which will make many client operations much faster. We may also end up stealing Mercurial’s “revlog” repository format as a replacement for the severely I/O-bottlenecked FSFS format.

A Last Plea

Allow me to make a plea to all the DVCS fanatics out there: yes, it’s awesome, but please have some perspective! Understand that all tools have tradeoffs and that different teams have different needs. There is no magic bullet for version control. Anyone who argues that DVCS is “the bullet” is either selling you something or utterly forgetting about the 80%. They need to pull their head out of Slashdot and pay attention to the rest of the industry.

Update, 10/18/07: A number of comments indicate that my post should have been clearer in some ways. It was never my intent to say that “Subversion is good enough for everyone” or that “most of the world is too dumb to use DVCS, so don’t use it.” Instead, I’m simply presenting a checklist — a list of obstacles that DVCS needs to overcome in order to be accepted into mainstream corporate software development. I have no doubt that DVCS systems will get there someday, and that will be a great thing. And I’m imploring DVCS evangelists to be aware of these issues, rather than running around thoughtlessly trashing centralized systems. 🙂

The Risks of Distributed Version Control

Thursday, 10 November, 2005

It’s funny how times change. When we started writing Subversion five years ago, CVS was the big evil beast that we were aiming to “subvert”. These days, while Subversion still has a long way to go in performance and features, it has reached critical mass in the open source world. The users@ list has thousands of subscribers and has become self-supporting. Every major free operating system ships with a relatively recent version of Subversion. There are several books in the bookstore about Subversion. Major projects like KDE, Apache, and GCC have switched to it, along with dozens of others. When you run across an open source project using Subversion these days, it’s no longer a novelty. It’s become the default safe choice for most new projects.

And now, lo and behold, a whole new generation of version control systems has appeared on the horizon: arch, codeville, monotone, bazaar-ng, svk, git, mercurial. These are the new kids in town — the Distributed Version Control systems — and they aim to unseat the establishment. Yes, you heard right: Subversion is now The Man. I wonder when that happened? 🙂

What makes this new generation of systems fundamentally different is that they take the idea of “disconnected operations” to the extreme. Every user has an entire copy of the repository — 100% of a project’s history — stored on the local computer. Each person is effectively an island unto themselves. Users connect their private repositories together in any way they wish and trade changes like baseball cards; the system automatically tracks which changes you have and which ones you don’t.

There’s something fresh and self-empowering about this model, because it’s a superset of CVS and Subversion’s traditional single-repository model. An open source project can decide that exactly one repository is the Master, and expect all participants to push and pull changes from that master repository as needed. Of course, a project can also organize itself into more interesting shapes: a tree-like hierarchy of repositories, a ring of repositories, or even just a randomly connected graph. It’s tremendously flexible.
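The “trading changes like baseball cards” idea can be sketched with a couple of peer repositories (the project URL and directory names here are hypothetical; this is an illustrative session, not a script):

```
# Two developers, each with a full private clone
hg clone http://example.com/hg/project alice
hg clone http://example.com/hg/project bob

# Alice commits locally; nothing leaves her repository yet
cd alice
# ...edit files...
hg commit -m "experiment with new parser"

# Bob can trade changes with Alice directly -- no central server required
cd ../bob
hg incoming ../alice      # changesets Alice has that Bob lacks
hg pull ../alice          # take them
hg outgoing ../alice      # changesets Bob has that Alice lacks
```

Because every repository knows which changesets it contains, any topology works: point everyone at one blessed repository and you’ve recreated the centralized model, or pull from each other directly and you get the hierarchy, ring, or arbitrary graph described above.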

Proponents of these systems tend to be a bit fanatical about their “superiority” over today’s centralized systems. Over and over, I hear testimonials like this:

“It’s great! If I want to implement a new feature, I don’t need to have commit access at all. I have my own private copy of the repository, so I can write the whole thing by myself. I commit changes to my private repository as often as I want until it’s all finished. Then I can present the whole thing to everyone else.”

This user is describing a great convenience, but I view it in a slightly darker light. Notice what this user is now able to do: he wants to crawl off into a cave, work for weeks on a complex feature by himself, then present it as a polished result to the main codebase. And this is exactly the sort of behavior that I think is bad for open source communities. Open source communities need to work together. They need to agree on common goals, discuss designs, and constantly review each other’s work.

In the Subversion community, we call the behavior above “dropping a bomb”. It’s considered anti-social and anti-cooperative. Usually the new feature is so big and complex, it’s nearly impossible to review. If it’s hard to review, then it’s hard to accept into the main codebase, hard to maintain code quality, and hard for anyone but the original author to maintain the feature. When this happens, we typically scold the person(s) for not working out in the open.

Good Behavior, on the other hand, involves coming to the community with a design proposal. After some discussion, we ask the developer(s) to either (1) submit a series of patches as work progresses, or (2) give him (or them) a private branch to work on. They needn’t have commit-access to the core code — a branch is all that’s needed. That way the larger community can review the smaller commits as they come in, discuss, give feedback, and keep the developers in the loop. The main goal here is never to be surprised by some huge code change. It keeps the community focused on common goals and aware of each other’s progress.

So while most people say, “isn’t it great that I can fork the whole project without anyone knowing!”, my reaction is, “yikes, why aren’t you working with everyone else? Why aren’t you asking for commit access?” This is a problem that is solved socially: projects should actively encourage side-work by small teams and grant access to private branches early and often.

I probably sound like Mr. Anti-Distributed-Version-Control, but I’m really not. It’s definitely cool, and definitely convenient. I just think it’s something that needs to be used very carefully, because the very conveniences it provides also promote fragmentary social behaviors that aren’t healthy for open source communities.

For more on this subject, see this essay by Greg Hudson — it was the writing which originally had my head nodding on this topic. Also relevant is Karl Fogel’s excellent new book, Producing Open Source Software. It’s all about managing and promoting healthy open source developer communities.