Distributed Version Control for the Other 80%

Ben Collins-Sussman, one of the key developers behind Subversion, argues in “Version Control and the 80%” that distributed version control will remain a niche interest and will not move into the mainstream (as his favorite tool certainly has). He offers a number of good reasons to back up this thesis.

I think he’s wrong. The “other 80%” are not profoundly stupid imbeciles who could never grasp the point of DVCS. Rather, they are, generally, working developers with important projects underway, for which they need tools that work well out of the box when used in the default way. DVCS tools can certainly do that. More specifically, the list of reasons he gives why DVCS won’t become broadly popular should be read as a to-do list of how to improve DVCS tools so they can become broadly popular. What the DVCS community needs is at least one DVCS which:

  • Installs easily on Windows, with a single installer, including diff/merge tool and GUI
  • Includes a very good standalone GUI
  • Secures client/server (peer-to-peer) communication by default, without user setup of SSH, HTTPS, etc.
  • Integrates well with Eclipse
  • Integrates well with Visual Studio
  • Integrates well with Explorer (i.e. TortoiseBlah)
  • Integrates, begrudgingly, with Microsoft’s SCC API so as to support the many tools which can use an SCC API plugin
  • Includes permission controls for server repositories, including good tools for configuration thereof
  • Automates sharing of branches trivially (some already do this, some less so)
  • Automates the common ways of using a DVCS, most importantly the usage model in which the DVCS is used as a better SVN with full offline capabilities
  • Guides users, if so configured, gently back toward a small number (one, in some cases) of main central branches, which is what most projects want
  • Communicates clearly what kind of project it can support well (most of them) and what kind it won’t support well (those with an enormous pile of huge files, of which most users only need a few)

(SVN itself is not without flaws. Ben lists some of them as areas in which improvements are coming, while others (such as, in my opinion, using the file namespace for branches and tags) are likely here to stay.)

In the next few years we will probably see one or more DVCS tools gain most or all of the features above. With these addressed, an important truth will become more obvious: distributed source control is, in most ways, a superset of centralized source control, and the latter can be thought of as a special case of the former.

That said, I think the DVCS movement will lose a bit of steam when SVN ships better merge support, if that merge support is sufficiently good. The merge “features” are certainly the biggest issue we have here with SVN.

Growing a Language, by Guy Steele

This is an oldie-but-goodie: Guy Steele’s “Growing a Language” talk from OOPSLA 1998.

It is amazing to me that Guy, who is something of a legend in language design, and who thinks so clearly about what makes a good language, was also key in designing Java. Java has been extremely slow to grow in the sense described in this talk, because for many years Sun resisted such growth. Only the rise of C# and the growing popularity of dynamic languages generated enough pressure to get Java unstuck… and in the last few years Java has become somewhat growable in the sense Guy describes.

Fix It So It Stays Fixed: An Example

A recurring theme in our projects is a desire to “fix things so they stay fixed”. I have in mind writing about that idea in detail later, but for now I’ll start with an example of how to do so.

A common and useful thing to do with disk storage space is to keep old copies of important data around. For example, we might keep the last 15 days of nightly backups of a database. This is easy to set up and helpful to have around. Unfortunately, sooner or later we discover that the process of copying a new backup to a disk managed this way fails because the disk is full: the ongoing growth of the backup files has reached a point where 15 old ones plus a new one do not fit.

How will we fix this?

Idea #1: Reduce 15 days to 10 days. Great, now it doesn’t fail for a while… but eventually it fills up with 10 of the now-larger files. It didn’t stay fixed.

Idea #2: Buy a bigger disk (maybe a huge disk, if money is abundant). A while later, it fills up. It didn’t stay fixed.

Idea #3: Set up an automated monitoring system, so that someone is informed when the disk is getting close to full. This is a big improvement, because hopefully someone will notice the monitor message and adjust it before it fails. But to me, it is not “fixed to stay fixed” because I will have to pay someone to adjust it repeatedly over time.

Idea #4: Sign up for Amazon S3, so we can store an unlimited number of files of unlimited size. This will probably stay fixed from a technical point of view, but it is highly broken in the sense that you get a larger and larger S3 invoice, growing without limit. To me, this means it didn’t stay fixed.

Idea #5: Dynamically decide how many old backups to keep.

The core problem with the common design I described above is the fixed N of old files to keep. The solution is to make that number dynamic; here is one way to do that (a sketch in code follows the list):

  • Make the old-file-deletion process look at the size of the most recent few files, and estimate the “max” of those plus some percentage as the likely maximum size of a new file.
  • Compare that to the free space.
  • If there is not enough free space, delete the oldest backup.
  • Loop back and try again.
  • Be careful with error checking, and put in some lower limit of how many files to preserve (perhaps 2 or 3).
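
Here is a minimal sketch of that loop in Python; the backup directory, the 20% growth margin, and the keep-at-least-3 floor are illustrative assumptions:

    import os
    import shutil

    BACKUP_DIR = "/backups/db"   # assumption: where the nightly backups land
    MIN_KEEP = 3                 # lower limit of old files to preserve
    GROWTH_MARGIN = 0.20         # assume the next file may be ~20% larger

    def prune_backups(backup_dir=BACKUP_DIR):
        # List backups oldest-first, by modification time.
        files = sorted(
            (os.path.join(backup_dir, name) for name in os.listdir(backup_dir)),
            key=os.path.getmtime,
        )
        if not files:
            return
        # Estimate the likely size of the incoming file: the max of the
        # most recent few, plus some percentage.
        needed = max(os.path.getsize(f) for f in files[-3:]) * (1 + GROWTH_MARGIN)
        # Delete the oldest backup and loop, until there is enough free
        # space; never delete below the safety floor.
        while (shutil.disk_usage(backup_dir).free < needed
               and len(files) > MIN_KEEP):
            os.remove(files.pop(0))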

Like all mechanisms, this one has limits. Eventually the daily file size may grow so large that it’s no longer possible to keep 1 or more copies on the disk; so in this sense it does not stay fixed; but it does stay fixed all the way up to the limit of the hardware, with no human intervention.

Fourteen Tools for a Productive Distributed Team

A geographically distributed software development team (“distributed team”, for short) is simply one where developers don’t work in close physical proximity (within a few hundred feet). In such a team you interact mostly via electronic means.

To some readers a distributed team will sound like an obviously ridiculous idea, while to others it will sound quite normal. A great many companies are distributed nowadays, as are essentially all open source projects, including the largest and most successful such projects. (Does open source seem like a toy to you? Not “real”? Consider this: in your real project you need large amounts of money to get everyone to work together to build something. Open source projects somehow accomplish this without the carrot and stick of money. Which is more impressive?)

Distributed teams appear to be most common in small firms, but certainly aren’t rare elsewhere. At some large firms, there are teams that work at the same campus or even in the same building, yet are inconveniently spread out enough that they are borderline distributed.

At Oasis Digital we operate mostly as a geographically distributed team, or rather as a set of teams (for various projects). We are scattered around our city, our country, and in a few cases, globally. We’ve learned a lot about how to succeed this way, summarized as 14 tools for a productive distributed team:

(1) Online meeting / telephone. Effective online meeting tools are available inexpensively or sometimes free; telephone calls, including mobile, long distance, and conference calls, are free or inexpensive. Either way, don’t hesitate to spend time together in live discussion, with at least voice and ideally video. Especially in the early stages of a project, this is by far the closest match to in-person interaction of all the tools in this list. At Oasis Digital, we typically use a regular weekly discussion as a baseline, with more discussions of all kinds as needed (frequently at the beginning of a project, less frequently over time).

(2) Instant Messaging. Beware that your time is valuable, though; don’t type out a long conversation: when it gets complex, talk live instead. The key value of IM is as a substitute for the awareness of who is available and working that would be obvious in a shared office.

(3) Email is especially helpful for creating a thread, a trail of key decisions. I suggest summarizing key decisions, especially those that span different teams, in a (hopefully concise) email – even if the decision was reached by meeting, phone, IM, in person, etc. As above, beware that email can consume an unlimited amount of time, and don’t hesitate to switch to live discussion if it gets complex.

(4) Airplane / Car / Train. The most powerful tool to help a distributed team work well is occasional in-person interaction.

(5) Documents. Communicate a complex idea, especially one that will need to be explained again and again, clearly in a document. (An aside on “template” documents: I’ve found them to be of limited value, though we use them when needed to meet customer requirements.)

(6) Screen Sharing. Using Remote Desktop, VNC, and other tools, it’s possible to do “pair programming” from half a world away. In our projects we do only a little of this, but a few hours of pair programming on a tricky area can work wonders. Screen sharing is also extremely useful for demonstrating new features, reproducing bugs, working with customers, and much more.

(7) Screencasts are video/audio recordings of the computer screen and a person talking; they are very useful for explaining a feature or module to another developer. Recording then viewing a screencast is not as effective as sitting alongside someone explaining in real-time, but it has the tremendous advantage of re-playability. A series of (high quality) screencasts explaining important parts of a system will get new team members up to speed quickly.

(8) Audio recordings. Record an audio explanation of a feature or bug, while walking around or driving, on an inexpensive hand-held digital voice recorder. Then send it to another developer as is, or have it transcribed. Again, it’s not quite as effective or high-fidelity (in either the audio or the ideas) as an in-person explanation, but it can be replayed, sent to multiple recipients, etc. A caveat here is that people read much faster than they listen, so if you have something long-winded (or long-lived and important) to say, have it transcribed, then edit it. You may find, as I have, that it’s far easier to compose a long, detailed explanation this way than via direct typing.

(9) Screen Shots. Don’t just tell; show. Show another developer on your team what you mean by taking a screenshot, then marking it up. Unlike a screen recording, it’s easy to go through a set of screenshots and update just one of them in the middle if things change.

(10) Mockups. A distributed team leaves more room for misunderstanding of desired results. Counteract this by building a mockup (paper, Excel, etc.) of what you want.

(11) Issue / Bug Tracking. If you don’t have an issue tracker for your team, stop reading this now and go get one: Mantis, Trac, Jira, there are hundreds to choose from. (Go ahead. I’ll wait.) A common approach for collocated teams is to use a tracking system loosely, updating it occasionally. In this usage model, tracker items act mostly as high-level tokens for more detailed information exchanged in conversations. This approach can be made to work on a distributed team, but in my experience it is unwise. Instead, get every issue into your tracker and keep it up to date aggressively. Don’t let the sun go down with your issue tracker mismatching reality. Assume that others on your team will check the issue tracker, then make important decisions relying on the data therein. If they see something that was fixed 3 days ago but is still marked as broken, bad decisions can easily result. Anyone on the team should be able to see the essential status of every current issue in the tracker, without a time-consuming process of asking someone about each item.

(12) Source Control. You’re already using this, of course. For the future, consider a distributed source control tool (bzr, git, svk, Mercurial, etc.) for your distributed team – if you’ve only used centralized tools (CVS, SVN, ClearCase, etc.) you’ll be surprised how helpful it is to have the project version history available locally, among other benefits. (See Linus explain distributed source control.)

(13) Code Review. Some collocated teams enjoy social camaraderie, but squander the benefits of proximity by working in technical isolation. Your distributed team can outperform them easily: avoid technical isolation. Read each other’s code, comment on it, learn from each other.

(14) Automated Builds. Surely there is hardly anyone left by 2007 who isn’t using an automated, continuous build process on a server somewhere. The value of this is amplified in a distributed team, because the stream of successful builds (we fire off a build after each commit) helps keep everyone in sync.
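
For the build-per-commit approach mentioned above, a Subversion post-commit hook is one common trigger mechanism. Here is a minimal sketch in Python; the build-server URL and its query parameters are illustrative assumptions, since each CI tool has its own trigger interface:

    #!/usr/bin/env python
    # Subversion invokes post-commit hooks with two arguments:
    # the repository path and the revision number just committed.
    import sys
    import urllib.request

    def main():
        repos, rev = sys.argv[1], sys.argv[2]
        repo_name = repos.rstrip("/").rsplit("/", 1)[-1]
        # Hypothetical trigger URL; substitute your build server's own.
        url = ("http://buildserver.example.com/trigger?repo=%s&rev=%s"
               % (repo_name, rev))
        urllib.request.urlopen(url, timeout=10)

    if __name__ == "__main__":
        main()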

A related but separate question is where distributed team members work: in their homes, in offices outside their homes, in public spaces, etc. I believe the best answer is a blend of these; I have mentioned before hiding in a cafe to get some work done, and I’ll write more about the Where question in a later post.

Another related question is whether any of this is even a good idea; stay tuned to read more on this as well.

My name is Kyle, and I’m an Infoholic

I recently read Tim Ferriss’s book The Four Hour Work Week, colloquially called 4HWW. The book is short, dense with ideas, and easily worth the $12 price. I recommend the book in spite of:

  • Questions about the veracity of Ferriss’s claimed accomplishments
  • Criticisms that some of his techniques are not as broadly applicable as he makes them sound
  • The fact that the author apparently fell for a bogus chain-letter email and reprinted it on page 284. Oops – how embarrassing!
  • My guess that he’s spent more like 80 hours per week promoting his book over the last few months, with many media appearances, interviews, etc.

Among his main points (outsource more, delegate more, sell products rather than services, travel, etc.), the key idea that stood out for me is the “low-information diet”: read less, watch less, surf the web less. This is nothing new of course (I even touched on it myself in an earlier post), but Ferriss makes a compelling case.

Unfortunately, upon self-examination the truth hurts:

  • I read too many books, even though I’ve gotten rid of many recently
  • I read too many magazines.
  • I read too many web sites.
  • I subscribe to too many RSS/Atom feeds.
  • I check email too often.

In my defense, I also somehow write a lot of software, solve many customer problems, and much of the information I consume is at least tangentially related to those sources of value. I read quickly, and I don’t watch television, so this excessive consumption is not as time consuming as it could be.

Still, I need something closer to Ferriss’s low-information diet. I don’t have the guts to go cold turkey, and part of the service we offer to our customers is fast response to problems, so I won’t go as far as he suggests. I will spend less time consuming input and more time producing output.

Update in 2009: This remains an ongoing struggle, but I quite often manage entire days of mostly producing, consuming only in short breaks.

YouTube Scalability Talk

Cuong Do of YouTube / Google recently gave a Google Tech Talk on scalability.

I found it interesting in light of my own comments on YouTube’s 45 TB a while back.

Here are my notes from his talk, a mix of what he said and my commentary:

In the summer of 2006, they grew from 30 million pages per day to 100 million pages per day, in a 4 month period. (Wow! In most organizations, it takes nearly 4 months to pick out, order, install, and set up a few servers.)

YouTube uses Apache for FastCGI serving. (I wonder if things would have been easier for them had they chosen nginx, which is apparently wonderful for FastCGI and less problematic than Lighttpd.)

YouTube is coded mostly in Python. Why? “Development speed critical”.

They use psyco, a Python-to-C compiler, and also C extensions, for performance-critical work.
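
For reference, psyco’s canonical usage was just a couple of lines (a Python 2-era project, long unmaintained):

    import psyco
    psyco.profile()   # compile only functions that prove hot at runtime
    # (or psyco.full() to aggressively compile everything psyco can handle)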

They use Lighttpd for serving the video itself, for a big improvement over Apache.

Each video is hosted by a “mini cluster”, which is a set of machines with the same content. This is a simple way to provide headroom (slack), so that a machine can be taken down for maintenance (or can fail) without affecting users. It also provides a form of backup.

The most popular videos are on a CDN (Content Distribution Network) – they use external CDNs as well as Google’s CDN. Requests to their own machines are therefore tail-heavy (in the “Long Tail” sense), because the head goes to the CDN instead.

Because of the tail-heavy load, random disk seeks are especially important (perhaps more important than caching?).

YouTube uses simple, cheap, commodity hardware. The more expensive the hardware, the more expensive everything else gets (support, etc.). Maintenance is mostly done with rsync, SSH, and other simple, common tools.

The fun is not over: Cuong showed a recent email titled “3 days of video storage left”. There is constant work to keep up with the growth.

Thumbnails turn out to be surprisingly hard to serve efficiently. Because there are, on average, 4 thumbnails per video and many thumbnails per page, the overall number of thumbnails per second is enormous. They use a separate group of machines to serve thumbnails, with extensive caching and OS tuning specific to this load.

YouTube was bitten by a “too many files in one dir” limit: at one point they could accept no more uploads (!!) because of it. The first fix was the usual one: split the files across many directories, and switch to a file system better suited to many small files.
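
The usual directory split looks something like this sketch, which fans files out across two directory levels by a hash of the name (the 256-way fanout per level is an illustrative choice, not YouTube’s actual layout):

    import hashlib
    import os

    def fanout_path(root, filename):
        # Hash the name, then use the first hex bytes as directory levels,
        # e.g. root/a3/7f/video123.jpg. This caps the file count in any
        # one directory regardless of the total number of files.
        digest = hashlib.md5(filename.encode()).hexdigest()
        subdir = os.path.join(root, digest[:2], digest[2:4])
        os.makedirs(subdir, exist_ok=True)
        return os.path.join(subdir, filename)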

Cuong joked about “the Windows approach of scaling: restart everything”.

Lighttpd turned out to be poor for serving the thumbnails, because its main loop is a bottleneck for loading files from disk; they addressed this by modifying Lighttpd to add worker threads to read from disk. This was good, but still not good enough: with one thumbnail per file, the enormous number of files was terribly slow to work with (imagine tarring up many millions of files).

Their new solution for thumbnails is to use Google’s BigTable, which provides high performance for a large number of rows, fault tolerance, caching, etc. This is a nice (and rare?) example of actual synergy in an acquisition.

YouTube uses MySQL to store metadata. Early on, they hit a Linux kernel issue which prioritized the page cache over application data: it swapped out the app data, totally overwhelming the system. They recovered from this by removing the swap partition (while live!). This worked.

YouTube uses Memcached.

To scale out the database, they first used MySQL replication. Like everyone else who goes down this path, they eventually reached a point where replicating the writes to all the DBs used up all the capacity of the slaves. They also hit an issue with threading and replication, which they worked around with a very clever “cache primer thread” working a second or so ahead of the replication thread, prefetching the data it would need.

As the replicate-one-DB approach faltered, they resorted to various desperate measures, such as splitting video watching onto a separate set of replicas, intentionally allowing the non-video-serving parts of YouTube to perform badly so as to focus on serving videos.

Their initial MySQL DB server configuration had 10 disks in a RAID10. This does not work very well, because the DB/OS can’t take advantage of the multiple disks in parallel. They moved to a set of RAID1s, appended together. In my experience, this is better, but still not great. An approach that usually works even better is to intentionally split different data onto different RAIDs: for example, a RAID for the OS/application, a RAID for the DB logs, one or more RAIDs for the DB tables (use “tablespaces” to get your #1 busiest table on separate spindles from your #2 busiest table), one or more RAIDs for indexes, etc. Big-iron Oracle installations sometimes take this approach to extremes; the same thing can be done with free DBs on free OSs too.

In spite of all these efforts, they reached a point where replication of one large DB was no longer able to keep up. Like everyone else, they figured out that the solution was database partitioning into “shards”. This spreads reads and writes across many different databases (on different servers) that are not all replaying each other’s writes. The result is a large performance boost, better cache locality, etc. YouTube reduced their total DB hardware by 30% in the process.

It is important to divide users across shards by a controllable lookup mechanism, not only by a hash of the username/ID/whatever, so that you can rebalance shards incrementally.
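
Here is a minimal sketch of that lookup-based (rather than hash-only) sharding; the in-memory table and shard names are illustrative assumptions (in practice the mapping lives in a small, heavily cached database):

    # With pure hashing (user_id % number_of_shards), adding a shard
    # moves almost every user at once. A lookup table lets you move one
    # user (or one batch) at a time.
    shard_of_user = {}             # user_id -> shard name; persist this
    shards = ["shard0", "shard1"]

    def lookup_shard(user_id):
        shard = shard_of_user.get(user_id)
        if shard is None:
            # New users can still be placed by hash (or by current load).
            shard = shards[user_id % len(shards)]
            shard_of_user[user_id] = shard
        return shard

    def rebalance(user_id, new_shard):
        # Since all reads go through the table, moving a user means:
        # copy their rows to new_shard, then flip one pointer.
        shard_of_user[user_id] = new_shard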

An interesting DMCA issue: YouTube complies with takedown requests, but sometimes the videos are cached way out on the “edge” of the network (their caches, and other people’s caches), so it’s hard to make a video disappear globally right away. This sometimes angers content owners.

Early on, YouTube leased their hardware.