A Brief Introduction to Distributed Version Control

Last night at SLUUG, I gave a talk on distributed source control tools. It was quite introductory, but the notes (below) may still be helpful. The notes were on a handout at the talk; as usual, I didn’t use slides.

Unfortunately I didn’t get an audio recording of this talk, so no transcript either.

About 30 people were in attendance. Nearly 100% were familiar with CVS and SVN, and perhaps 20% with other tools (ClearCase, SourceSafe, and others). Only 4 had ever used branch/merge in any project or tool!

Fix It So It Stays Fixed: An Example

A recurring theme in our projects is a desire to “fix things so they stay fixed”. I have in mind writing about that idea in detail later, but for now I’ll start with an example of how to do so.

A common and useful thing to do with disk storage space is to keep old copies of important data around. For example, we might keep the last 15 days of nightly backups of a database. This is easy to set up and helpful to have around. Unfortunately, sooner or later we discover that the process of copying a new backup to a disk managed this way fails because the disk is full: the backup files have grown to the point where 15 old ones plus a new one no longer fit.

How will we fix this?

Idea #1: Reduce 15 days to 10 days. Great, now it doesn’t fail for a while… but eventually it fills up with 10 of the now-larger files. It didn’t stay fixed.

Idea #2: Buy a bigger disk (maybe a huge disk, if money is abundant). A while later, it fills up. It didn’t stay fixed.

Idea #3: Set up an automated monitoring system, so that someone is informed when the disk is getting close to full. This is a big improvement, because hopefully someone will notice the monitor message and adjust the retention before the disk fills. But to me, it is not “fixed to stay fixed” because I will have to pay someone to adjust it repeatedly over time.

Idea #4: Sign up for Amazon S3, so we can store an unlimited number of files, of unlimited size. This will probably stay fixed from a technical point of view, but it is highly broken in the sense that you get a larger and larger S3 invoice, growing without limit. To me, this means it didn’t stay fixed.

Idea #5: Dynamically decide how many old backups to keep.

The core problem with the common design I described above is the fixed number N of old files to keep. The solution is to make that number dynamic; here is one way to do that, sketched in code after the list:

  • Make the old-file-deletion process look at the size of the most recent few files, and estimate the “max” of those plus some percentage as the likely maximum size of a new file.
  • Compare that estimate to the free space.
  • If there is not enough free space, delete the oldest backup.
  • Loop back and try again.
  • Be careful with error checking, and put in some lower limit on how many files to preserve (perhaps 2 or 3).
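
Here is a minimal sketch of that loop in Python. The directory name, the 25% margin, and the keep-at-least-3 floor are illustrative assumptions, and it presumes backup file names sort oldest-first (e.g. db-2007-06-01.dump):

import os
import shutil

BACKUP_DIR = "/backups/db"  # hypothetical location
MIN_KEEP = 3                # never delete below this many old backups
MARGIN = 0.25               # assume the next file may be 25% larger than recent ones

def make_room_for_next_backup():
    # Names like db-2007-06-01.dump sort oldest-first lexicographically.
    backups = sorted(
        os.path.join(BACKUP_DIR, name) for name in os.listdir(BACKUP_DIR)
    )
    if not backups:
        return True  # nothing to estimate from; let the copy proceed

    # Estimate the next file's size: the max of the most recent few, plus a margin.
    recent = backups[-5:]
    estimated_next = max(os.path.getsize(p) for p in recent) * (1 + MARGIN)

    # Delete the oldest backup and re-check, but never go below the floor.
    while shutil.disk_usage(BACKUP_DIR).free < estimated_next:
        if len(backups) <= MIN_KEEP:
            return False  # hardware limit reached; time for human intervention
        os.remove(backups.pop(0))
    return True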

Like all mechanisms, this one has limits. Eventually the daily file size may grow so large that it’s no longer possible to keep even a few copies on the disk; in that sense it does not stay fixed either. But it does stay fixed all the way up to the limit of the hardware, with no human intervention.

Yet Another Python Success Story

Is it OK to use programming language X in a production enterprise application? Or are fear, uncertainty, and doubt holding you back? Public “success stories” might make it more acceptable for you to do so in your environment.

In that spirit I offer our story of a production Python deployment at an Oasis Digital customer (without names or details, to protect their privacy). There are many other success stories at Python.org. In this project, the client and application server (the bulk of the system) are written in Delphi (which was much more popular when the project started than it is today). A major subsystem (roughly 1/3rd of the overall system) is written in Python. It consists of a set of modules that parse textual data from a large number of varied formats into a common schema, another set of modules to apply (frequently adjusted) business rules, and a third set of miscellaneous modules. These are all used in background data processing, not as part of a client application or a server handling requests from a client application. These modules interact with the rest of the system primarily through state stored in a database. I generally recommend against the database integration style between separate applications, but it works well in this scenario within modules of the same application, built and maintained concurrently by the same team.

The Good

We chose Python for this subsystem for a variety of reasons. First, its built-in features are well suited to the text processing task at hand. Python’s “batteries included” approach has generally avoided the need to find or implement add-on text processing tools (which would have been necessary in Delphi); thus a programmer needs to know and use “just” what’s in the Python box, with few external libraries to consider.
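
As an illustration (invented for this post, not the customer’s code), parsing a few hypothetical feed formats into one common schema takes only the standard library:

import csv
import re
from io import StringIO

# Three invented input formats, all normalized to {"account", "amount"}.

def parse_csv_feed(text):
    for row in csv.DictReader(StringIO(text)):
        yield {"account": row["acct"], "amount": float(row["amt"])}

def parse_fixed_width_feed(text):
    # Hypothetical layout: account in columns 0-9, amount in columns 10-21.
    for line in text.splitlines():
        yield {"account": line[:10].strip(), "amount": float(line[10:22])}

def parse_keyvalue_feed(text):
    pattern = re.compile(r"(\w+)=(\S+)")
    for line in text.splitlines():
        fields = dict(pattern.findall(line))
        yield {"account": fields["acct"], "amount": float(fields["amt"])}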

Second, Python’s built-in features and compact syntax have, in our estimation, shortened the programming time considerably compared to what it would otherwise have been. It takes relatively few lines of Python to get the job done. We have many lines of Python, and would have had many more had we used a lower-level language. (Of course lines of code is not everything; it’s possible to come up with dense, bad code. As a general rule, though, a language in which you can express what you need to express more succinctly is better.)

Third, Python’s interpreted nature keeps the edit-test cycles short, further speeding development. This development speed issue is especially important given the niche this project occupies, in which data format changes and rule changes sometimes arrive with no notice: new data arrives, and part of the application does not work until it is enhanced to handle the new data. Extensive use of automated unit and integration tests (many hundreds of test cases) effectively prevents the interpreted operation from causing trivial runtime errors (type errors, syntax errors).

Of course, some other languages with similarly compact syntax, included libraries, and high-level features would have worked equally well. At the time (5+ years ago), though, Python appeared to have the most momentum, other than Perl. Perl would also have been a good choice from a technical point of view, but it had an (unwarranted) reputation of being hard to maintain, which I didn’t want to have to overcome.

The Bad

There are downsides to our two-language approach. The first is somewhat Python specific, and really not a big deal: Python is slow. Its primitives are fast, but when you write considerable Python code to do something, it does that something at a rather leisurely pace compared to Delphi or Java or C++, using a lot more CPU along the way. The practical impact of this has been limited, because the bottleneck on this system is not the Python code, it is the database; but still, this has been inconvenient, and has required our customer to deploy multiple machines for this subsystem where one would have been sufficient with a more efficient language/runtime. Doing so is not particularly expensive, though, and adds a measure of reliability, so we haven’t had a need to speed things up with Psyco, C modules, etc.

The other issue is more serious. It is created by the large gap in language style and features between Delphi vs. Python in particular, and low-level vs. scripting languages in general. (Those of you unfamiliar with Delphi may be thinking this is because Delphi is some hideous VB-like toy. Wrong. Delphi is a somewhat C++-like or Java-like language, statically compiled, fast, and sadly burdened with a Pascal syntax.) Personally, this gap bothers me not at all. I’ve written production code in assembly, C, C++, Delphi, Java, Python, Ruby, Javascript, Lua, a bit of Prolog, and others I forget right now; I am happy to use one language in the morning and another radically different one in the afternoon.

I have discovered that most developers are not like me, though. Most Delphi developers are notably uninterested in Python, and vice versa. As a result, our project team has ended up divided along the same lines as the software, with some cross-training but relatively little production development crossover between the developers working in each language. This is an obstacle to any developer taking end-to-end responsibility for features or issues that span the languages, and also an obstacle to hiring.

Python itself is not much of an obstacle to hiring: while there are far fewer Python programmers than Java (for example) programmers, there are also far fewer Python jobs than Java jobs.

The Verdict

In spite of the downsides discussed above, overall it has been a “win”, technically, to use two languages (each well suited to part of the application) in this project. More importantly, I am also confident this choice has been a win for our customer: they got a system delivered faster, and at lower cost, than they otherwise would have. They used every bit of speed we could deliver, to win business from their competitors.

However, the world has improved a lot since this decision was made; today I could probably choose a single language / toolset which meets all the needs sufficiently well, and thus avoid the downsides of the two-language solution. Alternatively, if starting today I might build the infrastructure for all subsystems in the same base language, with hooks to use Lua or Javascript scripting to accommodate the need for rapid runtime logic changes. It’s even possible that we would port the existing code to another language in the future – which would not make the original decision a mistake.

Fourteen Tools for a Productive Distributed Team

A geographically distributed software development team (“distributed team”, for short) is simply one where developers don’t work in close physical proximity (within a few hundred feet). In such a team, you interact mostly via electronic means.

To some readers a distributed team will sound like an obviously ridiculous idea, while to others it will sound quite normal. A great many companies are distributed nowadays, as are essentially all open source projects, including the largest and most successful such projects. (Does open source seem like a toy to you? Not “real”? Consider this: in your real project you need large amounts of money to get everyone to work together to build something. Open source projects somehow accomplish this without the carrot and stick of money. Which is more impressive?)

Distributed teams appear to be most common in small firms, but certainly aren’t rare elsewhere. At some large firms, there are teams that work at the same campus or even in the same building, yet are inconveniently spread out enough that they are borderline distributed.

At Oasis Digital we operate mostly as a geographically distributed team, or rather as a set of teams (for various projects). We are scattered around our city, our country, and in a few cases, globally. We’ve learned a lot about how to succeed this way, summarized as 14 tools for a productive distributed team:

(1) Online meeting / telephone. Effective online meeting tools are available inexpensively or sometimes free; telephone calls, including mobile, long distance, and conference calls, are free or inexpensive. Either way, don’t hesitate to spend time together in live discussion, with at least voice and ideally video. Especially in the early stages of a project, this is by far the closest match to in-person interaction of all the tools in this list. At Oasis Digital, we typically use regular weekly discussions as a baseline, with more discussions of all kinds as needed (frequently at the beginning of a project, less frequently over time).

(2) Instant Messaging. Beware that your time is valuable, though; don’t type out a long conversation, and when it gets complex, talk live instead. The key value in IM is as a substitute for the awareness of who is available and working that would be obvious in a shared office.

(3) Email is especially helpful to create a thread, a trail of key decisions. I suggest summarizing key decisions, especially those that span different teams, in a (hopefully concise) email – even if the decision was reached by meeting, phone, IM, in person, etc. As above, beware that email can consume an unlimited amount of time, and don’t hesitate to switch to live discussion if it gets complex.

(4) Airplane / Car / Train. The most powerful tool to help a distributed team work well is occasional in-person interaction.

(5) Documents. Communicate a complex idea, especially one that will need to be explained again and again, clearly in a document. (An aside on “template” documents: I’ve found them to be of limited value, though we use them when needed to meet customer requirements.)

(6) Screen Sharing. Using Remote Desktop, VNC, and other tools, it’s possible to do “pair programming” from half a world away. In our projects we do only a little of this, but a few hours of pair programming on a tricky area can work wonders. Screen sharing is also extremely useful for demonstrating new features, reproducing bugs, working with customers, and much more.

(7) Screencasts are video/audio recordings of the computer screen and a person talking; they are very useful for explaining a feature or module to another developer. Recording then viewing a screencast is not as effective as sitting alongside someone explaining in real-time, but it has the tremendous advantage of re-playability. A series of (high quality) screencasts explaining important parts of a system will get new team members up to speed quickly.

(8) Audio recordings. Record an audio explanation of a feature or bug, while walking around or driving, on an inexpensive hand-held digital voice recorder. Then send it to another developer as is, or have it transcribed. Again, it’s not quite as effective or high-fidelity (of either the audio or the ideas) as an in-person explanation, but can be replayed, sent to multiple recipients, etc. A caveat here is that people read much faster than they listen, so if you have something long-winded (or long-lived, important) to say, have it transcribed then edit it. You may find, as I have, that it’s far easier to compose a long detailed explanation this way, than via direct typing.

(9) Screen Shots. Don’t just tell; show. Show another developer on your team what you mean by taking a screenshot and marking it up. Unlike a screen recording, it’s easy to go through a set of screenshots and update just one in the middle if things change.

(10) Mockups. A distributed team leaves more room for misunderstanding of desired results. Counteract this by building a mockup (paper, Excel, etc.) of what you want.

(11) Issue / Bug Tracking. If you don’t have an issue tracker for your team, stop reading this now and go get one: Mantis, Trac, Jira, there are hundreds to choose from. (Go ahead. I’ll wait.) A common approach for collocated teams is to use a tracking system loosely, updating it occasionally. In this usage model, tracker items act mostly as high-level tokens for more detailed information exchanged in conversations. This approach can be made to work on a distributed team, but in my experience it is unwise. Instead, get every issue into your tracker and keep it up to date aggressively. Don’t let the sun go down with your issue tracker mismatching reality. Assume that others on your team will check the issue tracker and then make important decisions relying on the data therein. If they see something that was fixed 3 days ago but is still marked as broken, bad decisions can easily result. Anyone on the team should be able to see the essential status of every current issue in the tracker, not by a time-consuming process of asking someone about each item.

(12) Source Control. You’re already using this, of course. For the future, consider a distributed source control tool (bzr, git, svk, Mercurial, etc.) for your distributed team – if you’ve only used centralized tools (CVS, SVN, ClearCase, etc.) you’ll be surprised how helpful it is to have the project version history available locally, among other benefits. (See Linus explain distributed source control.)

(13) Code Review. Some collocated teams enjoy social camaraderie, but squander the benefits of proximity by working in technical isolation. Your distributed team can outperform them easily: avoid technical isolation. Read each other’s code, comment on it, learn from each other.

(14) Automated Builds. Surely there is hardly anyone left by 2007 who isn’t using an automated, continuous build process on a server somewhere. The value of this is amplified in a distributed team, because the stream of successful builds (we fire off a build after each commit) helps keep everyone in sync.

A related but separate question is where distributed team members work: in their homes, in offices outside their homes, in public spaces, etc. I believe the best answer is a blend of these; I’ve mentioned before hiding in a cafe to get some work done, and I’ll write more about the Where question in a later post.

Another related question is whether any of this is even a good idea; stay tuned to read more on this as well.

Help! My Hierarchy is Slow – Faster Hierarchies with Nested Sets

A great many applications, including many that I’ve worked on, have a hierarchy of things: of parts, of people, of organizations, etc. The way most of us represent such hierarchy is with the first thing that generally comes to mind: make each Widget have a parent Widget, with a table like so:

create table widget (widget_id int, parent_widget_id int, other_fields_here);

This representation is called an “adjacency list”, and it is simple and easy. You can readily build a tool to manipulate a hierarchy stored this way. Many off-the-shelf visual components, for both client-side and web applications, know how to manipulate hierarchies represented this way. Some reporting tools know how to report on hierarchies represented this way.

However, for answering common questions like “who all is under person X in the hierarchy”, the adjacency list approach is unwieldy and slow.
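
To see why, consider what answering that question takes with only the adjacency table above. Without vendor-specific recursive SQL, the application must query level by level; a sketch (using SQLite for illustration, against the widget table defined above):

import sqlite3

def descendants(conn, root_id):
    # One round trip to the database per level of the hierarchy;
    # deep or wide trees get slow quickly.
    found = set()
    frontier = [root_id]
    while frontier:
        query = ("SELECT widget_id FROM widget WHERE parent_widget_id IN (%s)"
                 % ",".join("?" * len(frontier)))
        rows = conn.execute(query, frontier).fetchall()
        frontier = [wid for (wid,) in rows if wid not in found]
        found.update(frontier)
    return found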

There are various other approaches to representing a hierarchy, most of them discussed in detail in Joe Celko’s articles and books, prominently in the book Trees and Hierarchies in SQL for Smarties. If you work with SQL and hierarchies, buy this book now.

One approach Celko is especially fond of is the “nested set” representation. You can read about it online here and here.

Of course, changing an entire application to use nested sets might be a very big deal in a mature application. That’s OK; in most cases we can get much of the benefit by building a nested-sets “cache” of the shape of the hierarchy, with a table like so:

create table widget_hier_cache (widget_id int, lft int, rgt int); -- lft/rgt avoid the SQL reserved words LEFT and RIGHT

Each time the hierarchy has changed, or before each time we need to run complex queries, delete the rows in this cache table and repopulate them based on the current canonical adjacency data. Celko offers SQL code in his book to do that, which could be translated to work in the stored procedure language of the DBMS at hand. But what about DBMSs that don’t offer stored procedures, such as lightweight local databases (SQLite), MySQL, etc.? The translation must be done in application code instead.

I wrote such code in Delphi a while back, in the process of getting a full understanding of this problem; I’ve cleaned it up and now offer code for download here (DelphiAdjacencyNestedSets.zip), under an open source license (MIT license – use it all you want). I tested this today with Delphi 2007 Win32, but it should work fine at least back to Delphi 7. As far as I can tell with some searching, this is the only Delphi code for translating adjacency to nested sets available on the internet. This code doesn’t know about databases – it is a module to which you feed adjacency data, and from which you get back nested set data. It includes DUnit test cases.

I’ve also put the code on github, for easy browsing and forking.

(Update in August 2007: a new version (DelphiAdjacencyNestedSets2.zip) optionally tolerates “orphan” nodes and forests. Update in January 2008: a newer version (DelphiAdjacencyNestedSets3.zip) propagates an integer value down the hierarchy; it is named Tag as a nod to the .Tag property on VCL components. Both of these newer versions were tested with Delphi 2006/2007 also.)

The essence of the translation is a depth-first traversal of the hierarchy, and of course this can be easily implemented in other languages; the Delphi code is easy to understand, so don’t be afraid to take a look even if you need some Java or C# or PHP etc. I also stumbled across this PHP nested sets implementation, which offers a set of functions to maintain (insert, update, etc.) a hierarchy stored as nested sets, rather than only translate from adjacency to nested sets.
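
For example, a Python rendering of that traversal might look like this (a sketch; the downloadable Delphi module remains the tested reference). It takes a {child_id: parent_id} mapping, with None marking the root, and returns {widget_id: (left, right)}:

def adjacency_to_nested_sets(parent_of):
    # Invert the adjacency data into parent -> children lists.
    children = {}
    roots = []
    for node, parent in parent_of.items():
        if parent is None:
            roots.append(node)
        else:
            children.setdefault(parent, []).append(node)

    numbering = {}
    counter = 1

    def visit(node):
        # Depth-first: assign the left number on the way down,
        # the right number after all children are numbered.
        nonlocal counter
        left = counter
        counter += 1
        for child in sorted(children.get(node, [])):
            visit(child)
        numbering[node] = (left, counter)
        counter += 1

    for root in sorted(roots):
        visit(root)
    return numbering

# Example: root 1 with children 2 and 3 -> {2: (2, 3), 3: (4, 5), 1: (1, 6)}
print(adjacency_to_nested_sets({1: None, 2: 1, 3: 1}))

With the cache populated, “everyone under X” collapses to a single range query, something like: SELECT c.widget_id FROM widget_hier_cache c, widget_hier_cache p WHERE p.widget_id = ? AND c.lft BETWEEN p.lft AND p.rgt.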

Another useful way to represent a hierarchy for fast querying is with a transitive closure table. I’ll write this up in a future post; it turns out to be especially useful (and necessary) to make arbitrary hierarchies work in the Mondrian OLAP server.

Pipe RGB data to ffmpeg

A while back I asked on the ffmpeg mailing list how to pipe RGB data in to ffmpeg. I described it as follows:

in my code I am building video frames, 720x480x24bit. I have in mind generating a large number of these, as long as a full DVD worth at 30fps, then using ffmpeg (followed by dvdauthor) to encode them in to MPEG2 for DVD usage.

There were a few replies, but no definitive answer. With considerable experimentation, I got it to work. It turns out that (as far as I can tell) ffmpeg does not have the ability to accept piped-in RGB frames. It will however accept piped-in data in its “yuv4mpegpipe” format. With some searching and reading I found that this is roughly akin to the format of raw DV video; the stream begins with a header something like this:

YUV4MPEG2 W%d H%d F%d:%d Ip A0:0 C420mpeg2 XYSCSS=420MPEG2

… then an LF character. Each frame is then introduced by a short “FRAME” marker line, followed by data for the Y, U, and V “planes”. The Y data is full resolution, while the U and V are half-resolution (this is called “420” in the video world). These planes are uncompressed, one byte per pixel. All of my past work with computer video (going back to Commodore 64s and Apple IIs) has arranged all of the bits for each pixel within a few bytes of each other; this format (with all the Y data for the whole frame, then all the U data, then all the V data) is starkly different.

The essential problem remaining was how to convert RGB to YUV. Happily there are plenty of online references for this. Unhappily there are few fast implementations, and a naive implementation will be very slow. I solved this problem by finding and hiring an expert in low-level data processing with MMX, SSE2, etc. instructions. (I am not in a position to publish that code here.)
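
For reference, a naive version of the math (using the common BT.601 coefficients; exact range scaling details are glossed over here) might look like this in Python with numpy. This is exactly the kind of straightforward implementation that turns out to be slow compared to MMX/SSE2 or IPP code doing the same arithmetic:

import numpy as np

def rgb_to_yuv420(rgb):
    # rgb: (height, width, 3) uint8 array; height and width must be even.
    rgb = rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.169 * r - 0.331 * g + 0.5 * b + 128.0
    v = 0.5 * r - 0.419 * g - 0.081 * b + 128.0

    def subsample(plane):
        # 4:2:0 subsampling: average each 2x2 block down to one sample.
        return (plane[0::2, 0::2] + plane[0::2, 1::2]
                + plane[1::2, 0::2] + plane[1::2, 1::2]) / 4.0

    def to_bytes(plane):
        return np.clip(plane, 0, 255).astype(np.uint8).tobytes()

    return to_bytes(y), to_bytes(subsample(u)), to_bytes(subsample(v))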

In retrospect, though, there are routines included in Intel’s “Integrated Performance Primitives” library which perform this transformation in a highly optimized way. IPP is a bargain: for only a few hundred dollars you get a wealth of highly optimized, ready-to-use library routines for signal processing.

The ffmpeg piping solution consists, therefore, of:

  1. A module which generates frames in RGB format, containing whatever contents your application requires.
  2. A module to very quickly convert these to YUV in yuv4mpegpipe format (write your own, or use routines in IPP, for the RGB->YUV420 part).
  3. Pipe this data stream to ffmpeg via stdin; ffmpeg is invoked something like this (sketched in code below): ffmpeg -y -f yuv4mpegpipe -i - -i audio.mp3 -target ntsc-dvd -aspect 4:3 foo.mpg
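
Putting the pieces together, here is a minimal Python sketch of the piping step, assuming NTSC 720x480 at 30000/1001 fps and the rgb_to_yuv420 helper sketched earlier. Note the stream header written once, then a FRAME marker per frame:

import subprocess

WIDTH, HEIGHT = 720, 480  # NTSC DVD frame size

def encode_to_dvd_mpeg2(frames, audio_path, out_path):
    # frames: an iterable of (HEIGHT, WIDTH, 3) uint8 RGB arrays.
    cmd = ["ffmpeg", "-y", "-f", "yuv4mpegpipe", "-i", "-",
           "-i", audio_path, "-target", "ntsc-dvd", "-aspect", "4:3", out_path]
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)

    # Stream header, once.
    header = ("YUV4MPEG2 W%d H%d F30000:1001 Ip A0:0 C420mpeg2 XYSCSS=420MPEG2\n"
              % (WIDTH, HEIGHT))
    proc.stdin.write(header.encode("ascii"))

    # Then a FRAME marker and the three raw planes, per frame.
    for rgb in frames:
        y, u, v = rgb_to_yuv420(rgb)
        proc.stdin.write(b"FRAME\n")
        proc.stdin.write(y)
        proc.stdin.write(u)
        proc.stdin.write(v)

    proc.stdin.close()
    proc.wait()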

By using a multicore CPU and threads, this whole process can be made to happen in real time or better (i.e., one second of “wall clock” processing time, for one second of finished MPEG2 video). The resulting MPEG2 file can be used with a DVD authoring application to produce a ready-to-burn DVD ISO image.

Update: the data format above is published here as part of the mjpegtools man pages.