Great Developers, Projects That Sound Boring

I’ve been a fan of Joel Spolsky for years, though I haven’t agreed with everything he’s written, and even mocked him a bit. Joel has written at length on his web site and in print about attracting the best developers, and one aspect of that has bothered me:

How do you attract top developers to work on something that sounds rather dull, like a bug tracking application? It mostly shuffles data back and forth between screens and reports and database tables – far too boring for top developers.

Of course that’s an exaggeration, but a relevant one: at Oasis Digital much of our work is on enterprise business process automation, database-centric applications, and could likewise be described casually (though not accurately) as “just” shuffling data between screens and tables. I worry that our work will not sound interesting to prospective hires.

This week at the Business of Software conference I got a chance to ask (confront?) Joel about this. He offered a great, four-part answer, which I present here with my own additions mixed in. I don’t have careful notes about which bits came from Joel, so you are welcome to give him all the credit and me all the blame.

1) There is Interesting Technology Inside

Even in an application which, at first glance, just shuffles data around, there can turn out to be a lot of very interesting work inside. This is true of FogBugz and it is true of our work at Oasis Digital as well. Here are some examples of interesting work here, all of it inside dull-sounding applications:

  • Process metadata and generate code and GUI elements. Top developers are the ones who solve a family of problems with generic code and metadata, rather than tediously solving them one at a time.
  • Process large hierarchies efficiently using Celko’s nested sets representation technique (a minimal sketch follows this list). Top developers are all about using better data structures.
  • Custom GUI components to provide a drag-drop, direct-manipulation approach to visualizing and modifying data. The result has both a high “wow” factor and is genuinely useful – a winning combination for top developers.
  • Integrate a Prolog-based rules mechanism to provide a vital algorithm in one page of code that would otherwise have required countless pages of code and hundreds to thousands of hours of work. Using a radically different language to solve a problem with a small fraction of the effort… exactly the sort of thing a top developer wants to do.
  • Generic data replication mechanism: building our own was certainly more interesting work than adopting one off the shelf.
  • Learn how OLAP works, implement an OLAP ETL process.
  • … I could list many more examples
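
To make the nested sets item above concrete, here is a minimal sketch using SQLite, with invented table and column names rather than anything from a real project. Each node stores a left/right interval, and a whole subtree comes back in a single query, with no recursion:

```python
# Celko's nested sets: each node stores a (lft, rgt) interval, and a
# node's descendants are exactly the rows whose interval falls inside
# its own. Table and data are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (id INTEGER, name TEXT, lft INTEGER, rgt INTEGER)")
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?, ?)",
    [
        (1, "Electronics", 1, 8),
        (2, "Televisions", 2, 3),
        (3, "Portable",    4, 7),
        (4, "MP3 Players", 5, 6),
    ],
)

# Fetch an entire subtree in one query - no recursion, no N+1 queries.
rows = conn.execute(
    """SELECT child.name
         FROM category AS parent
         JOIN category AS child
           ON child.lft BETWEEN parent.lft AND parent.rgt
        WHERE parent.name = ?
        ORDER BY child.lft""",
    ("Electronics",),
).fetchall()
print([name for (name,) in rows])
# ['Electronics', 'Televisions', 'Portable', 'MP3 Players']
```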

2) One Level Down the Stack

Fog Creek aims to make compellingly good software because that is how you outcompete established competitors in a commercial shrinkwrap market. Oasis Digital likewise has this aim for a different reason: it is our intended niche. We aim to differentiate ourselves from “yet another outsourced dev company” by building unexpectedly good software. We don’t want customers who will be happy with the results available from the typical development firm; we want customers who are playing to win.

To meet either goal, it is sometimes necessary to work one level of abstraction lower than the task would otherwise demand. Joel’s examples were their own data grid and their own AJAX library. Some of our examples are listed above.

This kind of work, further into the details, is generally more compelling to top developers.

Caveat: Don’t do this very often. If you want to ship software anytime soon, you need to mostly use off-the-shelf libraries that already work. Don’t build a data grid, for example, unless you really need something that you can’t find in any off-the-shelf product. We don’t have our own data grid; we use (among other things) the excellent grid products from Developer Express.

3) Problem Domain is less important than other factors

Joel has observed that developers aren’t as picky about the problem domain of their project as one might think; rather, other factors are more important: great co-workers, a nice working environment, working for a boss who is a developer, etc. Top developers want to work on a high quality end product worthy of taking pride in.

4) All Projects need a lot of Grunt Work

To build a high quality product, in any problem domain, will require spending a great deal of your time on grunt work: tracking down bugs and fixing them, filling in feature “holes”, cleaning up design problems, improving GUI layouts, and the like. Top developers know this, and just get on with doing that work when needed.

Conclusion

My worry was unjustified. Our work at Oasis Digital is interesting and worthwhile, both for our customers and for our developers. To grow our team, we must focus on making it an increasingly good place to work.

A final anecdote: Joel mentioned that FogBugz 6 was feature-complete in the summer of 2006 – that means they spent around a year polishing it, fixing bugs, filling in holes, etc. That shows a phenomenal amount of dedication and discipline to create quality software.

Yet Another Python Success Story

Is it OK to use programming language X in a production enterprise application? Or are fear, uncertainty, and doubt holding you back? Public “success stories” might make it more acceptable for you to do so in your environment.

In that spirit I offer our story of a production Python deployment at an Oasis Digital customer (without names or details, to protect their privacy). There are many other success stories at Python.org.

In this project, the client and application server (the bulk of the system) are written in Delphi (which was much more popular when the project started than it is today). A major subsystem (roughly 1/3rd of the overall system) is written in Python. It consists of a set of modules that parse textual data from a large number of varied formats into a common schema, another set of modules that apply (frequently adjusted) business rules, and a third set of miscellaneous modules. These are all used in background data processing, not as part of a client application or a server handling requests from a client application. They interact with the rest of the system primarily through state stored in a database. I generally recommend against the database integration style between separate applications, but it works well in this scenario: within modules of the same application, built and maintained concurrently by the same team.
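
To illustrate the shape of that parsing subsystem – a hedged sketch with invented formats and field names, not the customer’s code – each input format gets its own small parser, and every parser normalizes records into the one common schema:

```python
# Sketch of per-format parser modules feeding one common schema.
# Formats, field names, and the schema itself are invented for
# illustration.
import csv
import io

COMMON_FIELDS = {"account", "amount", "date"}

def parse_pipe_delimited(text):
    for line in text.splitlines():
        account, amount, date = line.split("|")
        yield {"account": account, "amount": float(amount), "date": date}

def parse_csv(text):
    for row in csv.DictReader(io.StringIO(text)):
        yield {"account": row["acct"], "amount": float(row["amt"]), "date": row["dt"]}

# Supporting a new feed means writing one parser and adding it here.
PARSERS = {"pipe": parse_pipe_delimited, "csv": parse_csv}

def normalize(fmt, text):
    for record in PARSERS[fmt](text):
        assert set(record) == COMMON_FIELDS   # every parser emits the same shape
        yield record

print(list(normalize("pipe", "acct-1|12.50|2007-06-01")))
```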

The Good

We chose Python for this subsystem for a variety of reasons. First, its built-in features are well suited to the text processing task at hand. Python’s “batteries included” have generally eliminated the need to find or implement add-on text processing tools (which would have been necessary in Delphi); thus a programmer needs to know and use “just” what’s in the Python box, with few external libraries to consider.

Second, Python’s built-in features and compact syntax have, in our estimation, considerably shortened programming time compared to what it would otherwise have been. It takes relatively few lines of Python to get the job done. We have many lines of Python, and would have had many more had we used a lower-level language. (Of course lines of code are not everything; it’s possible to write dense, bad code. As a general rule, though, a language in which you can express what you need more succinctly is better.)

Third, Python’s interpreted nature keeps the edit-test cycles short, further speeding development. This development speed issue is especially important given the niche this project occupies, in which data format changes and rule changes sometimes arrive with no notice: new data arrives, and part of the application does not work until it is enhanced to handle the new data. Extensive use of automated unit and integration tests (many hundreds of test cases) effectively prevents the interpreted operation from causing trivial runtime errors (type errors, syntax errors).
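
For a flavor of those tests: something like the following (a hypothetical rule and hypothetical names), multiplied by many hundreds and run constantly, catches type and syntax errors long before the code runs against live data:

```python
# Illustrative only: a small, fast test of a hypothetical billing
# rule. Merely importing and running the module under test flushes
# out syntax errors; the assertions catch type and logic errors.
import unittest
from decimal import Decimal

def apply_late_fee(balance):                  # stand-in for a real rule
    return balance + balance * Decimal("0.05")

class LateFeeTest(unittest.TestCase):
    def test_five_percent_fee(self):
        self.assertEqual(apply_late_fee(Decimal("100")), Decimal("105.00"))

if __name__ == "__main__":
    unittest.main()
```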

Of course, some other languages with similarly compact syntax, included libraries, and high-level features would have worked equally well. At the time (5+ years ago), though, Python appeared to have the most momentum, other than Perl. Perl would also have been a good choice from a technical point of view, but it had an (unwarranted) reputation for being hard to maintain, which I didn’t want to have to overcome.

The Bad

There are downsides to our two-language approach. The first is somewhat Python specific, and really not a big deal: Python is slow. Its primitives are fast, but when you write considerable Python code to do something, it does that something at a rather leisurely pace compared to Delphi or Java or C++, using a lot more CPU along the way. The practical impact of this has been limited, because the bottleneck on this system is not the Python code but the database; still, this has been inconvenient, and has required our customer to deploy multiple machines for this subsystem where one would have been sufficient with a more efficient language/runtime. Doing so is not particularly expensive, though, and adds a measure of reliability, so we haven’t had a need to speed things up with Psyco, C modules, etc.

The other issue is more serious. It is created by the large gap in language style and features between Delphi and Python in particular, and between low-level and scripting languages in general. (Those of you unfamiliar with Delphi may be thinking this is because Delphi is some hideous VB-like toy. Wrong. Delphi is a somewhat C++-like or Java-like language, statically compiled, fast, and sadly burdened with a Pascal syntax.) Personally, this gap bothers me not at all. I’ve written production code in assembly, C, C++, Delphi, Java, Python, Ruby, JavaScript, Lua, a bit of Prolog, and others I forget right now; I am happy to use one language in the morning and another radically different one in the afternoon.

I have discovered that most developers are not like me, though. Most Delphi developers are notably uninterested in Python, and vice versa. As a result, our project team has ended up divided along the same lines as the software, with some cross-training but relatively little production development crossover between the developers working in each language. This is an obstacle to any developer taking end-to-end responsibility for features or issues that span the languages, and also an obstacle to hiring.

Python itself is not much of an obstacle to hiring: while there are far fewer Python programmers than Java (for example) programmers, there are also far fewer Python jobs than Java jobs.

The Verdict

In spite of the downsides discussed above, overall it has been a “win”, technically, to use two languages (each well suited to part of the application) in this project. More importantly, I am also confident this choice has been a win for our customer: they got a system delivered faster, and at lower cost, than they otherwise would have. They used every bit of speed we could deliver, to win business from their competitors.

However, the world has improved a lot since this decision was made; today I could probably choose a single language / toolset which meets all the needs sufficiently well, and thus avoid the downsides of the two-language solution. Alternatively, if starting today, I might build the infrastructure for all subsystems in the same base language, with hooks to use Lua or JavaScript scripting to accommodate the need for rapid runtime logic changes. It’s even possible that we would port the existing code to another language in the future – which would not make the original decision a mistake.

Fourteen Tools for a Productive Distributed Team

A geographically distributed software development team (“distributed team”, for short) is simply one where developers don’t work in close physical proximity (within a few hundred feet). In such a team, you interact mostly via electronic means.

To some readers a distributed team will sound like an obviously ridiculous idea, while to others it will sound quite normal. A great many companies are distributed nowadays, as are essentially all open source projects, including the largest and most successful such projects. (Does open source seem like a toy to you? Not “real”? Consider this: in your real project you need large amounts of money to get everyone to work together to build something. Open source projects somehow accomplish this without the carrot and stick of money. Which is more impressive?)

Distributed teams appear to be most common in small firms, but certainly aren’t rare elsewhere. At some large firms, there are teams that work at the same campus or even in the same building, yet are inconveniently spread out enough that they are borderline distributed.

At Oasis Digital we operate mostly as a geographically distributed team, or rather as a set of teams (for various projects). We are scattered around our city, our country, and in a few cases, globally. We’ve learned a lot about how to succeed this way, summarized as 14 tools for a productive distributed team:

(1) Online meeting / telephone. Effective online meeting tools are available inexpensively or sometimes free; telephone calls, including mobile, long distance, and conference calls, are free or inexpensive. Either way, don’t hesitate to spend time together in live discussion, with at least voice and ideally video. Especially in the early stages of a project, this is by far the closest match to in-person interaction of all the tools in this list. At Oasis Digital, we typically use regular weekly discussions as a baseline, with more discussions of all kinds as needed (frequently at the beginning of a project, less frequently over time).

(2) Instant Messaging. Beware that your time is valuable, though; don’t type out a long conversation – when it gets complex, talk live instead. The key value of IM is as a substitute for the awareness of who is available and working that you would get automatically in a shared office.

(3) Email is especially helpful to create a thread, a trail of key decisions. I suggest summarizing key decisions, especially those that span different teams, in a (hopefully concise) email – even if the decision was reached by meeting, phone, IM, in person, etc. As above, beware that email can consume an unlimited amount of time, and don’t hesitate to switch to live discussion if it gets complex.

(4) Airplane / Car / Train. The most powerful tool to help a distributed team work well is occasional in-person interaction.

(5) Documents. Communicate a complex idea clearly in a document, especially an idea that will need to be explained again and again. (An aside on “template” documents: I’ve found them to be of limited value, though we use them when needed to meet customer requirements.)

(6) Screen Sharing. Using Remote Desktop, VNC, and other tools, it’s possible to do “pair programming” from half a world away. In our projects we do only a little of this, but a few hours of pair programming on a tricky area can work wonders. Screen sharing is also extremely useful for demonstrating new features, reproducing bugs, working with customers, and much more.

(7) Screencasts are video/audio recordings of the computer screen and a person talking; they are very useful for explaining a feature or module to another developer. Recording then viewing a screencast is not as effective as sitting alongside someone explaining in real-time, but it has the tremendous advantage of re-playability. A series of (high quality) screencasts explaining important parts of a system will get new team members up to speed quickly.

(8) Audio recordings. Record an audio explanation of a feature or bug, while walking around or driving, on an inexpensive hand-held digital voice recorder. Then send it to another developer as is, or have it transcribed. Again, it’s not quite as effective or high-fidelity (of either the audio or the ideas) as an in-person explanation, but can be replayed, sent to multiple recipients, etc. A caveat here is that people read much faster than they listen, so if you have something long-winded (or long-lived, important) to say, have it transcribed then edit it. You may find, as I have, that it’s far easier to compose a long detailed explanation this way, than via direct typing.

(9) Screen Shots. Don’t just tell; show. Show another developer on your team what you mean by taking a screenshot, then marking it up. Unlike a screen recording, it’s easy to go through a set of screenshots and update just one of them in the middle if things change.

(10) Mockups. A distributed team leaves more room for misunderstanding of desired results. Counteract this by building a mockup (paper, Excel, etc.) of what you want.

(11) Issue / Bug Tracking. If you don’t have an issue tracker for your team, stop reading this now and go get one: Mantis, Trac, Jira – there are hundreds to choose from. (Go ahead. I’ll wait.) A common approach for collocated teams is to use a tracking system loosely, updating it occasionally. In this usage model, tracker items act mostly as high-level tokens for more detailed information exchanged in conversations. This approach can be made to work on a distributed team, but in my experience it is unwise. Instead, get every issue into your tracker and keep it up to date aggressively. Don’t let the sun go down with your issue tracker mismatching reality. Assume that others on your team will check the issue tracker, then make important decisions relying on the data therein. If they see something that was fixed 3 days ago but still marked as broken, bad decisions can easily result. Anyone on the team should be able to see the essential status of every current issue in the tracker, not through a time-consuming process of asking someone about each item.

(12) Source Control. You’re already using this, of course. For the future, consider a distributed source control tool (bzr, git, svk, Mercurial, etc.) for your distributed team – if you’ve only used centralized tools (CVS, SVN, ClearCase, etc.) you’ll be surprised how helpful it is to have the project version history available locally, among other benefits. (See Linus explain distributed source control.)

(13) Code Review. Some collocated teams enjoy social camaraderie, but squander the benefits of proximity by working in technical isolation. Your distributed team can outperform them easily: avoid technical isolation. Read each other’s code, comment on it, learn from each other.

(14) Automated Builds. Surely there is hardly anyone left by 2007 who isn’t using an automated, continuous build process on a server somewhere. The value of this is amplified in a distributed team, because the stream of successful builds (we fire off a build after each commit) helps keep everyone in sync.
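
Even a toy version of that commit-triggered loop is simple. Here is a sketch – with invented commands, where a real team would use a purpose-built CI server – that polls source control and builds each new revision:

```python
# Toy continuous-build loop: poll source control, and whenever a new
# revision appears, run the build and report. Commands are
# illustrative; substitute your own VCS and build steps.
import subprocess
import time

def head_revision():
    out = subprocess.run(["git", "rev-parse", "HEAD"],
                         capture_output=True, text=True)
    return out.stdout.strip()

last = None
while True:
    subprocess.run(["git", "pull", "--quiet"])
    rev = head_revision()
    if rev != last:
        result = subprocess.run(["make", "test"])   # the actual build step
        print(rev, "OK" if result.returncode == 0 else "BROKEN")
        last = rev
    time.sleep(60)   # check once a minute
```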

A related but separate question is where distributed team members work: in their homes, in offices outside their homes, in public spaces, etc. I believe the best answer is a blend of these; I’ve mentioned before hiding in a cafe to get some work done, and I’ll write more about the Where question in a later post.

Another related question is whether any of this is even a good idea; stay tuned to read more on this as well.

YouTube Scalability Talk

Cuong Do of YouTube / Google recently gave a Google Tech Talk on scalability.

I found it interesting in light of my own comments on YouTube’s 45 TB a while back.

Here are my notes from his talk, a mix of what he said and my commentary:

In the summer of 2006, they grew from 30 million pages per day to 100 million pages per day, in a 4 month period. (Wow! In most organizations, it takes nearly 4 months to pick out, order, install, and set up a few servers.)

YouTube uses Apache for FastCGI serving. (I wonder if things would have been easier for them had they chosen nginx, which is apparently wonderful for FastCGI and less problematic than Lighttpd.)

YouTube is coded mostly in Python. Why? “Development speed critical”.

They use Psyco, a Python-to-C compiler, and also C extensions, for performance-critical work.

They use Lighttpd for serving the video itself – a big improvement over Apache.

Each video is hosted by a “mini cluster”, a set of machines with the same content. This is a simple way to provide headroom (slack), so that a machine can be taken down for maintenance (or can fail) without affecting users. It also provides a form of backup.

The most popular videos are on a CDN (Content Distribution Network) – they use external CDNs as well as Google’s CDN. Requests to their own machines are therefore tail-heavy (in the “Long Tail” sense), because the head goes to the CDN instead.

Because of the tail-heavy load, random disk seeks are especially important (perhaps more important than caching?).

YouTube uses simple, cheap, commodity hardware. The more expensive the hardware, the more expensive everything else gets (support, etc.). Maintenance is mostly done with rsync, SSH, and other simple, common tools.

The fun is not over: Cuong showed a recent email titled “3 days of video storage left”. There is constant work to keep up with the growth.

Thumbnails turn out to be surprisingly hard to serve efficiently. Because there are, on average, 4 thumbnails per video and many thumbnails per page, the overall number of thumbnails served per second is enormous. They use a separate group of machines to serve thumbnails, with extensive caching and OS tuning specific to this load.

YouTube was bitten by a “too many files in one dir” limit: at one point they could accept no more uploads (!!) because of it. The first fix was the usual one: split the files across many directories, and switch to another file system better suited to many small files.
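
That usual fix looks something like this sketch (hash-prefix fan-out; the names and fan-out width are invented, certainly not YouTube’s actual scheme):

```python
# Fan files out across nested subdirectories keyed by a prefix of a
# hash of the filename, so no single directory grows too large.
import hashlib
import os

def sharded_path(root, filename, depth=2):
    digest = hashlib.md5(filename.encode()).hexdigest()
    parts = [digest[i * 2 : i * 2 + 2] for i in range(depth)]
    directory = os.path.join(root, *parts)       # e.g. root/ab/cd
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)

print(sharded_path("/tmp/thumbs", "video123.jpg"))
# e.g. /tmp/thumbs/32/43/video123.jpg
```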

Cuong joked about “the Windows approach of scaling: restart everything”.

Lighttpd turned out to be poor for serving the thumbnails, because its main loop is a bottleneck for loading files from disk; they addressed this by modifying Lighttpd to add worker threads to read from disk. This was good but still not good enough: with one thumbnail per file, the enormous number of files was terribly slow to work with (imagine tarring up millions of files).

Their new solution for thumbnails is to use Google’s BigTable, which provides high performance for a large number of rows, fault tolerance, caching, etc. This is a nice (and rare?) example of actual synergy in an acquisition.

YouTube uses MySQL to store metadata. Early on they hit a Linux kernel issue which prioritized the page cache higher than app data; it swapped out the app data, totally overwhelming the system. They recovered from this by removing the swap partition (while live!). This worked.

YouTube uses Memcached.

To scale out the database, they first used MySQL replication. Like everyone else who goes down this path, they eventually reached a point where replicating the writes to all the DBs uses up all the capacity of the slaves. They also hit an issue with threading and replication, which they worked around with a very clever “cache primer thread” working a second or so ahead of the replication thread, prefetching the data it would need.
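
My rough mental model of that trick, as a conceptual sketch (invented names, and certainly not their actual code):

```python
# A conceptual sketch of the "cache primer thread" idea: a second
# thread walks the same stream of replicated statements slightly
# ahead of the single-threaded applier, issuing cheap reads so the
# needed rows are already in cache when the applier reaches them.
import threading

relay_log = ["UPDATE users SET ...", "UPDATE videos SET ...", "UPDATE views SET ..."]

def prefetch_rows(stmt):
    print("warming cache for:", stmt)   # e.g. SELECT the rows stmt touches

def apply_statement(stmt):
    print("applying:", stmt)            # the real replication work

def primer():
    for stmt in relay_log:              # read-only, so it runs ahead cheaply
        prefetch_rows(stmt)

threading.Thread(target=primer).start() # primer races ahead...
for stmt in relay_log:                  # ...while the applier grinds through
    apply_statement(stmt)
```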

As the replicate-one-DB approach faltered, they resorted to various desperate measures, such as splitting video watching into a separate set of replicas, intentionally allowing the non-video-serving parts of YouTube to perform badly so as to focus on serving videos.

Their initial MySQL DB server configuration had 10 disks in a RAID 10. This does not work very well, because the DB/OS can’t take advantage of the multiple disks in parallel. They moved to a set of RAID 1s, appended together. In my experience, this is better, but still not great. An approach that usually works even better is to intentionally split different data onto different RAIDs: for example, a RAID for the OS / application, a RAID for the DB logs, one or more RAIDs for the DB tables (use “tablespaces” to get your #1 busiest table on separate spindles from your #2 busiest table), one or more RAIDs for indexes, etc. Big-iron Oracle installations sometimes take this approach to extremes; the same thing can be done with free DBs on free OSs also.

In spite of all these efforts, they reached a point where replication of one large DB was no longer able to keep up. Like everyone else, they figured out that the solution is database partitioning into “shards”. This spreads reads and writes across many different databases (on different servers), none of which has to replay all the others’ writes. The result is a large performance boost, better cache locality, etc. YouTube reduced their total DB hardware by 30% in the process.

It is important to divide users across shards by a controllable lookup mechanism, not only by a hash of the username/ID/whatever, so that you can rebalance shards incrementally.
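
A tiny sketch of that lookup-based placement (all names invented): because a user’s shard is looked up rather than computed, any single user can be migrated by copying their data and then updating one entry.

```python
# Resolve users to shards through a directory table rather than a
# bare hash, so placement is controllable and rebalancing is
# incremental. All names are invented.
SHARD_DSNS = {0: "db-shard-0", 1: "db-shard-1", 2: "db-shard-2"}
user_to_shard = {}      # in practice a small, heavily cached DB table

def assign_user(user_id):
    # Default placement can still be hash-based...
    user_to_shard.setdefault(user_id, hash(user_id) % len(SHARD_DSNS))

def rebalance_user(user_id, new_shard):
    # ...but because placement is stored, not computed, one user can
    # be moved at a time: copy the data, then flip this pointer.
    user_to_shard[user_id] = new_shard

def shard_for(user_id):
    return SHARD_DSNS[user_to_shard[user_id]]

assign_user("alice")
rebalance_user("alice", 2)
print(shard_for("alice"))   # db-shard-2
```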

An interesting DMCA issue: YouTube complies with takedown requests, but sometimes the videos are cached way out on the “edge” of the network (their caches, and other people’s caches), so it’s hard to make a video disappear globally right away. This sometimes angers content owners.

Early on, YouTube leased their hardware.

Business of Software 2007 Conference

I just registered for the Business of Software 2007 conference this fall (Oct 29 and 30) in San Jose. The organizers have put together an impressive list of speakers, and the schedule is mostly one-track: everyone in the same session. I prefer conferences like that to those with many “tracks”.

This is the first year for this conference, so I have no idea how popular it might or might not be. In the “worst” case, it’s a small crowd, which usually results in great conversations.

Update: Brian Button pointed me to this interview with Tim Lister, which includes hints about the salacious topics he’ll talk about at the conference (Tim, co-author of Peopleware, is a headline speaker at the BOS conference).

Next Big Language = JavaScript

There’s a lot of buzz about Steve Yegge’s “port” of Rails to JavaScript, and Steve has now provided (in his funny, self-deprecating style) the background of how it came to be. He doesn’t quite say it explicitly in this post, but I think it reveals that the “Next Big Language” he has been hinting at is JavaScript.

I (mostly) agree:

JavaScript is in nearly every browser, including tiny ones (like the one in my BlackBerry Pearl). It may be the single most widely available language today.

Because of the above, an enormous population of JavaScript programmers (though sometimes of dubious skill) has emerged.

Starting with Java 6 it’s “in the box” there also. To me, this makes it the likely winner, by a wide margin, for a dynamic language to be used at Java shops or inside Java projects. Being “in the box” is a powerful advantage, one which the many other contenders will have a hard time overcoming.

Adobe’s new JavaScript virtual machine implementation, which they handed over to Mozilla as “Tamarin”, sounds like it will boost JavaScript performance greatly, making it good enough for a very wide variety of projects.

JavaScript uses curly braces, like the last few Big Languages.

Like Java, C, C++, etc., JavaScript has specs and multiple competing, complete, current, high quality implementations. This, to me, is a big advantage over Ruby, Python, and other currently popular dynamic languages. Of course there is plenty of room in the industry for these languages to thrive also; I am not saying any of them will go away – we use Python with great results and expect to keep doing so.

Mark Volkmann initially thought I was nuts to predict JavaScript as a winner, but came around a few months later (and said so in a user group talk).

In a project at work, we’ve adopted JavaScript as our plugin extension language for user-customizable rules (billing rules, etc.). I’d have chosen Lua (as I did for another project), but there are at least 1000x as many JavaScript programmers out there. So far it has worked very well. If we had it to do over, we might implement far more of the project in JavaScript.

However, there are a few reasons why I only “mostly” agree:

First, with JavaScript there isn’t a good way to avoid shipping source code. Sure, you can obfuscate JavaScript with various tools, but the results remain far more amenable to readable-source recovery than a more traditionally compiled language. For open source projects this is no big deal, but there are also many worthwhile businesses and projects which depend on proprietary, not open, software (including most of our projects), and it’s not yet clear that obfuscation is sufficient protection. (Update in reply to a comment below: This matters even for server-side software, because some of us create and sell software products for other people to run on their servers.)

Second, at the moment JavaScript appears to lack a module system, without which it’s painful to build large systems. I expect an upcoming language version will address this.