Massive Parallelism and Microslices

I just read James Hamilton’s comments on “Microslice” servers, which are very low-power servers with a high ratio of CPU capacity to wattage. As he explains in detail, at scale the economics of this design are compelling. In some ways, of course, this is the opposite of another big trend going on, which is consolidation through virtualization. I reconcile these forces like so:

  1. For enterprises with a high ratio of employees-per-server-CPU, costs tend to scale with the number of boxes / racks / etc. This makes virtualization onto a few big servers a win.
  2. But for enterprises with a low ratio (lots of computing work, small team), the pure economics of the microserver approach make it the winner.

The microserver approach demands:

  • better automated system administration: you must get to essentially zero marginal sysadmin hours per box.
  • better decomposition of the computing work into parallelizable chunks
  • very low software cost per server (you’re going to run a lot of them), favoring zero-incremental-cost operating systems (Linux)

My advice to companies who make software to harness a cloud of tiny machines: find a way to price it so that your customer pays you about the same to divide their work among 1000 microservers as they would among 250 heavier servers; otherwise, if they move to microservers, they may find a reason to leave you behind.

On a personal note, I find this broadening trend toward parallelization to be a very good thing – because my firm offers services to help companies through these challenges!

YouTube Scalability Talk

Cuong Do of YouTube / Google recently gave a Google Tech Talk on scalability.

I found it interesting in light of my own comments on YouTube’s 45 TB a while back.

Here are my notes from his talk, a mix of what he said and my commentary:

In the summer of 2006, they grew from 30 million pages per day to 100 million pages per day, in a 4 month period. (Wow! In most organizations, it takes nearly 4 months to pick out, order, install, and set up a few servers.)

YouTube uses Apache for FastCGI serving. (I wonder if things would have been easier for them had they chosen nginx, which is apparently wonderful for FastCGI and less problematic than Lighttpd)

YouTube is coded mostly in Python. Why? “Development speed critical”.

They use psyco, a Python-to-C compiler, and also C extensions, for performance-critical work.
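For context, enabling psyco was typically a couple of lines at program startup; this is a generic sketch of psyco’s documented usage, not YouTube’s actual code:

```python
# Generic sketch of enabling psyco (not YouTube's code); render_video_page is
# a hypothetical hot function.
try:
    import psyco
    psyco.full()                      # JIT-specialize every function it can
    # psyco.bind(render_video_page)   # or bind only selected hot spots
except ImportError:
    pass                              # fall back to plain CPython if psyco is absent
```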

They use Lighttpd for serving the video itself, for a big improvement over Apache.

Each video is hosted by a “mini cluster”, which is a set of machines with the same content. This is a simple way to provide headroom (slack), so that a machine can be taken down for maintenance (or can fail) without affecting users. It also provides a form of backup.

The most popular videos are on a CDN (Content Distribution Network) – they use external CDNs as well as Google’s CDN. Requests to their own machines are therefore tail-heavy (in the “Long Tail” sense), because the head goes to the CDN instead.

Because of the tail-heavy load, random disk seeks are especially important (perhaps more important than caching?).

YouTube uses simple, cheap, commodity hardware. The more expensive the hardware, the more expensive everything else gets (support, etc.). Maintenance is mostly done with rsync, SSH, and other simple, common tools.
The fun is not over: Cuong showed a recent email titled “3 days of video storage left”. There is constant work to keep up with the growth.

Thumbnails turn out to be surprisingly hard to serve efficiently. Because there are, on average, 4 thumbnails per video and many thumbnails per page, the overall number of thumbnails per second is enormous. They use a separate group of machines to serve thumbnails, with extensive caching and OS tuning specific to this load.

YouTube was bitten by a “too many files in one dir” limit: at one point they could accept no more uploads (!!) because of it. The first fix was the usual one: split the files across many directories, and switch to another file system better suited to many small files.
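That first fix is easy to sketch. Something along these lines (hypothetical paths and names), hashing each video ID into a two-level directory tree so that no single directory grows huge:

```python
# Hypothetical sketch of the "split files across many directories" fix.
import hashlib
import os

ROOT = "/var/media/uploads"                    # assumed storage root

def path_for(video_id):
    digest = hashlib.md5(video_id.encode()).hexdigest()
    subdir = os.path.join(ROOT, digest[:2], digest[2:4])   # e.g. .../a3/f9/
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, video_id + ".flv")
```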

Cuong joked about “The Windows approach of scaling: restart everything”

Lighttpd turned out to be poor for serving the thumbnails, because its main loop is a bottleneck to load files from disk; they addressed this by modifying Lighttpd to add worker threads to read from disk. This was good but still not good enough, with one thumbnail per file, because the enormous number of files was terribly slow to work with (imagine tarring up many million files).

Their new solution for thumbnails is to use Google’s BigTable, which provides high performance for a large number of rows, fault tolerance, caching, etc. This is a nice (and rare?) example of actual synergy in an acquisition.

YouTube uses MySQL to store metadata. Early on they hit a Linux kernel issue which prioritized the page cache higher than application data; it swapped out the app data, totally overwhelming the system. They recovered from this by removing the swap partition (while live!). This worked.

YouTube uses Memcached.

To scale out the database, they first used MySQL replication. Like everyone else that goes down this path, they eventually reached a point where replicating the writes to all the DBs used up all the capacity of the slaves. They also hit an issue with threading and replication, which they worked around with a very clever “cache primer thread” working a second or so ahead of the replication thread, prefetching the data it would need.
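The cache primer idea is worth a sketch. He didn’t show their implementation, so the following is purely illustrative: a thread reads the writes that replication will apply shortly and issues cheap SELECTs for the same rows, so the pages are already in memory when the single replication thread gets there.

```python
# Purely illustrative "cache primer" sketch; the queue feed and the connection
# factory are assumptions, not YouTube's implementation.
import queue
import threading

upcoming_writes = queue.Queue()   # (table, row_id) pairs parsed from the relay log (not shown)

def cache_primer(connect):
    """Run a second or so ahead of the replication SQL thread, issuing cheap
    reads so the rows each upcoming write touches are already cached."""
    db = connect()                # hypothetical MySQL connection factory
    while True:
        table, row_id = upcoming_writes.get()
        cur = db.cursor()
        cur.execute("SELECT * FROM " + table + " WHERE id = %s", (row_id,))
        cur.fetchall()            # discard the rows; the point is warming the cache

# threading.Thread(target=cache_primer, args=(connect_to_replica,), daemon=True).start()
```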

As the replicate-one-DB approach faltered, they resorted to various desperate measures, such as splitting the video watching into a separate set of replicas, intentionally allowing the non-video-serving parts of YouTube to perform badly so as to focus on serving videos.

Their initial MySQL DB server configuration had 10 disks in a RAID10. This does not work very well, because the DB/OS can’t take advantage of the multiple disks in parallel. They moved to a set of RAID1s, appended together. In my experience, this is better, but still not great. An approach that usually works even better is to intentionally split different data onto different RAIDs: for example, a RAID for the OS / application, a RAID for the DB logs, one or more RAIDs for the DB tables (use “tablespaces” to get your #1 busiest table on separate spindles from your #2 busiest table), one or more RAIDs for indexes, etc. Big-iron Oracle installations sometimes take this approach to extremes; the same thing can be done with free DBs on free OSs also.

In spite of all these efforts, they reached a point where replication of one large DB was no longer able to keep up. Like everyone else, they figured out that the solution was to partition the database into “shards”. This spread reads and writes across many different databases (on different servers) that are not all running each other’s writes. The result is a large performance boost, better cache locality, etc. YouTube reduced their total DB hardware by 30% in the process.

It is important to divide users across shards by a controllable lookup mechanism, not only by a hash of the username/ID/whatever, so that you can rebalance shards incrementally.
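A minimal sketch of what such a lookup can look like (all names here are hypothetical): a small directory table maps each user to a shard, so individual users can be moved to rebalance, which a bare hash(user_id) % N cannot do.

```python
# Hypothetical shard-directory sketch: a controllable lookup, not a bare hash.
SHARD_DSN = {
    0: "mysql://db0.internal/yt",
    1: "mysql://db1.internal/yt",
}

user_to_shard = {}                 # in practice a small, heavily cached lookup table

def shard_for(user_id):
    shard = user_to_shard.get(user_id)
    if shard is None:
        shard = user_id % len(SHARD_DSN)    # default placement for new users...
        user_to_shard[user_id] = shard      # ...recorded so it can be changed later
    return SHARD_DSN[shard]

def move_user(user_id, new_shard):
    # Copy the user's rows to new_shard first (not shown), then flip the pointer.
    user_to_shard[user_id] = new_shard
```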

An interesting DMCA issue: YouTube complies with takedown requests, but sometimes the videos are cached way out on the “edge” of the network (their caches, and other people’s caches), so it’s hard to make a video disappear globally right away. This sometimes angers content owners.

Early on, YouTube leased their hardware.

A/B Technique for Web Application Deployment

This description of my “A/B technique for web application deployment” was transcribed from audio, so it is less tight and more verbose than my normal prose. I chose to post it in rough form, rather than leave it on the “back burner” until an unknown future date when I have time to rewrite it. I first explained this to a colleague around 1999; 8 years is long enough for an idea to wait.

The Problem / Context

At least a dozen times over the last decade, this scenario has come up at consulting client sites: you have a web application and you want to upgrade it with a new version. You could do so with a brute force cutover (stop the app, swap the code), but that’s not the scenario I’m talking about. I’m talking about upgrading to a new version, not quite compatible with the old one, without dropping current active users. For example, in the new version you might have some different data that goes in the session. You might have some different pages, so you have some different URLs. You may be adding a new field, so that once you put in this new version there’ll be an additional field, and it stores that additional field in the session and in the database and the caching and so on. Yet you want the current users to keep working without interruption.

Solution

This technique is not language-specific – it applies equally in PHP, ASP, ASP.NET, Java servlets, JSP, CGI, etc.; with nearly any application or infrastructure.

Have the URL of your application, which I’ll call “/contact” here, be the URL of a proxy application (or “launching pad”). Then have two additional URLs for two specific instances of the actual application. For example, you might have “/contact” as your overall application URL and then “/contact/contactA” or “/contact/A” as one instance where you have that application installed.

At the “/contact” URL, install a simple proxy application, whose job is to take a newly arriving user, present them an intro/login screen, then redirect them to one of the specific instances of the application.
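Here is a minimal sketch of such a launching pad, using Flask purely as an example; the file name, paths, and login handling are placeholders, not a prescription:

```python
# Minimal launchpad sketch (Flask used only as an example; any web stack works).
# The file "current_instance.txt" holds a single letter, "A" or "B"; editing it
# is the "flip the switch" step described below.
from flask import Flask, redirect

app = Flask(__name__)

def current_instance():
    with open("current_instance.txt") as f:   # hypothetical one-line config file
        return f.read().strip()               # "A" or "B"

@app.route("/contact/")
def intro():
    return '<form method="post" action="/contact/login">... login form ...</form>'

@app.route("/contact/login", methods=["POST"])
def login():
    # Authenticate here (omitted), then hand the user off to the current instance.
    return redirect("/contact/" + current_instance() + "/")
```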

As a user, I point my browser to the “/contact” URL, and I bookmark that. I launch that, I see a login screen, I type in my name and password, and I press the button to log in. The “/contact” launching pad redirects me to “/contact/A”, an instance of the real application. I’ll call the second instance “B”, perhaps at the URL “/contact/B”. In the normal steady state of the system, the user will be using A: they log in to “/contact” and they end up in the “A” instance.

Then you want to do an upgrade and install a new version. Install the new version as “/contact/B”. Leave the existing application in place and working. (By the way, I’ve assumed you are using a technology where you can deploy and undeploy applications without bringing your web server down, but with mod_proxy you could make this work even without that capability.) Deploy the new application version in a new and different path than what your current running users are on. Adjust a setting in your proxy/launchpad application (perhaps as simple as a single line in a single config file). So, for example, you might install the new version as “/contact/B” and then flip a switch (edit the config file) so that new users who come to the “/contact” page don’t land in “/contact/A” anymore; they land in “/contact/B” as they log in.

The current users already using the application in “/contact/A” stay there – they don’t know or care that you’ve deployed a new version. New users come in to the new version. So you want to have some sort of mechanism (likely provided by your application server if you’re using one, and not hard to build otherwise) for monitoring how many users are using each of these applications. You might, for example, notice that you have 1000 users active on /contact/A. You deploy a new version as /contact/B and flip the switch. Then, depending on the usage characteristics of your application – over the next few minutes, next few hours, however it works out – the users, as they log out and log in and such, gradually all make it into the /B application. Some kind of maximum-login-time mechanism will ensure that this cutover happens in finite time.
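One hypothetical way to do that counting, assuming each instance can report its own active-session count through a small status page (every name and URL below is made up):

```python
# Hypothetical monitoring sketch: poll a status page on each instance to see
# when the old side has drained.
import json
import urllib.request

def active_users(instance):
    url = "http://localhost/contact/%s/status" % instance   # assumed JSON status page
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["active_sessions"]

if __name__ == "__main__":
    print("A:", active_users("A"), "B:", active_users("B"))
```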

Once the users have moved to /contact/B, you then declare it as your “current” version, and you take down /A, because no one’s using it and no one can get into it. So the next time you need to do an upgrade, you just do it in reverse – you deploy the new version as /contact/A, flip the switch back to make all new users’ logins land in A… and again, after however many hours or minutes or whatever, you have all your users on the new version, and you can take down the old one.

You can easily implement this with just the tools that already come with your Web application development system. You don’t need any kind of special hardware or special application server or HTTP server support. You don’t need any sort of special way of doing session affinity; you’re doing session affinity by simply handing out the URL of one of these other Web applications.

Bookmarks

Someone might bookmark a page of your application. So let’s say that you had directed them to /contact/A, and they were on the page /contact/A/lists.jsp. When they return to this bookmark later (you do want to support bookmarking, right?), you don’t want them to land there; you don’t want them to end up in the A application if you’re currently using the B application. This is actually pretty easy to handle also. You simply use some settings on your Web server to do a redirect, with a few lines of configuration in .htaccess or an analogous mechanism. Based on your setting of which instance is current, you make it so that if someone comes into the application without a referrer from inside the application, you just redirect them over to whatever the current instance is. That takes a little bit of thought, but only a little, and it seamlessly solves the problem of users’ bookmarks working in spite of you switching back and forth between two instances.
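A web-server rewrite rule is the usual place for that redirect, but the same check is only a few lines in application code. A hypothetical WSGI sketch, reusing current_instance() from the launchpad sketch above and wrapped around each instance:

```python
# Hypothetical WSGI middleware around an instance: if this instance is no longer
# "current" and the request did not come from inside it (a stale bookmark, say),
# bounce the visitor to the same page on the current instance.
def bookmark_fixup(instance_app, my_name):            # my_name is "A" or "B"
    def middleware(environ, start_response):
        live = current_instance()
        referrer = environ.get("HTTP_REFERER", "")
        if my_name != live and ("/contact/" + my_name + "/") not in referrer:
            target = "/contact/" + live + environ.get("PATH_INFO", "/")
            start_response("302 Found", [("Location", target)])
            return [b""]
        return instance_app(environ, start_response)
    return middleware
```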

Clustering

You might be deployed on a cluster. Perhaps you are using Websphere with 37 web servers. It turns out that this A/B approach works orthogonally to the clustering features of your Web application server. You could have the A application deployed across all 37 servers; you could deploy the B application, with a few clicks, across all 37 servers; you could flip that switch in some global way to kick people onto the B, and so on.

Override the launchpad for testing

You can permit users to enter a special URL to get to the “other side”. You could have some way of entering a URL that takes you past that launch application straight into the B side, so that you can click around and manually verify that the new B application works in the production environment before you flip the switch to make it the deployed production system. This is a very wise and useful type of testing to do, and a great final stage of testing, because the new code is actually in production. It’s obviously not a replacement for testing in a separate test environment; it’s an adjunct for even greater safety in deployment.
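In the launchpad sketch above, that override can be a single extra line in the login handler; for example (hypothetical parameter name):

```python
# Hypothetical variant of the launchpad's login handler: an explicit
# ?force=B query parameter sends a tester to the not-yet-current instance.
from flask import request

@app.route("/contact/login", methods=["POST"])
def login():
    target = request.args.get("force") or current_instance()
    return redirect("/contact/" + target + "/")
```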

Performance

When a system is running in a steady state, its caches are fully populated with relevant data, so many requests can be answered with data from the cache (RAM). But when a system is freshly started, its caches are empty, so more requests require (slower) disk access, during the first few minutes of operations. This is sometimes called the “empty cache” problem, and is responsible for the poor performance sometimes seen in the first few minutes after a busy system is restarted.

The technique described here prevents this problem, because with it you avoid ever shutting down and restarting your whole Web application with your full user population on it. Instead, since the switch only brings newly-logging-in users to the new version – the new instance – you gradually have people start using it, so you never take a big hit all at once in terms of cache population.

Schema Changes

Hani asked, in a comment, about schema changes. A simple answer is that you won’t be able to make a transition like the one described here (where both the old and new code versions run in parallel for a while) if you make schema changes such that the old code no longer works. A more complex answer, which I have used with great results, is that this is a programmable computer and you are a programmer – with some effort, you can make the software tolerate both the old and new schemas. So the process works like this (a small code sketch follows the list):

  1. Decide on the schema change, but don’t deploy it
  2. Modify your software to tolerate the old or new schema, whichever is present
  3. Deploy the new software, transition all users to it (as described above)
  4. Make the schema change; you may need to momentarily quiesce the software, but hopefully not kill user sessions
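Here is a tiny, self-contained illustration of step 2, using sqlite3 only to keep it runnable; the table and column names are invented. The code inspects which columns exist and works against either schema:

```python
# Illustration of "tolerate the old or new schema": the new schema adds a
# contacts.mobile_phone column, and the code checks for it instead of assuming it.
import sqlite3

def fetch_contact(db, contact_id):
    cols = {row[1] for row in db.execute("PRAGMA table_info(contacts)")}
    has_mobile = "mobile_phone" in cols
    fields = "id, name" + (", mobile_phone" if has_mobile else "")
    row = db.execute("SELECT %s FROM contacts WHERE id = ?" % fields, (contact_id,)).fetchone()
    contact = {"id": row[0], "name": row[1]}
    contact["mobile_phone"] = row[2] if has_mobile else None   # sensible default on old schema
    return contact

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO contacts (name) VALUES ('Alice')")
print(fetch_contact(db, 1))                     # works against the old schema

db.execute("ALTER TABLE contacts ADD COLUMN mobile_phone TEXT")
db.execute("UPDATE contacts SET mobile_phone = '555-0100' WHERE id = 1")
print(fetch_contact(db, 1))                     # and, unchanged, against the new one
```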

(There are a few tools out there to help with the schema-change-in-a-live-app problem. One of them is ChronicDB, who wrote me to point this out.)

Of course this is a lot more work than just stopping the server, making your change, and restarting. Whether it’s worth it depends on your situation. If you have an overnight non-usage window, consider using it instead of the long path described here.

I hope this is helpful for someone out there. Comments are welcome.