July 2007 – Kyle Cordes

YouTube Scalability Talk

Cuong Do of YouTube / Google recently gave a Google Tech Talk on scalability.

I found it interesting in light of my own comments on YouTube’s 45 TB a while back.

Here are my notes from his talk, a mix of what he said and my commentary:

In the summer of 2006, they grew from 30 million pages per day to 100 million pages per day, in a 4 month period. (Wow! In most organizations, it takes nearly 4 months to pick out, order, install, and set up a few servers.)

YouTube uses Apache for FastCGI serving. (I wonder if things would have been easier for them had they chosen nginx, which is apparently wonderful for FastCGI and less problematic than Lighttpd)

YouTube is coded mostly in Python. Why? “Development speed critical”.

They use psyco, Python -> C compiler, and also C extensions, for performance critical work.

They use Lighttpd for serving the video itself, for a big improvement over Apache.

Each video hosted by a “mini cluster”, which is a set of machine with the same content. This is a simple way to provide headroom (slack), so that a machine can be taken down for maintenance (or can fail) without affecting users. It also provides a form of backup.

The most popular videos are on a CDN (Content Distribution Network) – they use external CDNs and well as Google’s CDN. Requests to their own machines are therefore tail-heavy (in the “Long Tail” sense), because the head codes to the CDN instead.

Because of the tail-heavy load, random disks seeks are especially important (perhaps more important than caching?).

YouTube uses simple, cheap, commodity Hardware. The more expensive the hardware, the more expensive everything else gets (support, etc.). Maintenance is mostly done with rsync, SSH, other simple, common tools.
The fun is not over: Cuong showed a recent email titled “3 days of video storage left”. There is constant work to keep up with the growth.

Thumbnails turn out to be surprisingly hard to serve efficiently. Because there, on average, 4 thumbnails per video and many thumbnails per pages, the overall number of thumbnails per second is enormous. They use a separate group of machines to serve thumbnails, with extensive caching and OS tuning specific to this load.

YouTube was bit by a “too many files in one dir” limit: at one point they could accept no more uploads (!!) because of this. The first fix was the usual one: split the files across many directories, and switch to another file system better suited for many small files.

Cuong joked about “The Windows approach of scaling: restart everything”

Lighttpd turned out to be poor for serving the thumbnails, because its main loop is a bottleneck to load files from disk; they addressed this by modifying Lighttpd to add worker threads to read from disk. This was good but still not good enough, with one thumbnail per file, because the enormous number of files was terribly slow to work with (imagine tarring up many million files).

Their new solution for thumbnails is to use Google’s BigTable, which provides high performance for a large number of rows, fault tolerance, caching, etc. This is a nice (and rare?) example of actual synergy in an acquisition.

YouTube uses MySQL to store metadata. Early on they hit a Linux kernel issue which prioritized the page cache higher than app data, it swapped out the app data, totally overwhelming the system. They recovered from this by removing the swap partition (while live!). This worked.

YouTube uses Memcached.

To scale out the database, they first used MySQL replication. Like everyone else that goes down this path, they eventually reach a point where replicating the writes to all the DBs, uses up all the capacity of the slaves. They also hit a issue with threading and replication, which they worked around with a very clever “cache primer thread” working a second or so ahead of the replication thread, prefetching the data it would need.

As the replicate-one-DB approach faltered, they resorted to various desperate measures, such as splitting the video watching in to a separate set of replicas, intentionally allowing the non-video-serving parts of YouTube to perform badly so as to focus on serving videos.

Their initial MySQL DB server configuration had 10 disks in a RAID10. This does not work very well, because the DB/OS can’t take advantage of the multiple disks in parallel. They moved to a set of RAID1s, appended together. In my experience, this is better, but still not great. An approach that usually works even better is to intentionally split different data on to different RAIDs: for example, a RAID for the OS / application, a RAID for the DB logs, one or more RAIDs for the DB table (uses “tablespaces” to get your #1 busiest table on separate spindles from your #2 busiest table), one or more RAID for index, etc. Big-iron Oracle installation sometimes take this approach to extremes; the same thing can be done with free DBs on free OSs also.

In spite of all these effort, they reached a point where replication of one large DB was no longer able to keep up. Like everyone else, they figured out that the solution database partitioning in to “shards”. This spread reads and writes in to many different databases (on different servers) that are not all running each other’s writes. The result is a large performance boost, better cache locality, etc. YouTube reduced their total DB hardware by 30% in the process.

It is important to divide users across shards by a controllable lookup mechanism, not only by a hash of the username/ID/whatever, so that you can rebalance shards incrementally.

An interesting DMCA issue: YouTube complies with takedown requests; but sometimes the videos are cached way out on the “edge” of the network (their caches, and other people’s caches), so its hard to get a video to disappear globally right away. This sometimes angers content owners.

Early on, YouTube leased their hardware.

Business of Software 2007 Conference

I just registered for the Business of Software 2007 conference this fall (Oct 29 and 30) in San Jose. The organizers have put together an impressive list of speakers, and the schedule is mostly one-track: everyone in the same session. I prefer conference like that, to those with many “tracks”.

This is the first year for this conference, so I have no idea how popular it might or might not be. At “worst” case, it’s a small crowd, which usually results in great conversations.

Update: Brian Button pointed me to this interview with Tim Lister, which includes hints about the salacious topics he’ll talk about at the conference (Tim, co-author of Peopleware, is a headline speaker at the BOS conference).

Pipe RGB data to ffmpeg

A while back I asked on the ffmpeg mailing list how to pipe RGB data in to ffmpeg. I described it as follows:

in my code I am building video frames, 720x480x24bit. I have in mind generating a large number of these, as long as a full DVD worth at 30fps, then using ffmpeg (followed by dvdauthor) to encode them in to MPEG2 for DVD usage.

There were a few replies, but no definitive answer. With considerable experimentation, I got it to work. It turns out that (as far as I can tell) ffmpeg does not have the ability to accept piped in RGB frames. It will however accept piped in data in its “yuv4mpegpipe” format. With some searching and reading I found that this is roughly akin to the format of raw DV video; each frame consists of a header something like this:

YUV4MPEG2 W%d H%d F%d:%d Ip A0:0 C420mpeg2 XYSCSS=420MPEG2

… then an LF character, then data for the the Y, U, and V “planes”. The Y data is full resolution, while the U and Y are half-resolution (this is called “420” in the video world). These planes are uncompressed, one byte per pixel. All of my past work with computer video (going back to Commodore 64s and Apple IIs) has arranged all of the bits for each pixel within a few bytes of each other; this format (with all the Y data for the whole frame, then all the U data, then all the V data) is starkly different.

The essential problem remaining was how to convert RGB to YUV. Happily there are plenty of online references for this. Unhappily there are few fast implementations, and a naive implementation will be very slow. I solved this problem by finding and hiring an expert in low-level data processing with MMX, SSE2, etc. instructions. (I am not in a position to publish that code here.)

In retrospect, though, there are routines included in Intel’s “Integrated Performance Primitives” library which perform this transformation in a highly optimized way. IPP is a bargain: for only a few hundred dollars you get a wealth of high optimized ready-to-use library routines for signal processing.

The ffmpeg piping solution consists, therefore, of:

A module which generated frames in RGB format, to contain whatever contents your application requires.
A module to very quickly convert these to YUV in yuv4mpegpipe format (write your own, or use routines in IPP, for the RGB->YUV420 part).
Pipe this data stream to ffmpeg with stdin; ffmpeg is invoked something like this: ffmpeg -y -f yuv4mpegpipe -i – -i audio.mp3 -target ntsc-dvd -aspect 4:3 foo.mpg

By using a multicore CPU and threads, this whole process can be made to happen in real time or better (i.e., one second of “wall clock” processing time, for one second of finished MPEG2 video). The resulting MPEG2 file can be used with a DVD authoring application to produce a ready-to-burn DVD ISO image.

Update: the data format above is published here as part of the mjpegtools man pages.

Why I do not use RAR

We recently adopted a policy (ooh, so official sounding…) at work that we do not use the RAR file format. Oasis Digital being a small firm, the “we” to make this particular decision was just me. Someone quite reasonably asked, why not?

Here is why I don’t use RAR:

The RAR archiver GUI tool is Windows-only, though there are command line tools available for other platforms. The other major choices are fully multi-platform, with command line, GUI, etc. all available. Most of our work is on Windows, but I don’t see a reason to choose a Windows-only tool when others are available.
I get the impression that the vast majority of people who use RAR use the archiver “trialware” permanently without ever paying for it. I earn a living by writing software, so I don’t want to support the notion of using commercial software without paying for it. The same could be said of WinZIP – but I don’t use WinZIP and there are plenty of other ZIP tools.
RAR is a proprietary format, while the other choices (like plain old ZIP) are open and supported out of the box by countless tools, built in to OSs, etc.
RAR’s compression is sometimes a lot better than ZIP; but for the cases where the extra compression matters, something like 7zip’s 7z format offers similar compression but with an open format and free (not trialware) tool. (7zip can also unarchive RAR files, for those cases when a RAR comes in with something I need.)
The world simply does not need another proprietary archive format, so I will my part by not supporting the creation or distribution thereof.