Help! My Hierarchy is Slow – Faster Hierarchies with Nested Sets

A great many applications, including many that I’ve worked on, have a hierarchy of things: of parts, of people, of organizations, etc. The way most of us represent such hierarchy is with the first thing that generally comes to mind: make each Widget have a parent Widget, with a table like so:

create table widget (widget_id int, parent_widget_id int, other_fields_here);

This representation is called an “adjacency list”, and is simple and easy. You can readily build a tool to manipulate a hierarchy stored this way. Many off the shelf visual components, for both client-side and web applications, know how to manipulate hierarchies represented this way. Some reporting tools know how to report on hierarchies represented this way.

However, for answering common questions like “who all is under person X in the hierarchy”, the adjacency list approach is unwieldy and slow.

There are various other approaches to representing a hierarchy, most of them discussed in detail in Joe Celko’s articles and books, prominently in the book Trees and Hierarchies in SQL for Smarties. If you work with SQL and hierarchies, buy this book now.

One approach Celko is especially fond of is the “nested set” representation. You can read about it online here and here.

Of course, changing an entire application to use nested sets might be a very big deal in a mature application. That’s OK; in most cases we can get much of the benefit by building a nested-sets “cache” of the share of the hierarchy, with a table like so:

create table widget_hier_cache (widget_id int, left int, right int);

Each time the hierarchy has changed, or before each time we need to run complex queries, delete the rows in this cache table and repopulate them based on the current canonical adjacency data. Celko offers SQL code in his book to do that, which could be translated to work in the stored procedure language of the DBMS at hand. But what about DBMSs that don’t offer stored procedures, such as lightweight local databases (SQLite), MySQL, etc.? The translation must be done in application code instead.

I wrote such code in Delphi a while back, in the process of getting a full understanding of this problem; I’ve cleaned it up and now offer code for download here (DelphiAdjacencyNestedSets.zip), under an open source license (MIT license – use it all you want). I tested this today with Delphi 2007 Win32, but it should work fine at least back to Delphi 7. As far as I can tell with some searching, this is the only Delphi code for translating adjacency to nested sets available on the internet. This code doesn’t know about databases – it is a module to which you feed adjacency data, and from which you get back nested set data. It includes DUnit test cases.

I’ve also put the code on github, for easy browsing and forking.

(Update in August 2007: a new version (DelphiAdjacencyNestedSets2.zip) optionally tolerates “orphan” nodes and forests. Update in January 2008: a newer version (DelphiAdjacencyNestedSets3.zip) propagates an integer value down the hierarchy; it is named Tag as a nod to the .Tag property on VCL components. Both of these newer versions were tested with Delphi 2006/2007 also.)

The essence of the translation is a depth-first traversal of the hierarchy, and of course this can be easily implemented in other languages; the Delphi code is easy to understand, so don’t be afraid to take a look even if you need some Java or C# or PHP etc. I also stumbled across this PHP nested sets implementation, which offers a set of functions to maintain (insert, update, etc.) a hierarchy stored as nested sets, rather than only translate from adjacency to nested sets.

Another useful way to represent a hierarchy for fast querying is with a transitive closure table. I’ll write this up in a future post; it turns out to be especially useful (and necessary) to make arbitrary hierarchies work in the Mondrian OLAP server.

Pipe RGB data to ffmpeg

A while back I asked on the ffmpeg mailing list how to pipe RGB data in to ffmpeg. I described it as follows:

in my code I am building video frames, 720x480x24bit. I have in mind generating a large number of these, as long as a full DVD worth at 30fps, then using ffmpeg (followed by dvdauthor) to encode them in to MPEG2 for DVD usage.

There were a few replies, but no definitive answer. With considerable experimentation, I got it to work. It turns out that (as far as I can tell) ffmpeg does not have the ability to accept piped in RGB frames. It will however accept piped in data in its “yuv4mpegpipe” format. With some searching and reading I found that this is roughly akin to the format of raw DV video; each frame consists of a header something like this:

YUV4MPEG2 W%d H%d F%d:%d Ip A0:0 C420mpeg2 XYSCSS=420MPEG2

… then an LF character, then data for the the Y, U, and V “planes”. The Y data is full resolution, while the U and Y are half-resolution (this is called “420” in the video world). These planes are uncompressed, one byte per pixel. All of my past work with computer video (going back to Commodore 64s and Apple IIs) has arranged all of the bits for each pixel within a few bytes of each other; this format (with all the Y data for the whole frame, then all the U data, then all the V data) is starkly different.

The essential problem remaining was how to convert RGB to YUV. Happily there are plenty of online references for this. Unhappily there are few fast implementations, and a naive implementation will be very slow. I solved this problem by finding and hiring an expert in low-level data processing with MMX, SSE2, etc. instructions. (I am not in a position to publish that code here.)

In retrospect, though, there are routines included in Intel’s “Integrated Performance Primitives” library which perform this transformation in a highly optimized way. IPP is a bargain: for only a few hundred dollars you get a wealth of high optimized ready-to-use library routines for signal processing.

The ffmpeg piping solution consists, therefore, of:

  1. A module which generated frames in RGB format, to contain whatever contents your application requires.
  2. A module to very quickly convert these to YUV in yuv4mpegpipe format (write your own, or use routines in IPP, for the RGB->YUV420 part).
  3. Pipe this data stream to ffmpeg with stdin; ffmpeg is invoked something like this: ffmpeg -y -f yuv4mpegpipe -i – -i audio.mp3 -target ntsc-dvd -aspect 4:3 foo.mpg

By using a multicore CPU and threads, this whole process can be made to happen in real time or better (i.e., one second of “wall clock” processing time, for one second of finished MPEG2 video). The resulting MPEG2 file can be used with a DVD authoring application to produce a ready-to-burn DVD ISO image.

Update: the data format above is published here as part of the mjpegtools man pages.

Why I do not use RAR

We recently adopted a policy (ooh, so official sounding…) at work that we do not use the RAR file format. Oasis Digital being a small firm, the “we” to make this particular decision was just me. Someone quite reasonably asked, why not?

Here is why I don’t use RAR:

  1. The RAR archiver GUI tool is Windows-only, though there are command line tools available for other platforms. The other major choices are fully multi-platform, with command line, GUI, etc. all available. Most of our work is on Windows, but I don’t see a reason to choose a Windows-only tool when others are available.
  2. I get the impression that the vast majority of people who use RAR use the archiver “trialware” permanently without ever paying for it. I earn a living by writing software, so I don’t want to support the notion of using commercial software without paying for it. The same could be said of WinZIP – but I don’t use WinZIP and there are plenty of other ZIP tools.
  3. RAR is a proprietary format, while the other choices (like plain old ZIP) are open and supported out of the box by countless tools, built in to OSs, etc.
  4. RAR’s compression is sometimes a lot better than ZIP; but for the cases where the extra compression matters, something like 7zip’s 7z format offers similar compression but with an open format and free (not trialware) tool. (7zip can also unarchive RAR files, for those cases when a RAR comes in with something I need.)
  5. The world simply does not need another proprietary archive format, so I will my part by not supporting the creation or distribution thereof.

svnmerge, a tool to manage SVN merges

We use SVN on a project with a lot of small branches, i.e. a branch for almost every non-trivial feature. This is not a particularly pleasant want to use SVN, but it meets another important need for our project: code review on the way in to the trunk (as a “gate”), rather than code review for code already in trunk (“drive-by” code review).

Today on the XPSTL mailing list, Mike Jorgensen pointed me to svnmerge:

“svnmerge.py is a tool for automatic branch management. It allows branch maintainers to merge changes from and to their branch very easily, and automatically records which changes were already merged. This allows displaying an always updated list of changes yet to be merged, and totally prevents merge mistakes (such as merging the same change twice).”

svnmerge looks rough, but should still be a big improvement of SVN alone. It’s a ways short of what’s in the box with git, bzr, etc., but is also a much smaller step for a team using SVN.

Speaking of merging, Mark Shuttlework recently argued merging is the key to software developer collaboration. To me, this is obviously true, and not only for open-source projects, but for closed-source projects also.  If it sounds untrue to you, then you and I are probably thinking of different meanings of “merge”.

synsync: another way to remotely backup / svnadmin dump an SVN repository

Last month, I described an approach using SVK to remotely clone and then “svnadmin dump” an SVN repository. It turns out that there is an easier way “in the box” in SVN 1.4: the svnsync tool. Bob Ippolito describes how to do it, here are the minimal steps:

$ MYREPO=/home/me/someplace    (do not use ~username, use /home/username)
$ svnadmin create $MYREPO
$ echo “#!/bin/sh” >$MYREPO/hooks/pre-revprop-change
$ chmod +x $MYREPO/hooks/pre-revprop-change
$ svnsync init file://$MYREPO http://SVN.remote.url.goes.here/
$ svnsync sync file://$MYREPO

… then repeat the “svnsync sync” to sync down the changes since last time, perhaps in a cron job.

Paul Querna describes some caveats, and Cheah Chu Yeow offers a longer walkthrough of a more “produciton-grade” svnsync setup.

Like the SVK approach, with synsync you get a local mirror of a remote repo, with no need for shell access to that repo – very helpful if the repo is hosted by, for example, a SVN hosting firm, SourceForge, or Google Code. This local mirror can be used to view project history without network access, to svnadmin dump for backups, to mirror efficiently in another source control tool (I’m doing this with git as a test now, for one of our projects), etc.

Java Documentation in a Windows HTMLHelp (CHM) file

The Java software development kit documentation can be downloaded from java.sun.com, in the form of a ZIP file with tens of thousands of linked HTML file. This seems like an ideal canonical form for this documentation, but it is inconvenient to use, and inconvenient to have so many files sitting around.

If your development machine runs Windows, a Windows HTML Help is more convenient; it consists of a handful of files (mostly one large CHM file) and offers good search and browse features. It can be unpacked or moved around far more quickly. Franck Allimant processes the HTML from Sun for each new Java release, and makes the CHMs etc. available on his web site. Recommended.

Allimant has been doing this for quite a few years now; I suspect it was unauthorized at first, but some kind of peace was made with Sun, which now links to the site.