postgresql – Page 2 – Kyle Cordes

Optimize Hierarchy Queries with a Transitive Closure Table

Last year I posted about the use of a Joe Celko-style nested set hierarchy representation, for fast hierarchy queries. Here I will describe another approach which is simpler to query, but more wasteful of space. I did not invent this transitive closure approach, I learned of it from several directions:

Celko wrote about it as “Path Enumeration” his 1995 book “SQL for Smarties: Advanced SQL Programming” (now in its third edition)
I’ve seen it listed in various articles and web sites
The relevant page of documentation for the Mondrian OLAP server calls it a “transitive closure table”. There is a page on their wiki with a tool to create them, and a forum post with Java code for this purpose.
Steven Lott’s “Creating a Transitive Closure to Optimize Rollups” article
Most recently, I saw this called an ‘ancestor table’ at evolt.com.

There are two (main) places to put the code for building a closure tables: in your application code, or in your database. The application approach is nice if you are aiming to avoid vendor-specific SQL code, but it is quite simple in SQL, and therefore not a big problem to recode for another RDBMS if the need arises. The SQL approach also avoids round-tripping the relevant data in and out of the database. Therefore, the approach I generally recommend for this is an SQL stored procedure / function.

Here is a simplified PostgreSQL stored procedure to do the job; note that his assumes a “widget” table with a widget_id and parent_id (the “adjacency” representation of a hierarchy), and a widget_closure table with fields (ancestor_id, widget_id, depth):

CREATE OR REPLACE FUNCTION populate_widget_closure()
RETURNS integer AS '
DECLARE
  distance int;
BEGIN
  delete from widget_closure;
  insert into widget_closure
    select widget_id, widget_id, 0 from widget;

  FOR distance in 1..20 LOOP
    insert into widget_closure
    select uc.parent_id, u.widget_id, distance
      from widget_closure uc, widget u
      where uc.child_id=u.reports_to_id
        and uc.distance = distance-1;
  END LOOP;
  RETURN 1;
END;
' LANGUAGE plpgsql;

This sample code assumes a maximum depth of 20, and has no error checking. It will blindly miss greater depths and produce garbage if there is a “loop” in the ancestry. I recommend both arbitrary depth handling and error checking for production use.

Once your transitive closure table is populated, querying it is extremely fast and simple. You can get all the descendants of widget 12345 with the query “select widget_id from widget_closure where ancestor_id=12345”. Of course, this hierarchy representation, while simple to generate, is not simple to incrementally update as the hierarchy changes. The most straightforward way to use it is as a cache, regenerated as needed.

Help! My Hierarchy is Slow – Faster Hierarchies with Nested Sets

A great many applications, including many that I’ve worked on, have a hierarchy of things: of parts, of people, of organizations, etc. The way most of us represent such hierarchy is with the first thing that generally comes to mind: make each Widget have a parent Widget, with a table like so:

create table widget (widget_id int, parent_widget_id int, other_fields_here);

This representation is called an “adjacency list”, and is simple and easy. You can readily build a tool to manipulate a hierarchy stored this way. Many off the shelf visual components, for both client-side and web applications, know how to manipulate hierarchies represented this way. Some reporting tools know how to report on hierarchies represented this way.

However, for answering common questions like “who all is under person X in the hierarchy”, the adjacency list approach is unwieldy and slow.

There are various other approaches to representing a hierarchy, most of them discussed in detail in Joe Celko’s articles and books, prominently in the book Trees and Hierarchies in SQL for Smarties. If you work with SQL and hierarchies, buy this book now.

One approach Celko is especially fond of is the “nested set” representation. You can read about it online here and here.

Of course, changing an entire application to use nested sets might be a very big deal in a mature application. That’s OK; in most cases we can get much of the benefit by building a nested-sets “cache” of the share of the hierarchy, with a table like so:

create table widget_hier_cache (widget_id int, left int, right int);

Each time the hierarchy has changed, or before each time we need to run complex queries, delete the rows in this cache table and repopulate them based on the current canonical adjacency data. Celko offers SQL code in his book to do that, which could be translated to work in the stored procedure language of the DBMS at hand. But what about DBMSs that don’t offer stored procedures, such as lightweight local databases (SQLite), MySQL, etc.? The translation must be done in application code instead.

I wrote such code in Delphi a while back, in the process of getting a full understanding of this problem; I’ve cleaned it up and now offer code for download here (DelphiAdjacencyNestedSets.zip), under an open source license (MIT license – use it all you want). I tested this today with Delphi 2007 Win32, but it should work fine at least back to Delphi 7. As far as I can tell with some searching, this is the only Delphi code for translating adjacency to nested sets available on the internet. This code doesn’t know about databases – it is a module to which you feed adjacency data, and from which you get back nested set data. It includes DUnit test cases.

I’ve also put the code on github, for easy browsing and forking.

(Update in August 2007: a new version (DelphiAdjacencyNestedSets2.zip) optionally tolerates “orphan” nodes and forests. Update in January 2008: a newer version (DelphiAdjacencyNestedSets3.zip) propagates an integer value down the hierarchy; it is named Tag as a nod to the .Tag property on VCL components. Both of these newer versions were tested with Delphi 2006/2007 also.)

The essence of the translation is a depth-first traversal of the hierarchy, and of course this can be easily implemented in other languages; the Delphi code is easy to understand, so don’t be afraid to take a look even if you need some Java or C# or PHP etc. I also stumbled across this PHP nested sets implementation, which offers a set of functions to maintain (insert, update, etc.) a hierarchy stored as nested sets, rather than only translate from adjacency to nested sets.

Another useful way to represent a hierarchy for fast querying is with a transitive closure table. I’ll write this up in a future post; it turns out to be especially useful (and necessary) to make arbitrary hierarchies work in the Mondrian OLAP server.

DRBD on Ubuntu

This week I experimented with DRBD (Distributed Replicated Block Device) which is “a block device which is designed to build high availability clusters. This is done by mirroring a whole block device via (a dedicated) network. You could see it as a network raid-1.”

DRBD can be used for build a “poor man’s SAN”, a RAID running over Ethernet. In our case, we want a warm spare of an important database system, i.e. a second set of hardware which can take over for the primary within a few minutes of failure, and DRBD might be a great way accomplish that. DRBD runs on commodity hardware (ordinary servers, network adapters, etc.) so it is inexpensive. It is also, if benchmarks and comments are to be believed, surprisingly good. I suspect this is because CPUs and networks (Gigabit Ethernet, ideally) are so much faster than disks.

The systems at hand for testing here run the Ubuntu distribution, itself based on Debian. It took a lot of reading and learning, but very little actual typing, to get DRBD going on Ubuntu. My writeup here should help the next person along. As I write this, the released version of DRBD is 0.7, and the DRBD site encourages ignoring the newer unofficial release, advice that I followed.

First, read the HOWTO and the FAQ, but don’t follow the instruction in them yet; then you are ready to begin. The guessable part is to install the module with apt-get (you’ll need the “Community” APT sources in Ubuntu; I don’t know offhand what is equivalent in Debian):

apt-get install drbd0.7-module-source

This installs the module’s source code, not the module itself. After doing this, I spent a couple of hours trying to figure out how to compile it – untarring it and running “make” does not work. There is an abundance of confusing and misleading advice on the web about this, including advice to compile your own kernel. If you don’t need a custom-compiled kernel for other reasons, ignore all that and discover the joy of Debian’s “Module Assistant”, invoked by the command “m-a”.

m-a prepare

“prepare” will get your machine ready to compile modules that can link with the running kernel; it will prompt to download and install packages needed for this, including the correct kernel header packages. Now you are ready to build and install the module:

m-a a-i drbd0.7-module-source

Follow the prompts; if it asks to install more packages, agree to do so. On Ubuntu 6.10, it may fail with an error about “mv”. This is a known bug; the workaround is to expand the module source file (it lives in /usr/src/), add the line “SHELL=/bin/bash” at the top of each Makefile, then retar the module source.

By the way, this points out a stark and troubling aspect of Ubuntu “community” packages: not only are they not assured to work, they are not even casually tested to verify they compile on the Ubuntu version they are purportedly intended for use with… which I find very disappointing. I suppose the lesson is that Ubuntu is very slick for the set of officially supported packages, but if you’ll be using a lot of packages beyond that set, its appeal is much diminished.

Now you are ready to enable DRBD, following the HOWTO instructions, starting with setting up /etc/drbd.conf.

In my case I used LVM to make a 10 GB partition on each machine, put DRBD on top of that, put a filesystem (Reiserfs, for variety) on top of them, then moved/symlinked my PostgreSQL data there. It worked fine, including these tests: I started a long intensive DB operation (restoring a several-GB backup), then with that running, rebooted the secondary machine. This did not interrupt the DB operation, and when the secondary finished its reboot, the DRBD partition re-synced automatically. I then shut down the primary machine, and verified I could mount and read that same data on the secondary machine… all exactly as described.

One caveat though – in a few cases, machines with DRBD stopped during the boot sequence, waiting for user intervention (in this case, closing a shell) because they the DRBD device wouldn’t mount. I suspect this was because I don’t fully understand how to configure it.

Of course this was just a short initial look, but I am very impressed with DRBD so far.