Last year I posted about the use of a Joe Celko-style nested set hierarchy representation, for fast hierarchy queries. Here I will describe another approach which is simpler to query, but more wasteful of space. I did not invent this transitive closure approach, I learned of it from several directions:
- Celko wrote about it as “Path Enumeration” his 1995 book “SQL for Smarties: Advanced SQL Programming” (now in its third edition)
- I’ve seen it listed in various articles and web sites
- The relevant page of documentation for the Mondrian OLAP server calls it a “transitive closure table”. There is a page on their wiki with a tool to create them, and a forum post with Java code for this purpose.
- Steven Lott’s “Creating a Transitive Closure to Optimize Rollups” article
- Most recently, I saw this called an ‘ancestor table’ at evolt.com.
There are two (main) places to put the code for building a closure tables: in your application code, or in your database. The application approach is nice if you are aiming to avoid vendor-specific SQL code, but it is quite simple in SQL, and therefore not a big problem to recode for another RDBMS if the need arises. The SQL approach also avoids round-tripping the relevant data in and out of the database. Therefore, the approach I generally recommend for this is an SQL stored procedure / function.
Here is a simplified PostgreSQL stored procedure to do the job; note that his assumes a “widget” table with a widget_id and parent_id (the “adjacency” representation of a hierarchy), and a widget_closure table with fields (ancestor_id, widget_id, depth):
CREATE OR REPLACE FUNCTION populate_widget_closure() RETURNS integer AS ' DECLARE distance int; BEGIN delete from widget_closure; insert into widget_closure select widget_id, widget_id, 0 from widget; FOR distance in 1..20 LOOP insert into widget_closure select uc.parent_id, u.widget_id, distance from widget_closure uc, widget u where uc.child_id=u.reports_to_id and uc.distance = distance-1; END LOOP; RETURN 1; END; ' LANGUAGE plpgsql;
This sample code assumes a maximum depth of 20, and has no error checking. It will blindly miss greater depths and produce garbage if there is a “loop” in the ancestry. I recommend both arbitrary depth handling and error checking for production use.
Once your transitive closure table is populated, querying it is extremely fast and simple. You can get all the descendants of widget 12345 with the query “select widget_id from widget_closure where ancestor_id=12345”. Of course, this hierarchy representation, while simple to generate, is not simple to incrementally update as the hierarchy changes. The most straightforward way to use it is as a cache, regenerated as needed.