Optimize Hierarchy Queries with a Transitive Closure Table

Last year I posted about the use of a Joe Celko-style nested set hierarchy representation, for fast hierarchy queries. Here I will describe another approach which is simpler to query, but more wasteful of space. I did not invent this transitive closure approach, I learned of it from several directions:

Celko wrote about it as “Path Enumeration” his 1995 book “SQL for Smarties: Advanced SQL Programming” (now in its third edition)
I’ve seen it listed in various articles and web sites
The relevant page of documentation for the Mondrian OLAP server calls it a “transitive closure table”. There is a page on their wiki with a tool to create them, and a forum post with Java code for this purpose.
Steven Lott’s “Creating a Transitive Closure to Optimize Rollups” article
Most recently, I saw this called an ‘ancestor table’ at evolt.com.

There are two (main) places to put the code for building a closure tables: in your application code, or in your database. The application approach is nice if you are aiming to avoid vendor-specific SQL code, but it is quite simple in SQL, and therefore not a big problem to recode for another RDBMS if the need arises. The SQL approach also avoids round-tripping the relevant data in and out of the database. Therefore, the approach I generally recommend for this is an SQL stored procedure / function.

Here is a simplified PostgreSQL stored procedure to do the job; note that his assumes a “widget” table with a widget_id and parent_id (the “adjacency” representation of a hierarchy), and a widget_closure table with fields (ancestor_id, widget_id, depth):

CREATE OR REPLACE FUNCTION populate_widget_closure()
RETURNS integer AS '
DECLARE
  distance int;
BEGIN
  delete from widget_closure;
  insert into widget_closure
    select widget_id, widget_id, 0 from widget;

  FOR distance in 1..20 LOOP
    insert into widget_closure
    select uc.parent_id, u.widget_id, distance
      from widget_closure uc, widget u
      where uc.child_id=u.reports_to_id
        and uc.distance = distance-1;
  END LOOP;
  RETURN 1;
END;
' LANGUAGE plpgsql;

This sample code assumes a maximum depth of 20, and has no error checking. It will blindly miss greater depths and produce garbage if there is a “loop” in the ancestry. I recommend both arbitrary depth handling and error checking for production use.

Once your transitive closure table is populated, querying it is extremely fast and simple. You can get all the descendants of widget 12345 with the query “select widget_id from widget_closure where ancestor_id=12345”. Of course, this hierarchy representation, while simple to generate, is not simple to incrementally update as the hierarchy changes. The most straightforward way to use it is as a cache, regenerated as needed.

6 thoughts on “Optimize Hierarchy Queries with a Transitive Closure Table”

Alex Miller says:

January 13, 2008 at 9:35 pm

Nice. Yet another episode in the long history of trading space for time… 🙂
Craig Buchek says:

January 15, 2008 at 11:11 am

Thanks for sharing this. I read your article on Celko’s method, and followed up by reading the original articles by Celko. Seems like a good method in some situations, but Celko’s method would seem to have poor insert/update performance as well, due to having to update, on average, half of the records in the table.

I’d be interested in reading about more techniques like these.
IVO GELOV says:

September 16, 2008 at 9:03 am

Okay, let me explain it in more details. I will use PostgreSQL for examples,
hoping that this will not prevent you from understanding my examples.
First, we have this table definition:

CREATE TABLE “proba”.”entity” (
“id” INTEGER NOT NULL,
“title” TEXT NOT NULL,
“parent_id” INTEGER,
CONSTRAINT “entity_pkey” PRIMARY KEY(“id”),
CONSTRAINT “entity_chk” CHECK (parent_id id),
CONSTRAINT “parent_fk” FOREIGN KEY (“parent_id”)
REFERENCES “proba”.”entity”(“id”)
ON DELETE CASCADE
ON UPDATE CASCADE
NOT DEFERRABLE
) WITHOUT OIDS;

CREATE INDEX “parent_idx” ON “proba”.”entity”
USING btree (“parent_id”);

Suppose, that it is populated with the following data:

+—-+——–+———–+
| ID | TITLE | PARENT_ID |
+—-+——–+———–+
| 1 | RUZA | NULL |
| 2 | SVETA | 1 |
| 3 | SONIA | 2 |
| 4 | PAUL | 2 |
| 5 | TEDY | 1 |
| 6 | MARY | 5 |
| 7 | RUSLAN | 5 |
| 8 | MITKO | 7 |
| 9 | PETER | 7 |
| 10 | MILEN | NULL |
| 11 | ALEX | 10 |
| 12 | ANTON | 10 |
| 13 | BOBY | 12 |
| 14 | NADIA | 12 |
+—-+——–+———–+

It represents the followng tree:

ROOT (NULL)
|
+—– MILEN (10)
| |
| +—- ALEX (11)
| |
| +—- ANTON (12)
| |
| +—- BOBY (13)
| |
| +—- NADIA (14)
|
+—– RUZA (1)
|
+—- SVETA (2)
| |
| +—- SONIA (3)
| |
| +—- PAUL (4)
|
+—- TEDY (5)
|
+—- MARY (6)
|
+—- RUSLAN (7)
|
+—- MITKO (8)
|
+—- PETER (9)

This is a classic “Adjacency List” model (I prefer to call it “Parent-Child” model).
Now, if we add another column to the table (let’s call it PARENT_ID_2) – we can use it
to store the parents of the PARENT_ID, like this:

+—-+——–+———–+————-+
| ID | TITLE | PARENT_ID | PARENT_ID_2 |
+—-+——–+———–+————-+
| 1 | RUZA | NULL | NULL |
| 2 | SVETA | 1 | NULL |
| 3 | SONIA | 2 | 1 |
| 4 | PAUL | 2 | 1 |
| 5 | TEDY | 1 | NULL |
| 6 | MARY | 5 | 1 |
| 7 | RUSLAN | 5 | 1 |
| 8 | MITKO | 7 | 5 |
| 9 | PETER | 7 | 5 |
| 10 | MILEN | NULL | NULL |
| 11 | ALEX | 10 | NULL |
| 12 | ANTON | 10 | NULL |
| 13 | BOBY | 12 | 10 |
| 14 | NADIA | 12 | 10 |
+—-+——–+———–+————-+

We can continue this with another column to store the parents of PARENT_ID_2, and so on.
The problem here is, that we do not know in advance how many levels down the tree will go.
Of course, we can implement code to dynamically ALTER table definition to add additional
columns when needed – but this is a bad practice. The first lesson I have learned about
SQL tables is:
The columns should remain fixed and never be altered by the code. Instead, when we need
variable number of columns – transform them into rows.
This transformation is done with the help of additional table (such kind of manipulations
are also used to maintain 1:1, 1:m and n:m relationships in database theory).
I have defined this additional table in this way:

CREATE TABLE “proba”.”helper” (
“child_id” INTEGER NOT NULL,
“ancestor_id” INTEGER NOT NULL,
“depth” INTEGER NOT NULL,
CONSTRAINT “helper_idx” UNIQUE(“ancestor_id”, “child_id”),
CONSTRAINT “helper_pkey” PRIMARY KEY(“child_id”, “ancestor_id”),
CONSTRAINT “helper_chk” CHECK ((child_id ancestor_id) AND (depth > 0)),
CONSTRAINT “ancestor_fk” FOREIGN KEY (“ancestor_id”)
REFERENCES “proba”.”entity”(“id”)
ON DELETE CASCADE
ON UPDATE CASCADE
NOT DEFERRABLE,
CONSTRAINT “child_fk” FOREIGN KEY (“child_id”)
REFERENCES “proba”.”entity”(“id”)
ON DELETE CASCADE
ON UPDATE CASCADE
NOT DEFERRABLE
) WITHOUT OIDS;

CREATE INDEX “ancestor_idx” ON “proba”.”helper”
USING btree (“ancestor_id”);

CREATE INDEX “child_idx” ON “proba”.”helper”
USING btree (“child_id”);

For the already shown above example data inside table ENTITY, the data in table HELPER shuld be:

+———-+————-+——-+
| CHILD_ID | ANCESTOR_ID | DEPTH |
+———-+————-+——-+
| 3 | 1 | 2 |
| 4 | 1 | 2 |
| 6 | 1 | 2 |
| 7 | 1 | 2 |
| 8 | 5 | 2 |
| 9 | 5 | 2 |
| 13 | 10 | 2 |
| 14 | 10 | 2 |
| 8 | 1 | 3 |
| 9 | 1 | 3 |
+———-+————-+——-+

As we see, table HELPER does not contain NULL values, and lists all indirect descendants
for each node along with the distance between them (distance of 1 is for direct descendants).

After a short thinking, I decided to include
all nodes’ descendatns in table HELPER (not only indirect ones).

The number of rows in table HELPER is calculated by this formula:

N
—–
\
\ D[i]
/
/
—–
i=1

N – is the number of nodes in tree (number of rows in table ENTITY)
D[i] – is the number of edges from node “i” to the root

As far as I know, adjacency list model is very cheap for inserting and updating, but a way too
expensive for retrieving and deleting. The nested sets model (and also nested intervals) is
the opposite – very cheap for retrieving and deleting, but too expensive for inserting and
updating. So, it seems that there are two opposite options – one model for often insertions
and seldom extractions, and another model for seldom insertions and often extractions (one
for the expense of the other).
I think that with closure tables it is possible to hit 2 rabbits with a single bullet – to combine
good ones together and to shift the poor performance to a seldom used actions.
What I mean ? I still do not have any benchmarks and all the above stuff needs carefull
investigation and research, but I hope that with this tree model inserting, deleting and
retrieving nodes will be cheap operations, and only moving subtrees will be expensive – but
not so much as in other 2 models.

Here are the SQL queries for all 4 possible tree manipulations:

1. Adding new node – this is done with an INSERT-after trigger

CREATE OR REPLACE FUNCTION “proba”.”entity_ins” () RETURNS trigger AS
$body$
BEGIN
IF new.parent_id IS NOT NULL THEN
INSERT INTO proba.helper(child_id,ancestor_id,depth)
SELECT new.id,new.parent_id,1 UNION ALL
SELECT new.id,ancestor_id,depth+1 FROM proba.helper WHERE child_id=new.parent_id;
END IF;
RETURN NULL;
END;
$body$
LANGUAGE ‘plpgsql’
VOLATILE
CALLED ON NULL INPUT
SECURITY INVOKER
COST 100;

CREATE TRIGGER “entity_tr_ins” AFTER INSERT
ON “proba”.”entity” FOR EACH ROW
EXECUTE PROCEDURE “proba”.”entity_ins”();

2. Deleting a node and all of its descendants – this is handled by the cascading foreign keys.
But if you wish, it can be done like this:

DELETE FROM helper WHERE child_id=:X OR ancestor_id=:X

3. Retrieving information:

a) get all children of node X
SELECT * FROM entity WHERE id IN (
SELECT child_id FROM helper WHERE ancestor_id=:X ORDER BY depth
) ORDER BY title;
b) get only immediate (direct) descendants of node X
SELECT * FROM entity WHERE id IN (
SELECT child_id FROM helper WHERE ancestor_id=:X AND depth=1
) ORDER BY title;
c) get all ancestors of node X – from root downwards
SELECT * FROM entity WHERE id IN (
SELECT ancestor_id FROM helper WHERE child_id=:X ORDER BY depth
) ORDER BY title;
d) get path from node A to node B
SELECT title,depth FROM helper
LEFT JOIN entity ON ancestor_id=entity.id
WHERE child_id=:B AND depth
IVO GELOV says:

September 16, 2008 at 9:04 am

d) get path from node A to node B
SELECT title,depth FROM helper
LEFT JOIN entity ON ancestor_id=entity.id
WHERE child_id=:B AND depth
IVO GELOV says:

September 16, 2008 at 9:06 am

I’m sorry for the flood, but it seems that my post is too long for WordPress …

d) get path from node A to node B
SELECT title,depth FROM helper
LEFT JOIN entity ON ancestor_id=entity.id
WHERE child_id=:B AND depth
IVO GELOV says:

September 16, 2008 at 9:07 am

d) get path from node A to node B
SELECT title,depth FROM helper
LEFT JOIN entity ON ancestor_id=entity.id
WHERE child_id=:B AND depth <=
(
SELECT depth FROM helper WHERE child_id=:B AND ancestor_id=:A
) ORDER BY depth desc;

4. Updating ownership – this is done with an UPDATE-after trigger

CREATE OR REPLACE FUNCTION “proba”.”entity_del” () RETURNS trigger AS
$body$
BEGIN
— first remove edges between all old parents of node and its descendants
DELETE FROM proba.helper WHERE (child_id=old.id OR child_id IN
(SELECT child_id FROM proba.helper WHERE ancestor_id = old.id)
) AND ancestor_id IN
(SELECT ancestor_id FROM proba.helper WHERE child_id = old.id);
— then add edges for all new parents …
IF new.parent_id IS NOT NULL THEN
INSERT INTO proba.helper(child_id,ancestor_id,depth)
SELECT new.id,new.parent_id,1
UNION ALL
— … to node itself
SELECT new.id,ancestor_id,depth+1 FROM proba.helper WHERE child_id=new.parent_id
UNION ALL
— … and its descendants
(
SELECT child_id,new.parent_id,depth+1 FROM proba.helper WHERE ancestor_id=new.id
UNION ALL
SELECT child_id,ancestor_id,depth FROM
(SELECT child_id FROM proba.helper WHERE ancestor_id=new.id) AS child
CROSS JOIN
(SELECT ancestor_id,depth+2 AS depth FROM proba.helper WHERE child_id=new.parent_id) AS parent
);
END IF;
RETURN NULL;
END;
$body$
LANGUAGE ‘plpgsql’
VOLATILE
CALLED ON NULL INPUT
SECURITY INVOKER
COST 100;

CREATE TRIGGER “entity_tr_del” AFTER UPDATE
ON “proba”.”entity” FOR EACH ROW
EXECUTE PROCEDURE “proba”.”entity_del”();

Updating ownership is the most expensive operation, since it envolves a CROSS JOIN (Cartesian product).
May be someone would make a benchmarks and compare “closure table” with “Nested intervals
using Fray fractions”

Comments are closed.