Jan 13 2008

Optimize Hierarchy Queries with a Transitive Closure Table

Published by at 7:21 pm under Technology   

Last year I posted about the use of a Joe Celko-style nested set hierarchy representation, for fast hierarchy queries. Here I will describe another approach which is simpler to query, but more wasteful of space. I did not invent this transitive closure approach, I learned of it from several directions:

There are two (main) places to put the code for building a closure tables: in your application code, or in your database. The application approach is nice if you are aiming to avoid vendor-specific SQL code, but it is quite simple in SQL, and therefore not a big problem to recode for another RDBMS if the need arises. The SQL approach also avoids round-tripping the relevant data in and out of the database. Therefore, the approach I generally recommend for this is an SQL stored procedure / function.

Here is a simplified PostgreSQL stored procedure to do the job; note that his assumes a “widget” table with a widget_id and parent_id (the “adjacency” representation of a hierarchy), and a widget_closure table with fields (ancestor_id, widget_id, depth):

This sample code assumes a maximum depth of 20, and has no error checking. It will blindly miss greater depths and produce garbage if there is a “loop” in the ancestry. I recommend both arbitrary depth handling and error checking for production use.

Once your transitive closure table is populated, querying it is extremely fast and simple. You can get all the descendants of widget 12345 with the query “select widget_id from widget_closure where ancestor_id=12345″. Of course, this hierarchy representation, while simple to generate, is not simple to incrementally update as the hierarchy changes. The most straightforward way to use it is as a cache, regenerated as needed.

If you found this post useful, please link to it from your web site, mention it online, or mention it to a colleague.

6 responses so far

6 Responses to “Optimize Hierarchy Queries with a Transitive Closure Table”

  1. Alex Miller says:

    Nice. Yet another episode in the long history of trading space for time… :)

  2. Craig Buchek says:

    Thanks for sharing this. I read your article on Celko’s method, and followed up by reading the original articles by Celko. Seems like a good method in some situations, but Celko’s method would seem to have poor insert/update performance as well, due to having to update, on average, half of the records in the table.

    I’d be interested in reading about more techniques like these.

  3. IVO GELOV says:

    Okay, let me explain it in more details. I will use PostgreSQL for examples,
    hoping that this will not prevent you from understanding my examples.
    First, we have this table definition:

    CREATE TABLE “proba”.”entity” (
    “id” INTEGER NOT NULL,
    “title” TEXT NOT NULL,
    “parent_id” INTEGER,
    CONSTRAINT “entity_pkey” PRIMARY KEY(“id”),
    CONSTRAINT “entity_chk” CHECK (parent_id id),
    CONSTRAINT “parent_fk” FOREIGN KEY (“parent_id”)
    REFERENCES “proba”.”entity”(“id”)
    ON DELETE CASCADE
    ON UPDATE CASCADE
    NOT DEFERRABLE
    ) WITHOUT OIDS;

    CREATE INDEX “parent_idx” ON “proba”.”entity”
    USING btree (“parent_id”);

    Suppose, that it is populated with the following data:

    +—-+——–+———–+
    | ID | TITLE | PARENT_ID |
    +—-+——–+———–+
    | 1 | RUZA | NULL |
    | 2 | SVETA | 1 |
    | 3 | SONIA | 2 |
    | 4 | PAUL | 2 |
    | 5 | TEDY | 1 |
    | 6 | MARY | 5 |
    | 7 | RUSLAN | 5 |
    | 8 | MITKO | 7 |
    | 9 | PETER | 7 |
    | 10 | MILEN | NULL |
    | 11 | ALEX | 10 |
    | 12 | ANTON | 10 |
    | 13 | BOBY | 12 |
    | 14 | NADIA | 12 |
    +—-+——–+———–+

    It represents the followng tree:

    ROOT (NULL)
    |
    +—– MILEN (10)
    | |
    | +—- ALEX (11)
    | |
    | +—- ANTON (12)
    | |
    | +—- BOBY (13)
    | |
    | +—- NADIA (14)
    |
    +—– RUZA (1)
    |
    +—- SVETA (2)
    | |
    | +—- SONIA (3)
    | |
    | +—- PAUL (4)
    |
    +—- TEDY (5)
    |
    +—- MARY (6)
    |
    +—- RUSLAN (7)
    |
    +—- MITKO (8)
    |
    +—- PETER (9)

    This is a classic “Adjacency List” model (I prefer to call it “Parent-Child” model).
    Now, if we add another column to the table (let’s call it PARENT_ID_2) – we can use it
    to store the parents of the PARENT_ID, like this:

    +—-+——–+———–+————-+
    | ID | TITLE | PARENT_ID | PARENT_ID_2 |
    +—-+——–+———–+————-+
    | 1 | RUZA | NULL | NULL |
    | 2 | SVETA | 1 | NULL |
    | 3 | SONIA | 2 | 1 |
    | 4 | PAUL | 2 | 1 |
    | 5 | TEDY | 1 | NULL |
    | 6 | MARY | 5 | 1 |
    | 7 | RUSLAN | 5 | 1 |
    | 8 | MITKO | 7 | 5 |
    | 9 | PETER | 7 | 5 |
    | 10 | MILEN | NULL | NULL |
    | 11 | ALEX | 10 | NULL |
    | 12 | ANTON | 10 | NULL |
    | 13 | BOBY | 12 | 10 |
    | 14 | NADIA | 12 | 10 |
    +—-+——–+———–+————-+

    We can continue this with another column to store the parents of PARENT_ID_2, and so on.
    The problem here is, that we do not know in advance how many levels down the tree will go.
    Of course, we can implement code to dynamically ALTER table definition to add additional
    columns when needed – but this is a bad practice. The first lesson I have learned about
    SQL tables is:
    The columns should remain fixed and never be altered by the code. Instead, when we need
    variable number of columns – transform them into rows.
    This transformation is done with the help of additional table (such kind of manipulations
    are also used to maintain 1:1, 1:m and n:m relationships in database theory).
    I have defined this additional table in this way:

    CREATE TABLE “proba”.”helper” (
    “child_id” INTEGER NOT NULL,
    “ancestor_id” INTEGER NOT NULL,
    “depth” INTEGER NOT NULL,
    CONSTRAINT “helper_idx” UNIQUE(“ancestor_id”, “child_id”),
    CONSTRAINT “helper_pkey” PRIMARY KEY(“child_id”, “ancestor_id”),
    CONSTRAINT “helper_chk” CHECK ((child_id ancestor_id) AND (depth > 0)),
    CONSTRAINT “ancestor_fk” FOREIGN KEY (“ancestor_id”)
    REFERENCES “proba”.”entity”(“id”)
    ON DELETE CASCADE
    ON UPDATE CASCADE
    NOT DEFERRABLE,
    CONSTRAINT “child_fk” FOREIGN KEY (“child_id”)
    REFERENCES “proba”.”entity”(“id”)
    ON DELETE CASCADE
    ON UPDATE CASCADE
    NOT DEFERRABLE
    ) WITHOUT OIDS;

    CREATE INDEX “ancestor_idx” ON “proba”.”helper”
    USING btree (“ancestor_id”);

    CREATE INDEX “child_idx” ON “proba”.”helper”
    USING btree (“child_id”);

    For the already shown above example data inside table ENTITY, the data in table HELPER shuld be:

    +———-+————-+——-+
    | CHILD_ID | ANCESTOR_ID | DEPTH |
    +———-+————-+——-+
    | 3 | 1 | 2 |
    | 4 | 1 | 2 |
    | 6 | 1 | 2 |
    | 7 | 1 | 2 |
    | 8 | 5 | 2 |
    | 9 | 5 | 2 |
    | 13 | 10 | 2 |
    | 14 | 10 | 2 |
    | 8 | 1 | 3 |
    | 9 | 1 | 3 |
    +———-+————-+——-+

    As we see, table HELPER does not contain NULL values, and lists all indirect descendants
    for each node along with the distance between them (distance of 1 is for direct descendants).

    After a short thinking, I decided to include
    all nodes’ descendatns in table HELPER (not only indirect ones).

    The number of rows in table HELPER is calculated by this formula:

    N
    —–
    \
    \ D[i]
    /
    /
    —–
    i=1

    N – is the number of nodes in tree (number of rows in table ENTITY)
    D[i] – is the number of edges from node “i” to the root

    As far as I know, adjacency list model is very cheap for inserting and updating, but a way too
    expensive for retrieving and deleting. The nested sets model (and also nested intervals) is
    the opposite – very cheap for retrieving and deleting, but too expensive for inserting and
    updating. So, it seems that there are two opposite options – one model for often insertions
    and seldom extractions, and another model for seldom insertions and often extractions (one
    for the expense of the other).
    I think that with closure tables it is possible to hit 2 rabbits with a single bullet – to combine
    good ones together and to shift the poor performance to a seldom used actions.
    What I mean ? I still do not have any benchmarks and all the above stuff needs carefull
    investigation and research, but I hope that with this tree model inserting, deleting and
    retrieving nodes will be cheap operations, and only moving subtrees will be expensive – but
    not so much as in other 2 models.

    Here are the SQL queries for all 4 possible tree manipulations:

    1. Adding new node – this is done with an INSERT-after trigger

    CREATE OR REPLACE FUNCTION “proba”.”entity_ins” () RETURNS trigger AS
    $body$
    BEGIN
    IF new.parent_id IS NOT NULL THEN
    INSERT INTO proba.helper(child_id,ancestor_id,depth)
    SELECT new.id,new.parent_id,1 UNION ALL
    SELECT new.id,ancestor_id,depth+1 FROM proba.helper WHERE child_id=new.parent_id;
    END IF;
    RETURN NULL;
    END;
    $body$
    LANGUAGE ‘plpgsql’
    VOLATILE
    CALLED ON NULL INPUT
    SECURITY INVOKER
    COST 100;

    CREATE TRIGGER “entity_tr_ins” AFTER INSERT
    ON “proba”.”entity” FOR EACH ROW
    EXECUTE PROCEDURE “proba”.”entity_ins”();

    2. Deleting a node and all of its descendants – this is handled by the cascading foreign keys.
    But if you wish, it can be done like this:

    DELETE FROM helper WHERE child_id=:X OR ancestor_id=:X

    3. Retrieving information:

    a) get all children of node X
    SELECT * FROM entity WHERE id IN (
    SELECT child_id FROM helper WHERE ancestor_id=:X ORDER BY depth
    ) ORDER BY title;
    b) get only immediate (direct) descendants of node X
    SELECT * FROM entity WHERE id IN (
    SELECT child_id FROM helper WHERE ancestor_id=:X AND depth=1
    ) ORDER BY title;
    c) get all ancestors of node X – from root downwards
    SELECT * FROM entity WHERE id IN (
    SELECT ancestor_id FROM helper WHERE child_id=:X ORDER BY depth
    ) ORDER BY title;
    d) get path from node A to node B
    SELECT title,depth FROM helper
    LEFT JOIN entity ON ancestor_id=entity.id
    WHERE child_id=:B AND depth

  4. IVO GELOV says:

    d) get path from node A to node B
    SELECT title,depth FROM helper
    LEFT JOIN entity ON ancestor_id=entity.id
    WHERE child_id=:B AND depth

  5. IVO GELOV says:

    I’m sorry for the flood, but it seems that my post is too long for WordPress …

    d) get path from node A to node B
    SELECT title,depth FROM helper
    LEFT JOIN entity ON ancestor_id=entity.id
    WHERE child_id=:B AND depth

  6. IVO GELOV says:

    d) get path from node A to node B
    SELECT title,depth FROM helper
    LEFT JOIN entity ON ancestor_id=entity.id
    WHERE child_id=:B AND depth <=
    (
    SELECT depth FROM helper WHERE child_id=:B AND ancestor_id=:A
    ) ORDER BY depth desc;

    4. Updating ownership – this is done with an UPDATE-after trigger

    CREATE OR REPLACE FUNCTION “proba”.”entity_del” () RETURNS trigger AS
    $body$
    BEGIN
    – first remove edges between all old parents of node and its descendants
    DELETE FROM proba.helper WHERE (child_id=old.id OR child_id IN
    (SELECT child_id FROM proba.helper WHERE ancestor_id = old.id)
    ) AND ancestor_id IN
    (SELECT ancestor_id FROM proba.helper WHERE child_id = old.id);
    — then add edges for all new parents …
    IF new.parent_id IS NOT NULL THEN
    INSERT INTO proba.helper(child_id,ancestor_id,depth)
    SELECT new.id,new.parent_id,1
    UNION ALL
    — … to node itself
    SELECT new.id,ancestor_id,depth+1 FROM proba.helper WHERE child_id=new.parent_id
    UNION ALL
    — … and its descendants
    (
    SELECT child_id,new.parent_id,depth+1 FROM proba.helper WHERE ancestor_id=new.id
    UNION ALL
    SELECT child_id,ancestor_id,depth FROM
    (SELECT child_id FROM proba.helper WHERE ancestor_id=new.id) AS child
    CROSS JOIN
    (SELECT ancestor_id,depth+2 AS depth FROM proba.helper WHERE child_id=new.parent_id) AS parent
    );
    END IF;
    RETURN NULL;
    END;
    $body$
    LANGUAGE ‘plpgsql’
    VOLATILE
    CALLED ON NULL INPUT
    SECURITY INVOKER
    COST 100;

    CREATE TRIGGER “entity_tr_del” AFTER UPDATE
    ON “proba”.”entity” FOR EACH ROW
    EXECUTE PROCEDURE “proba”.”entity_del”();

    Updating ownership is the most expensive operation, since it envolves a CROSS JOIN (Cartesian product).
    May be someone would make a benchmarks and compare “closure table” with “Nested intervals
    using Fray fractions”