What is the most efficient/elegant way to parse a flat table into a tree?

128,693

Solution 1

Now that MySQL 8.0 supports recursive queries, we can say that all popular SQL databases support recursive queries in standard syntax.

WITH RECURSIVE MyTree AS (
    SELECT * FROM MyTable WHERE ParentId IS NULL
    UNION ALL
    SELECT m.* FROM MyTABLE AS m JOIN MyTree AS t ON m.ParentId = t.Id
)
SELECT * FROM MyTree;

I tested recursive queries in MySQL 8.0 in my presentation Recursive Query Throwdown in 2017.

Below is my original answer from 2008:


There are several ways to store tree-structured data in a relational database. What you show in your example uses two methods:

  • Adjacency List (the "parent" column) and
  • Path Enumeration (the dotted-numbers in your name column).

Another solution is called Nested Sets, and it can be stored in the same table too. Read "Trees and Hierarchies in SQL for Smarties" by Joe Celko for a lot more information on these designs.

I usually prefer a design called Closure Table (aka "Adjacency Relation") for storing tree-structured data. It requires another table, but then querying trees is pretty easy.

I cover Closure Table in my presentation Models for Hierarchical Data with SQL and PHP and in my book SQL Antipatterns: Avoiding the Pitfalls of Database Programming.

CREATE TABLE ClosureTable (
  ancestor_id   INT NOT NULL REFERENCES FlatTable(id),
  descendant_id INT NOT NULL REFERENCES FlatTable(id),
  PRIMARY KEY (ancestor_id, descendant_id)
);

Store all paths in the Closure Table, where there is a direct ancestry from one node to another. Include a row for each node to reference itself. For example, using the data set you showed in your question:

INSERT INTO ClosureTable (ancestor_id, descendant_id) VALUES
  (1,1), (1,2), (1,4), (1,6),
  (2,2), (2,4),
  (3,3), (3,5),
  (4,4),
  (5,5),
  (6,6);

Now you can get a tree starting at node 1 like this:

SELECT f.* 
FROM FlatTable f 
  JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1;

The output (in MySQL client) looks like the following:

+----+
| id |
+----+
|  1 | 
|  2 | 
|  4 | 
|  6 | 
+----+

In other words, nodes 3 and 5 are excluded, because they're part of a separate hierarchy, not descending from node 1.


Re: comment from e-satis about immediate children (or immediate parent). You can add a "path_length" column to the ClosureTable to make it easier to query specifically for an immediate child or parent (or any other distance).

INSERT INTO ClosureTable (ancestor_id, descendant_id, path_length) VALUES
  (1,1,0), (1,2,1), (1,4,2), (1,6,1),
  (2,2,0), (2,4,1),
  (3,3,0), (3,5,1),
  (4,4,0),
  (5,5,0),
  (6,6,0);

Then you can add a term in your search for querying the immediate children of a given node. These are descendants whose path_length is 1.

SELECT f.* 
FROM FlatTable f 
  JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1
  AND path_length = 1;

+----+
| id |
+----+
|  2 | 
|  6 | 
+----+

Re comment from @ashraf: "How about sorting the whole tree [by name]?"

Here's an example query to return all nodes that are descendants of node 1, join them to the FlatTable that contains other node attributes such as name, and sort by the name.

SELECT f.name
FROM FlatTable f 
JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1
ORDER BY f.name;

Re comment from @Nate:

SELECT f.name, GROUP_CONCAT(b.ancestor_id order by b.path_length desc) AS breadcrumbs
FROM FlatTable f 
JOIN ClosureTable a ON (f.id = a.descendant_id) 
JOIN ClosureTable b ON (b.descendant_id = a.descendant_id) 
WHERE a.ancestor_id = 1 
GROUP BY a.descendant_id 
ORDER BY f.name

+------------+-------------+
| name       | breadcrumbs |
+------------+-------------+
| Node 1     | 1           |
| Node 1.1   | 1,2         |
| Node 1.1.1 | 1,2,4       |
| Node 1.2   | 1,6         |
+------------+-------------+

A user suggested an edit today. SO moderators approved the edit, but I am reversing it.

The edit suggested that the ORDER BY in the last query above should be ORDER BY b.path_length, f.name, presumably to make sure the ordering matches the hierarchy. But this doesn't work, because it would order "Node 1.1.1" after "Node 1.2".

If you want the ordering to match the hierarchy in a sensible way, that is possible, but not simply by ordering by the path length. For example, see my answer to MySQL Closure Table hierarchical database - How to pull information out in the correct order.

Solution 2

If you use nested sets (sometimes referred to as Modified Pre-order Tree Traversal) you can extract the entire tree structure or any subtree within it in tree order with a single query, at the cost of inserts being more expensive, as you need to manage columns which describe an in-order path through thee tree structure.

For django-mptt, I used a structure like this:

id  parent_id  tree_id  level  lft  rght
--  ---------  -------  -----  ---  ----
 1       null        1      0    1    14
 2          1        1      1    2     7
 3          2        1      2    3     4
 4          2        1      2    5     6
 5          1        1      1    8    13
 6          5        1      2    9    10
 7          5        1      2    11   12

Which describes a tree which looks like this (with id representing each item):

 1
 +-- 2
 |   +-- 3
 |   +-- 4
 |
 +-- 5
     +-- 6
     +-- 7

Or, as a nested set diagram which makes it more obvious how the lft and rght values work:

 __________________________________________________________________________
|  Root 1                                                                  |
|   ________________________________    ________________________________   |
|  |  Child 1.1                     |  |  Child 1.2                     |  |
|  |   ___________    ___________   |  |   ___________    ___________   |  |
|  |  |  C 1.1.1  |  |  C 1.1.2  |  |  |  |  C 1.2.1  |  |  C 1.2.2  |  |  |
1  2  3___________4  5___________6  7  8  9___________10 11__________12 13 14
|  |________________________________|  |________________________________|  |
|__________________________________________________________________________|

As you can see, to get the entire subtree for a given node, in tree order, you simply have to select all rows which have lft and rght values between its lft and rght values. It's also simple to retrieve the tree of ancestors for a given node.

The level column is a bit of denormalisation for convenience more than anything and the tree_id column allows you to restart the lft and rght numbering for each top-level node, which reduces the number of columns affected by inserts, moves and deletions, as the lft and rght columns have to be adjusted accordingly when these operations take place in order to create or close gaps. I made some development notes at the time when I was trying to wrap my head around the queries required for each operation.

In terms of actually working with this data to display a tree, I created a tree_item_iterator utility function which, for each node, should give you sufficient information to generate whatever kind of display you want.

More info about MPTT:

Solution 3

It's a quite old question, but as it's got many views I think it's worth to present an alternative, and in my opinion very elegant, solution.

In order to read a tree structure you can use recursive Common Table Expressions (CTEs). It gives a possibility to fetch whole tree structure at once, have the information about the level of the node, its parent node and order within children of the parent node.

Let me show you how this would work in PostgreSQL 9.1.

  1. Create a structure

    CREATE TABLE tree (
        id int  NOT NULL,
        name varchar(32)  NOT NULL,
        parent_id int  NULL,
        node_order int  NOT NULL,
        CONSTRAINT tree_pk PRIMARY KEY (id),
        CONSTRAINT tree_tree_fk FOREIGN KEY (parent_id) 
          REFERENCES tree (id) NOT DEFERRABLE
    );
    
    
    insert into tree values
      (0, 'ROOT', NULL, 0),
      (1, 'Node 1', 0, 10),
      (2, 'Node 1.1', 1, 10),
      (3, 'Node 2', 0, 20),
      (4, 'Node 1.1.1', 2, 10),
      (5, 'Node 2.1', 3, 10),
      (6, 'Node 1.2', 1, 20);
    
  2. Write a query

    WITH RECURSIVE 
    tree_search (id, name, level, parent_id, node_order) AS (
      SELECT 
        id, 
        name,
        0,
        parent_id, 
        1 
      FROM tree
      WHERE parent_id is NULL
    
      UNION ALL 
      SELECT 
        t.id, 
        t.name,
        ts.level + 1, 
        ts.id, 
        t.node_order 
      FROM tree t, tree_search ts 
      WHERE t.parent_id = ts.id 
    ) 
    SELECT * FROM tree_search 
    WHERE level > 0 
    ORDER BY level, parent_id, node_order;
    

    Here are the results:

     id |    name    | level | parent_id | node_order 
    ----+------------+-------+-----------+------------
      1 | Node 1     |     1 |         0 |         10
      3 | Node 2     |     1 |         0 |         20
      2 | Node 1.1   |     2 |         1 |         10
      6 | Node 1.2   |     2 |         1 |         20
      5 | Node 2.1   |     2 |         3 |         10
      4 | Node 1.1.1 |     3 |         2 |         10
    (6 rows)
    

    The tree nodes are ordered by a level of depth. In the final output we would present them in the subsequent lines.

    For each level, they are ordered by parent_id and node_order within the parent. This tells us how to present them in the output - link node to the parent in this order.

    Having such a structure it wouldn't be difficult to make a really nice presentation in HTML.

    Recursive CTEs are available in PostgreSQL, IBM DB2, MS SQL Server and Oracle.

    If you'd like to read more on recursive SQL queries, you can either check the documentation of your favourite DBMS or read my two articles covering this topic:

Solution 4

As of Oracle 9i, you can use CONNECT BY.

SELECT LPAD(' ', (LEVEL - 1) * 4) || "Name" AS "Name"
FROM (SELECT * FROM TMP_NODE ORDER BY "Order")
CONNECT BY PRIOR "Id" = "ParentId"
START WITH "Id" IN (SELECT "Id" FROM TMP_NODE WHERE "ParentId" = 0)

As of SQL Server 2005, you can use a recursive common table expression (CTE).

WITH [NodeList] (
  [Id]
  , [ParentId]
  , [Level]
  , [Order]
) AS (
  SELECT [Node].[Id]
    , [Node].[ParentId]
    , 0 AS [Level]
    , CONVERT([varchar](MAX), [Node].[Order]) AS [Order]
  FROM [Node]
  WHERE [Node].[ParentId] = 0
  UNION ALL
  SELECT [Node].[Id]
    , [Node].[ParentId]
    , [NodeList].[Level] + 1 AS [Level]
    , [NodeList].[Order] + '|'
      + CONVERT([varchar](MAX), [Node].[Order]) AS [Order]
  FROM [Node]
    INNER JOIN [NodeList] ON [NodeList].[Id] = [Node].[ParentId]
) SELECT REPLICATE(' ', [NodeList].[Level] * 4) + [Node].[Name] AS [Name]
FROM [Node]
  INNER JOIN [NodeList] ON [NodeList].[Id] = [Node].[Id]
ORDER BY [NodeList].[Order]

Both will output the following results.

Name
'Node 1'
'    Node 1.1'
'        Node 1.1.1'
'    Node 1.2'
'Node 2'
'    Node 2.1'

Solution 5

Bill's answer is pretty gosh-darned good, this answer adds some things to it which makes me wish SO supported threaded answers.

Anyway I wanted to support the tree structure and the Order property. I included a single property in each Node called leftSibling that does the same thing Order is meant to do in the original question (maintain left-to-right order).

mysql> desc nodes ;
+-------------+--------------+------+-----+---------+----------------+
| Field       | Type         | Null | Key | Default | Extra          |
+-------------+--------------+------+-----+---------+----------------+
| id          | int(11)      | NO   | PRI | NULL    | auto_increment |
| name        | varchar(255) | YES  |     | NULL    |                |
| leftSibling | int(11)      | NO   |     | 0       |                |
+-------------+--------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

mysql> desc adjacencies;
+------------+---------+------+-----+---------+----------------+
| Field      | Type    | Null | Key | Default | Extra          |
+------------+---------+------+-----+---------+----------------+
| relationId | int(11) | NO   | PRI | NULL    | auto_increment |
| parent     | int(11) | NO   |     | NULL    |                |
| child      | int(11) | NO   |     | NULL    |                |
| pathLen    | int(11) | NO   |     | NULL    |                |
+------------+---------+------+-----+---------+----------------+
4 rows in set (0.00 sec)

More detail and SQL code on my blog.

Thanks Bill your answer was helpful in getting started!

Share:
128,693
Tomalak
Author by

Tomalak

I know a bit about SQL, Regular Expressions, XSLT, ColdFusion, JavaScript, Python, PowerShell and scripting languages in general. I also have a Unicorn. :) Further I have a Twitter account I update once in a while, but you won't read much technical stuff there.

Updated on July 13, 2022

Comments

  • Tomalak
    Tomalak almost 2 years

    Assume you have a flat table that stores an ordered tree hierarchy:

    Id   Name         ParentId   Order
     1   'Node 1'            0      10
     2   'Node 1.1'          1      10
     3   'Node 2'            0      20
     4   'Node 1.1.1'        2      10
     5   'Node 2.1'          3      10
     6   'Node 1.2'          1      20
    

    Here's a diagram, where we have [id] Name. Root node 0 is fictional.

                           [0] ROOT
                              /    \ 
                  [1] Node 1          [3] Node 2
                  /       \                   \
        [2] Node 1.1     [6] Node 1.2      [5] Node 2.1
              /          
     [4] Node 1.1.1
    

    What minimalistic approach would you use to output that to HTML (or text, for that matter) as a correctly ordered, correctly indented tree?

    Assume further you only have basic data structures (arrays and hashmaps), no fancy objects with parent/children references, no ORM, no framework, just your two hands. The table is represented as a result set, which can be accessed randomly.

    Pseudo code or plain English is okay, this is purely a conceptional question.

    Bonus question: Is there a fundamentally better way to store a tree structure like this in a RDBMS?


    EDITS AND ADDITIONS

    To answer one commenter's (Mark Bessey's) question: A root node is not necessary, because it is never going to be displayed anyway. ParentId = 0 is the convention to express "these are top level". The Order column defines how nodes with the same parent are going to be sorted.

    The "result set" I spoke of can be pictured as an array of hashmaps (to stay in that terminology). For my example was meant to be already there. Some answers go the extra mile and construct it first, but thats okay.

    The tree can be arbitrarily deep. Each node can have N children. I did not exactly have a "millions of entries" tree in mind, though.

    Don't mistake my choice of node naming ('Node 1.1.1') for something to rely on. The nodes could equally well be called 'Frank' or 'Bob', no naming structure is implied, this was merely to make it readable.

    I have posted my own solution so you guys can pull it to pieces.

  • Tomalak
    Tomalak over 15 years
    This is very elegant, thank you. Bonus point awarded. ;-) I see one small drawback though - as it stores the child relation explicitly and implicitly, you need to do a lot of careful UPDATEing for even a small shift in the tree structure.
  • Tomalak
    Tomalak over 15 years
    That is what I meant with "no frameworks" - you are using LINQ, aren't you? Regarding your first paragraph: The result set is already there, why copying all the info to a new object structure first? (I was not clear enough on that fact, sorry)
  • Tomalak
    Tomalak over 15 years
    I'd rather not change the DB layout just because a new level of sub-nodes is needed. :-)
  • Bill Karwin
    Bill Karwin over 15 years
    True, every method of storing tree-structures in a database requires some work, either when creating or updating the tree, or when querying trees and subtrees. Choose the design based on which you'd like to be simpler: writing or reading.
  • Bill Karwin
    Bill Karwin over 15 years
    The "parent" design you showed in your question is easy to maintain, and easy to query if you only need immediate parent or immediate child. But not possible to get a full tree in one query, unless you use proprietary syntax like Oracle's CONNECT BY.
  • Oli
    Oli over 15 years
    Tomalak - no the code is pseudo-code. Of course you'd have to break things out into proper selects and iterators... and a real syntax! Why OOP? Because you can exactly mirror the structure. It keeps things nice and it just happens to be more efficient (only one select)
  • Tomalak
    Tomalak over 15 years
    I had no repeated selects in mind either. Regarding OOP: Mark Bessey said in his answer: "You can emulate any other data structure with a hashmap, so that's not a terrible limitation.". Your solution is correct, but I think there is some room fore improvement even without OOP.
  • matt b
    matt b over 15 years
    I see what you mean now, if it's not obvious the main algorithm is in NodeBuilder.build() - I probably could have done a better job of summing this up.
  • e-satis
    e-satis about 15 years
    Very good and usefull trick. I used and enjoyed it. +1, +10 if I could. But there is something you should add that's seems very important to me : you can't identify the first direct children with only a closure table. You need the closure table AND the adjacency list to work together to performs task like adjusting behaviour of a list of item according to the state of the children.
  • Klara_P
    Klara_P about 12 years
    @BillKarwin I'm curious about whether it's possible to sort the entire tree by node and by name -- in your example to @ashraf it sorts an entire tree, but it loses the structure. You allude to using a pathlength column to assist in sorting, in a blog post; can you elaborate on that? I might end up doing it in my scripting code, but I'm curious about the possibility of doing it in SQL.
  • Bill Karwin
    Bill Karwin about 12 years
    @Pauld'Aoust, thanks, I've added an example in a comment to my blog to which you linked.
  • Klara_P
    Klara_P about 12 years
    @BillKarwin Oop, I misunderstood your purpose of pathlength for use in sorting -- I thought it was for sorting sibling nodes rather than sorting paths. I was awake at 4:30 this morning figuring out how to sort siblings in a whole tree, and I think I determined that I have to do it in my application code.
  • Bill Karwin
    Bill Karwin about 12 years
    @Pauld'Aoust, right, I haven't found a really elegant way of sorting siblings in any of the various ways of storing hierarchical data. That's something I'd like to find a solution for.
  • user
    user almost 10 years
    @BillKarwin Your answers & presentation on Closure tables are very informative. Other than storage requirements, what is the downside of closure table (in terms of query time/complexity)?
  • Bill Karwin
    Bill Karwin almost 10 years
    @buffer, there's a chance to create inconsistencies as you create all the rows for a hierarchy. Adjacency List (parent_id) only has one row to express each parent-child relationship, but Closure Table has many.
  • user
    user almost 10 years
    @BillKarwin One more thing, are Closure Tables suitable for a graph with multiple paths to any given node (e.g. a category hierarchy where any leaf or non-leaf node may belong to more than one parent)
  • Bill Karwin
    Bill Karwin almost 10 years
    @buffer, I have not experimented with that so I can't answer.
  • Nisar
    Nisar almost 10 years
    cte can be used in both sqlserver and oracle @Eric Weilnau
  • Bill Tutt
    Bill Tutt over 9 years
    @buffer, as long as the directed graph is acyclic (has no cycles) the closure table approach will work. However, you'll end up having lots more rows in the closure table since the closure table is effectively all of the possible paths from the starting node. Trees after all are just a special case of a directed acyclic graph.
  • Bill Karwin
    Bill Karwin over 9 years
    @BillTutt, the problem with representing a DAG this way is that queries can find multiple paths to the same leaf node. That results in nodes being output in the result set redundantly, with multiple different paths. It's even unclear what is the path length to a given node. But this is expected, because many things about a DAG are more complex than a simple tree.
  • carlos a.
    carlos a. almost 9 years
    Good day, i got a question, and is how to migrate from Adjacency List table to a Closure, i means what is the query to do that, because in the examples the insert statement already have ancestor and descendant information and would be amazing know how to map correctly the data from "parent list" to "tree data" using querys. Thanks in Advance.
  • Bill Karwin
    Bill Karwin almost 9 years
    @carlosa. Short answer is: you have to write a program. There's no way to do it in a single SQL statement. stackoverflow.com/questions/12621873/…
  • Moslem Hadi
    Moslem Hadi over 8 years
    why do you add (1,1) or (2,2) to ClosureTable? Is it necessary?
  • Bill Karwin
    Bill Karwin over 8 years
    @Reza, so that if you add a new child node, you can query for all descendants of (1) and those are the ancestors of the new child.
  • Josef.B
    Josef.B over 8 years
    Having read of the complexities and perf of SQL and "non-table" structures, this was my first thought too, nosql. Of course, there are so many issues to exporting, etc. Plus, the OP mentioned only tables. Oh well. I'm not a DB expert, as is obvious.
  • Billy
    Billy about 8 years
    Doing research on tree based data led me to this post, which in turn lead me to your book. I don't really work with much advanced SQL, so I've had a hard time knowing whats realistic in terms of how much is feasible/timely in a single query. I've been toying with your comment SQL structure from the book but mixed in depth field you mention in this post. Could one fetch 30 top level comments and 2 comment replies both sorted by comment_date in a timely query. Or would it be better to just query for 30 top level comments sorted by time and then query for sorted sub comments separately?
  • Srikan
    Srikan over 7 years
    What if we want to get the subtrees by value. Like a node with in a subtree has highest value, we want to get that tree first ?
  • Bill Karwin
    Bill Karwin over 7 years
    @Srikan, sure, what's stopping you? Query for the node with the max value, then you have the node id. Then you query for the tree containing that node.
  • Noumenon
    Noumenon over 6 years
    I guess SQL Server is not a "popular database", because it doesn't support this standard syntax (no "RECURSIVE", must use "UNION ALL").
  • Bill Karwin
    Bill Karwin over 6 years
    @Noumenon, yes, for some reason unknown to me, Microsoft chose not to use the RECURSIVE keyword, but otherwise they do support recursive CTE syntax.
  • orustammanapov
    orustammanapov about 6 years
    I wish we would stop using abbreviations like lftand rght for column names, I mean how many characters we didn't have to type? one?!
  • Orkhan Alikhanov
    Orkhan Alikhanov almost 6 years
    @BillKarwin Could you please mention your Recursive Query Throwdown presentation in the post? It made me very comfortable with MySQL 8 but it was hard to find.
  • Bill Karwin
    Bill Karwin almost 6 years
    @OrkhanAlikhanov, thank you, good suggestion. I have added a link.
  • Milan Chheda
    Milan Chheda about 5 years
    @BillKarwin What if we have values as 4 and 5. Can we recursively identify/create the path? So ideally, if 4 and 5 are provided, we should be able to see the same diagram that OP has provided.
  • meecect
    meecect about 3 years
    It's because 'left' and 'right' are reserved words in SQL
  • Rick James
    Rick James over 2 years
    Recursive CTEs are also available in MariaDB 10.2.
  • Dan
    Dan over 2 years
    My ClosureTable has data like (1,2) (2,3) (3,4) and I want to be able to retrieve 2,3,4 when I search for 1. I don't have mysql 8 so I don't know anything about recursive queries, but is that the only way to get what I want in a single query? The above answer requires an entry to specifically link the 1 with each of the other numbers and in my situation, that is not practical as there are many many cross-connections.