What are the options for storing hierarchical data in a relational database?


Solution 1

My favorite answer is what the first sentence in this thread suggested: use an Adjacency List to maintain the hierarchy and Nested Sets to query it.

The problem up until now has been that converting an Adjacency List to Nested Sets has been frightfully slow. Most people use the extreme RBAR ("row by agonizing row") method known as a "Push Stack" to do the conversion, and it has been considered far too expensive to reach the Nirvana of combining the simple maintenance of the Adjacency List with the awesome performance of Nested Sets. As a result, most people end up settling for one or the other, especially if there are more than, say, a lousy 100,000 nodes or so. Using the push stack method can take a whole day to convert what MLM'ers would consider a small million-node hierarchy.
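For orientation, the conversion itself is just a depth-first walk that numbers each node once on the way down (lft) and once on the way back up (rgt). The sketch below does that walk in Python with an explicit stack over an in-memory parent map. It is not the set-based method from the linked article, and the function name `to_nested_sets` and the sample data are assumptions made up for illustration.

```python
from collections import defaultdict


def to_nested_sets(parent_of):
    """Turn {node: parent_or_None} into {node: (lft, rgt)} with an iterative DFS."""
    children = defaultdict(list)
    roots = []
    for node, parent in parent_of.items():
        if parent is None:
            roots.append(node)
        else:
            children[parent].append(node)

    lft, rgt = {}, {}
    counter = 1
    for root in sorted(roots):
        # Each stack entry is (node, done); done=True means all of the
        # node's descendants have already been numbered.
        stack = [(root, False)]
        while stack:
            node, done = stack.pop()
            if done:
                rgt[node] = counter
            else:
                lft[node] = counter
                stack.append((node, True))
                # Push children in reverse so the smallest id is visited first.
                for child in sorted(children[node], reverse=True):
                    stack.append((child, False))
            counter += 1
    return {node: (lft[node], rgt[node]) for node in parent_of}


# Hypothetical adjacency list: node id -> parent id (None marks the root).
parents = {1: None, 2: 1, 3: 2, 4: 2, 5: 2, 6: 1, 7: 6, 8: 7, 9: 6, 10: 6}
bounds = to_nested_sets(parents)
```

Every node gets exactly two numbers, so a tree of n nodes always ends with the root's rgt equal to 2n.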

I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested Sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.

Duration for     1,000 Nodes = 00:00:00:870 
Duration for    10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10)
Duration for   100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100) 
Duration for 1,000,000 Nodes = 'Didn't even try this'

And here's the duration for the new method (with the push stack method in parentheses).

Duration for     1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870)
Duration for    10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783)
Duration for   100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730)
Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)

Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.

You can read about the new method and get a copy of the code at the following URL. http://www.sqlservercentral.com/articles/Hierarchy/94040/

I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article. http://www.sqlservercentral.com/articles/T-SQL/94570/

If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.

Solution 2

Adjacency Model + Nested Sets Model

I went for it because I could insert new items into the tree easily (you just need a branch's id to insert a new item into it) and also query it quite fast.

+-------------+----------------------+--------+-----+-----+
| category_id | name                 | parent | lft | rgt |
+-------------+----------------------+--------+-----+-----+
|           1 | ELECTRONICS          |   NULL |   1 |  20 |
|           2 | TELEVISIONS          |      1 |   2 |   9 |
|           3 | TUBE                 |      2 |   3 |   4 |
|           4 | LCD                  |      2 |   5 |   6 |
|           5 | PLASMA               |      2 |   7 |   8 |
|           6 | PORTABLE ELECTRONICS |      1 |  10 |  19 |
|           7 | MP3 PLAYERS          |      6 |  11 |  14 |
|           8 | FLASH                |      7 |  12 |  13 |
|           9 | CD PLAYERS           |      6 |  15 |  16 |
|          10 | 2 WAY RADIOS         |      6 |  17 |  18 |
+-------------+----------------------+--------+-----+-----+
  • Every time you need all children of a parent, you just query the parent column.
  • If you need all descendants of a parent, you query for items whose lft is between the parent's lft and rgt.
  • If you need all parents of a node up to the root of the tree, you query for items having lft lower than the node's lft and rgt bigger than the node's rgt, and sort them by parent.
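The three lookups above could be written roughly as follows. This is a minimal sketch using Python's built-in `sqlite3` module with the sample rows from the table. The table name `category` and the helper `names` are assumptions, the descendant query uses strict comparisons so the parent itself is excluded, and the ancestor query orders by lft (root first) as a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE category (
        category_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        parent      INTEGER,
        lft         INTEGER NOT NULL,
        rgt         INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ELECTRONICS", None, 1, 20),
        (2, "TELEVISIONS", 1, 2, 9),
        (3, "TUBE", 2, 3, 4),
        (4, "LCD", 2, 5, 6),
        (5, "PLASMA", 2, 7, 8),
        (6, "PORTABLE ELECTRONICS", 1, 10, 19),
        (7, "MP3 PLAYERS", 6, 11, 14),
        (8, "FLASH", 7, 12, 13),
        (9, "CD PLAYERS", 6, 15, 16),
        (10, "2 WAY RADIOS", 6, 17, 18),
    ],
)


def names(sql, params):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql, params)]


# Direct children: only the adjacency (parent) column is needed.
children = names(
    "SELECT name FROM category WHERE parent = ? ORDER BY lft", (6,)
)

# All descendants: rows whose lft falls strictly inside the parent's interval.
descendants = names(
    """
    SELECT d.name
    FROM category AS d
    JOIN category AS p ON d.lft > p.lft AND d.lft < p.rgt
    WHERE p.category_id = ?
    ORDER BY d.lft
    """,
    (6,),
)

# All ancestors: rows whose interval strictly encloses the node's interval.
ancestors = names(
    """
    SELECT a.name
    FROM category AS a
    JOIN category AS n ON a.lft < n.lft AND a.rgt > n.rgt
    WHERE n.category_id = ?
    ORDER BY a.lft
    """,
    (8,),
)
```

Here `children` and `descendants` are computed for PORTABLE ELECTRONICS (id 6) and `ancestors` for FLASH (id 8).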

I needed accessing and querying the tree to be faster than inserts; that's why I chose this.

The only problem is fixing the left and right columns when inserting new items. I created a stored procedure for it and called it every time I inserted a new item, which was rare in my case, but it is really fast. I got the idea from Joe Celko's book; the stored procedure and how I came up with it are explained on DBA SE: https://dba.stackexchange.com/q/89051/41481
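To show what "fixing the left and right columns" involves, here is a minimal sketch of the textbook way to add a new leaf as the last child of a parent: open a gap of two numbers at the parent's rgt, then place the new row in the gap. This is not the stored procedure from the linked post. The function name `add_child` and the small sample tree are assumptions, and it runs against SQLite through Python's `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE category (
        category_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        parent      INTEGER,
        lft         INTEGER NOT NULL,
        rgt         INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ELECTRONICS", None, 1, 8),
        (2, "TELEVISIONS", 1, 2, 5),
        (3, "LCD", 2, 3, 4),
        (4, "PORTABLE ELECTRONICS", 1, 6, 7),
    ],
)


def add_child(conn, parent_id, name):
    """Insert `name` as the last child of `parent_id` and return its new id."""
    with conn:  # one transaction: commit on success, roll back on error
        (parent_rgt,) = conn.execute(
            "SELECT rgt FROM category WHERE category_id = ?", (parent_id,)
        ).fetchone()
        # Shift everything at or to the right of the insertion point by 2.
        conn.execute(
            "UPDATE category SET rgt = rgt + 2 WHERE rgt >= ?", (parent_rgt,)
        )
        conn.execute(
            "UPDATE category SET lft = lft + 2 WHERE lft > ?", (parent_rgt,)
        )
        # The new leaf occupies the gap that was just opened.
        cur = conn.execute(
            "INSERT INTO category (name, parent, lft, rgt) VALUES (?, ?, ?, ?)",
            (name, parent_id, parent_rgt, parent_rgt + 1),
        )
        return cur.lastrowid


new_id = add_child(conn, 2, "PLASMA")
tree = conn.execute("SELECT name, lft, rgt FROM category ORDER BY lft").fetchall()
```

The two UPDATE statements touch every row to the right of the insertion point, which is why inserts are the expensive operation in this design.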

Solution 3

This is a very partial answer to your question, but I hope still useful.

Microsoft SQL Server 2008 implements two features that are extremely useful for managing hierarchical data:

  • the HierarchyId data type.
  • common table expressions, using the WITH keyword.

Have a look at "Model Your Data Hierarchies With SQL Server 2008" by Kent Tegels on MSDN for a start. See also my own question: "Recursive same-table query in SQL Server 2008".
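As a rough illustration of the common table expression approach, the sketch below walks a subtree of a plain adjacency list with a recursive CTE. It runs against SQLite through Python's `sqlite3` module rather than SQL Server, and the table `node` and its sample rows are assumptions. SQLite accepts `WITH RECURSIVE`, whereas SQL Server writes the same query with plain `WITH`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER
    )
""")
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [
        (1, "ELECTRONICS", None),
        (2, "TELEVISIONS", 1),
        (6, "PORTABLE ELECTRONICS", 1),
        (7, "MP3 PLAYERS", 6),
        (8, "FLASH", 7),
        (9, "CD PLAYERS", 6),
    ],
)

# Anchor member: the starting node. Recursive member: children of rows found so far.
subtree = conn.execute(
    """
    WITH RECURSIVE subtree(id, name, depth) AS (
        SELECT id, name, 0 FROM node WHERE id = ?
        UNION ALL
        SELECT n.id, n.name, s.depth + 1
        FROM node AS n
        JOIN subtree AS s ON n.parent_id = s.id
    )
    SELECT name, depth FROM subtree ORDER BY depth, id
    """,
    (6,),
).fetchall()
```

The whole subtree comes back from a single statement, which avoids issuing one query per level from application code.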

Solution 4

This design was not mentioned yet:

Multiple lineage columns

Though it has limitations, if you can bear them, it's very simple and very efficient. Features:

  • Columns: one for each lineage level, referring to all the parents up to the root; levels below the current item's level are set to 0 (or NULL)
  • There is a fixed limit to how deep the hierarchy can be
  • Cheap ancestors, descendants, level
  • Cheap insert, delete, move of the leaves
  • Expensive insert, delete, move of the internal nodes

Here is an example: a taxonomic tree of birds, so the hierarchy is Class/Order/Family/Genus/Species. Species is the lowest level, and 1 row = 1 taxon (which corresponds to a species in the case of the leaf nodes):

CREATE TABLE `taxons` (
  `TaxonId` smallint(6) NOT NULL default '0',
  `ClassId` smallint(6) default NULL,
  `OrderId` smallint(6) default NULL,
  `FamilyId` smallint(6) default NULL,
  `GenusId` smallint(6) default NULL,
  `Name` varchar(150) NOT NULL default ''
);

and the example of the data:

+---------+---------+---------+----------+---------+-------------------------------+
| TaxonId | ClassId | OrderId | FamilyId | GenusId | Name                          |
+---------+---------+---------+----------+---------+-------------------------------+
|     254 |       0 |       0 |        0 |       0 | Aves                          |
|     255 |     254 |       0 |        0 |       0 | Gaviiformes                   |
|     256 |     254 |     255 |        0 |       0 | Gaviidae                      |
|     257 |     254 |     255 |      256 |       0 | Gavia                         |
|     258 |     254 |     255 |      256 |     257 | Gavia stellata                |
|     259 |     254 |     255 |      256 |     257 | Gavia arctica                 |
|     260 |     254 |     255 |      256 |     257 | Gavia immer                   |
|     261 |     254 |     255 |      256 |     257 | Gavia adamsii                 |
|     262 |     254 |       0 |        0 |       0 | Podicipediformes              |
|     263 |     254 |     262 |        0 |       0 | Podicipedidae                 |
|     264 |     254 |     262 |      263 |       0 | Tachybaptus                   |
+---------+---------+---------+----------+---------+-------------------------------+

This is great because this way you accomplish all the needed operations in a very easy way, as long as the internal categories don't change their level in the tree.
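A minimal sketch of how the cheap operations might look, using the sample rows above loaded into SQLite through Python's `sqlite3` module. The helper names `level_of`, `ancestors` and `descendants` are assumptions, and the code relies on 0 (not NULL) marking unused levels, as in the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE taxons (
        TaxonId  INTEGER PRIMARY KEY,
        ClassId  INTEGER NOT NULL DEFAULT 0,
        OrderId  INTEGER NOT NULL DEFAULT 0,
        FamilyId INTEGER NOT NULL DEFAULT 0,
        GenusId  INTEGER NOT NULL DEFAULT 0,
        Name     TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO taxons VALUES (?, ?, ?, ?, ?, ?)",
    [
        (254, 0, 0, 0, 0, "Aves"),
        (255, 254, 0, 0, 0, "Gaviiformes"),
        (256, 254, 255, 0, 0, "Gaviidae"),
        (257, 254, 255, 256, 0, "Gavia"),
        (258, 254, 255, 256, 257, "Gavia stellata"),
        (259, 254, 255, 256, 257, "Gavia arctica"),
        (260, 254, 255, 256, 257, "Gavia immer"),
        (261, 254, 255, 256, 257, "Gavia adamsii"),
        (262, 254, 0, 0, 0, "Podicipediformes"),
        (263, 254, 262, 0, 0, "Podicipedidae"),
        (264, 254, 262, 263, 0, "Tachybaptus"),
    ],
)

# Lineage columns from the root level down; index = level of the owning taxon.
LEVEL_COLUMNS = ["ClassId", "OrderId", "FamilyId", "GenusId"]


def level_of(taxon_id):
    """Level = number of filled lineage columns (0 for the root class)."""
    sql = (
        "SELECT (ClassId != 0) + (OrderId != 0) + (FamilyId != 0) + (GenusId != 0) "
        "FROM taxons WHERE TaxonId = ?"
    )
    return conn.execute(sql, (taxon_id,)).fetchone()[0]


def ancestors(taxon_id):
    """Ancestor ids, root first, read straight from the row's own columns."""
    row = conn.execute(
        "SELECT ClassId, OrderId, FamilyId, GenusId FROM taxons WHERE TaxonId = ?",
        (taxon_id,),
    ).fetchone()
    return [ancestor for ancestor in row if ancestor != 0]


def descendants(taxon_id):
    """Names of all descendants: one equality filter on the matching column."""
    level = level_of(taxon_id)
    if level >= len(LEVEL_COLUMNS):
        return []  # species are leaves
    column = LEVEL_COLUMNS[level]  # taken from the fixed list above, not user input
    sql = f"SELECT Name FROM taxons WHERE {column} = ? ORDER BY TaxonId"
    return [row[0] for row in conn.execute(sql, (taxon_id,))]
```

For example, `ancestors(258)` reads the lineage of Gavia stellata from a single row, and `descendants(256)` finds everything under Gaviidae with one filter on `FamilyId`.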

Solution 5

If your database supports arrays, you can also implement a lineage column or materialized path as an array of parent ids.

Specifically with Postgres you can then use the set operators to query the hierarchy, and get excellent performance with GIN indices. This makes finding parents, children, and depth pretty trivial in a single query. Updates are pretty manageable as well.

I have a full write up of using arrays for materialized paths if you're curious.

Author: orangepips

Updated on July 08, 2022

Comments

  • orangepips
    orangepips almost 2 years

    Good Overviews

    Generally speaking, you're making a decision between fast read times (for example, nested set) or fast write times (adjacency list). Usually, you end up with a combination of the options below that best fit your needs. The following provides some in-depth reading:

    Options

    Ones I am aware of and general features:

    1. Adjacency List:
    • Columns: ID, ParentID
    • Easy to implement.
    • Cheap node moves, inserts, and deletes.
    • Expensive to find the level, ancestry & descendants, path
    • Avoid N+1 via Common Table Expressions in databases that support them
    2. Nested Set (a.k.a. Modified Preorder Tree Traversal)
    • Columns: Left, Right
    • Cheap ancestry, descendants
    • Very expensive O(n/2) moves, inserts, deletes due to volatile encoding
    3. Bridge Table (a.k.a. Closure Table w/ triggers)
    • Uses separate join table with ancestor, descendant, depth (optional)
    • Cheap ancestry and descendants
    • Write costs are O(log n) (size of the subtree) for inserts, updates, deletes
    • Normalized encoding: good for RDBMS statistics & query planner in joins
    • Requires multiple rows per node
    4. Lineage Column (a.k.a. Materialized Path, Path Enumeration)
    • Column: lineage (e.g. /parent/child/grandchild/etc...)
    • Cheap descendants via prefix query (e.g. LEFT(lineage, #) = '/enumerated/path')
    • Write costs are O(log n) (size of the subtree) for inserts, updates, deletes
    • Non-relational: relies on Array datatype or serialized string format
    5. Nested Intervals
    • Like nested set, but with real/float/decimal so that the encoding isn't volatile (inexpensive move/insert/delete)
    • Has real/float/decimal representation/precision issues
    • Matrix encoding variant adds ancestor encoding (materialized path) for "free", but with the added trickiness of linear algebra.
    6. Flat Table
    • A modified Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
    • Cheap to iterate/paginate over
    • Expensive move and delete
    • Good Use: threaded discussion - forums / blog comments
    7. Multiple lineage columns
    • Columns: one for each lineage level, refers to all the parents up to the root, levels down from the item's level are set to NULL
    • Cheap ancestors, descendants, level
    • Cheap insert, delete, move of the leaves
    • Expensive insert, delete, move of the internal nodes
    • Hard limit to how deep the hierarchy can be
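As one concrete illustration of the options listed, here is a minimal Bridge (Closure) Table sketch in SQLite, driven from Python's `sqlite3` module. The table names `node` and `closure` and the helper `add_node` are assumptions, and the closure rows are maintained in application code here rather than by triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE closure (
        ancestor   INTEGER NOT NULL REFERENCES node(id),
        descendant INTEGER NOT NULL REFERENCES node(id),
        depth      INTEGER NOT NULL,
        PRIMARY KEY (ancestor, descendant)
    );
""")


def add_node(node_id, name, parent_id=None):
    """Insert a node plus one closure row per ancestor (and one for itself)."""
    with conn:  # single transaction
        conn.execute("INSERT INTO node VALUES (?, ?)", (node_id, name))
        # Every node is its own ancestor at depth 0.
        conn.execute("INSERT INTO closure VALUES (?, ?, 0)", (node_id, node_id))
        if parent_id is not None:
            # Copy the parent's ancestor rows, one level deeper, for the new node.
            conn.execute(
                """
                INSERT INTO closure (ancestor, descendant, depth)
                SELECT ancestor, ?, depth + 1 FROM closure WHERE descendant = ?
                """,
                (node_id, parent_id),
            )


add_node(1, "ELECTRONICS")
add_node(6, "PORTABLE ELECTRONICS", 1)
add_node(7, "MP3 PLAYERS", 6)
add_node(8, "FLASH", 7)

# Descendants of node 6, nearest first.
descendants = [row[0] for row in conn.execute(
    """
    SELECT n.name FROM closure AS c
    JOIN node AS n ON n.id = c.descendant
    WHERE c.ancestor = ? AND c.depth > 0
    ORDER BY c.depth
    """,
    (6,),
)]

# Ancestors of node 8, root first.
ancestors = [row[0] for row in conn.execute(
    """
    SELECT n.name FROM closure AS c
    JOIN node AS n ON n.id = c.ancestor
    WHERE c.descendant = ? AND c.depth > 0
    ORDER BY c.depth DESC
    """,
    (8,),
)]

closure_rows = conn.execute("SELECT COUNT(*) FROM closure").fetchone()[0]
```

Both lookups are plain joins on indexed columns, while the row count shows the "multiple rows per node" cost: this four-node chain already needs ten closure rows.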

    Database Specific Notes

    MySQL

    Oracle

    PostgreSQL

    SQL Server

    • General summary
    • 2008 offers HierarchyId data type that appears to help with the Lineage Column approach and expand the depth that can be represented.
    • Gili
      Gili over 11 years
      According to slideshare.net/billkarwin/sql-antipatterns-strike-back page 77, Closure Tables are superior to Adjacency List, Path Enumeration and Nested Sets in terms of ease of use (and I'm guessing performance as well).
    • Lothar
      Lothar over 8 years
      I miss a very simple version here: a simple BLOB. If your hierarchy only has a few dozen items, a serialized tree of ids might be the best option.
    • orangepips
      orangepips over 8 years
      @Lothar: the question is a community wiki, so feel free to have at it. My thought in that regard is that I would only do it with those databases that support some sort of blob structuring, such as XML, with a stable query language such as XPath. Otherwise I don't see a good way of querying aside from retrieve, deserialize, and munge in code, not SQL. And if you really have a problem where you need a lot of arbitrary elements, you might be better off using a node database like Neo4J, which I've used and liked, albeit never taken through to production.
    • kͩeͣmͮpͥ ͩ
      kͩeͣmͮpͥ ͩ over 6 years
      That MSDN link for "General Summary" no longer shows the article. It was in the September 2008 edition of MSDN Magazine, which you can download as a CHM file, or see via the web archive at: web.archive.org/web/20080913041559/http://msdn.microsoft.com‌​:80/…
    • karns
      karns about 2 years
      It would be fantastic for someone who understands this thoroughly to provide some concrete, practical examples. For instance, what method would work best for warehouse locations? A warehouse can have multiple sections, which can have multiple rows, which can have multiple levels, etc.
  • orangepips
    orangepips over 13 years
    Interesting, the HierarchyId, didn't know about that one: msdn.microsoft.com/en-us/library/bb677290.aspx
  • CesarGon
    CesarGon over 13 years
    Indeed. I work with a lot of recursively hierarchical data, and I find common table expressions extremely useful. See msdn.microsoft.com/en-us/library/ms186243.aspx for an intro.
  • orangepips
    orangepips over 9 years
    This answer would be immensely more useful if the use cases demonstrated, or better yet contrasted, how to query a graph database with SPARQL for instance instead of SQL in an RDBMS.
  • djhallx
    djhallx over 9 years
    SPARQL is relevant to RDF databases which are a subclass of the larger domain of graph databases. I work with InfiniteGraph which is not an RDF database and does not currently support SPARQL. InfiniteGraph supports several different query mechanisms: (1) a graph navigation API for setting up views, filters, path qualifiers and result handlers, (2) a complex graph path pattern matching language, and (3) Gremlin.
  • orangepips
    orangepips over 8 years
    +1 this is a legit approach. From my own experience, the key is deciding if you are OK with dirty reads when large update operations occur. If not, it becomes a matter of preventing people from querying tables directly and always going through an API - DB sprocs / functions or code.
  • Thomas
    Thomas about 8 years
    This is an interesting solution; however, I am not sure querying the parent column really offers any major advantage when attempting to find children -- that's why we have left and right columns, in the first place.
  • azerafati
    azerafati about 8 years
    @Thomas, there is a difference between children and descendants. left and right are used to find the descendants.
  • David Mann
    David Mann almost 5 years
    What is a MLMer?
  • Jeff Moden
    Jeff Moden almost 5 years
    MLM = "Multi-Level Marketing". Amway, Shaklee, ACN, etc, etc.