What are the options for storing hierarchical data in a relational database?
Solution 1
My favorite answer is what the first sentence in this thread suggested: use an Adjacency List to maintain the hierarchy and Nested Sets to query it.
The problem until now has been that converting an Adjacency List to Nested Sets has been frightfully slow, because most people use the extreme RBAR (row-by-agonizing-row) method known as a "push stack" to do the conversion. That has been considered far too expensive a way to reach the Nirvana of simple maintenance with the Adjacency List plus the awesome performance of Nested Sets. As a result, most people end up settling for one or the other, especially if there are more than, say, a lousy 100,000 nodes or so. The push stack method can take a whole day to convert what MLM'ers would consider a small, million-node hierarchy.
I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.
Duration for 1,000 Nodes = 00:00:00:870
Duration for 10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10)
Duration for 100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100)
Duration for 1,000,000 Nodes = 'Didn't even try this'
And here's the duration for the new method (with the push stack method in parentheses).
Duration for 1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870)
Duration for 10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783)
Duration for 100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730)
Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)
Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.
You can read about the new method and get a copy of the code at the following URL. http://www.sqlservercentral.com/articles/Hierarchy/94040/
I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article. http://www.sqlservercentral.com/articles/T-SQL/94570/
If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.
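To make concrete what "converting an Adjacency List to Nested Sets" has to compute, here is a minimal sketch in plain Python. It is not the method from the linked article and not the SQL push stack either; it is just an in-memory depth-first walk that numbers each node on the way in (lft) and on the way out (rgt). The function name and the sample data (borrowed from the category table in Solution 2) are assumptions for illustration.

```python
from collections import defaultdict


def adjacency_to_nested_sets(parent_of):
    """Map {node_id: parent_id or None} to {node_id: (lft, rgt)}."""
    children = defaultdict(list)
    roots = []
    for node, parent in parent_of.items():
        if parent is None:
            roots.append(node)
        else:
            children[parent].append(node)

    lft = {}
    bounds = {}
    counter = 0
    # Each stack entry is (node, leaving). leaving=True means every
    # descendant has already been numbered, so rgt can be assigned.
    stack = [(root, False) for root in sorted(roots, reverse=True)]
    while stack:
        node, leaving = stack.pop()
        counter += 1
        if leaving:
            bounds[node] = (lft[node], counter)
        else:
            lft[node] = counter
            stack.append((node, True))
            # Push children in reverse so the smallest id is visited first.
            for child in sorted(children[node], reverse=True):
                stack.append((child, False))
    return bounds


# Adjacency list for the sample category tree (id -> parent id).
parent_of = {1: None, 2: 1, 3: 2, 4: 2, 5: 2, 6: 1, 7: 6, 8: 7, 9: 6, 10: 6}
bounds = adjacency_to_nested_sets(parent_of)
```

Every node is pushed and popped twice, so the work grows linearly with the number of nodes; the hard part the article addresses is getting comparable behavior inside the database rather than in application code.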
Solution 2
Adjacency Model + Nested Sets Model
I went for it because I could insert new items to the tree easily (you just need a branch's id to insert a new item to it) and also query it quite fast.
+-------------+----------------------+--------+-----+-----+
| category_id | name                 | parent | lft | rgt |
+-------------+----------------------+--------+-----+-----+
|           1 | ELECTRONICS          |   NULL |   1 |  20 |
|           2 | TELEVISIONS          |      1 |   2 |   9 |
|           3 | TUBE                 |      2 |   3 |   4 |
|           4 | LCD                  |      2 |   5 |   6 |
|           5 | PLASMA               |      2 |   7 |   8 |
|           6 | PORTABLE ELECTRONICS |      1 |  10 |  19 |
|           7 | MP3 PLAYERS          |      6 |  11 |  14 |
|           8 | FLASH                |      7 |  12 |  13 |
|           9 | CD PLAYERS           |      6 |  15 |  16 |
|          10 | 2 WAY RADIOS         |      6 |  17 |  18 |
+-------------+----------------------+--------+-----+-----+
- Every time you need all children of any parent, you just query the parent column.
- If you need all descendants of any parent, you query for items whose lft is between the lft and rgt of the parent.
- If you need all parents of any node up to the root of the tree, you query for items having lft lower than the node's lft and rgt bigger than the node's rgt, and sort them by parent.
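The three lookups above can be sketched as follows. This is a minimal example using Python's built-in sqlite3 module rather than the database I actually used; the table name category is an assumption (the table is not named above), and the ancestor query sorts by lft instead of parent, which gives root-first order without depending on how ids were assigned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT,"
    " parent INTEGER, lft INTEGER, rgt INTEGER)"
)
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ELECTRONICS", None, 1, 20),
        (2, "TELEVISIONS", 1, 2, 9),
        (3, "TUBE", 2, 3, 4),
        (4, "LCD", 2, 5, 6),
        (5, "PLASMA", 2, 7, 8),
        (6, "PORTABLE ELECTRONICS", 1, 10, 19),
        (7, "MP3 PLAYERS", 6, 11, 14),
        (8, "FLASH", 7, 12, 13),
        (9, "CD PLAYERS", 6, 15, 16),
        (10, "2 WAY RADIOS", 6, 17, 18),
    ],
)


def names(sql, params):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql, params)]


# Direct children: only the adjacency column is needed.
children = names(
    "SELECT name FROM category WHERE parent = ? ORDER BY lft", (6,)
)

# All descendants: lft falls strictly inside the parent's (lft, rgt) interval.
descendants = names(
    "SELECT c.name FROM category c JOIN category p ON p.category_id = ?"
    " WHERE c.lft > p.lft AND c.lft < p.rgt ORDER BY c.lft",
    (2,),
)

# All ancestors: their interval encloses the node's interval.
ancestors = names(
    "SELECT a.name FROM category a JOIN category n ON n.category_id = ?"
    " WHERE a.lft < n.lft AND a.rgt > n.rgt ORDER BY a.lft",
    (8,),
)
```

Each lookup is a single non-recursive query, which is the reason for keeping both the parent column and the lft/rgt pair.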
I needed to make accessing and querying the tree faster than inserts, which is why I chose this.
The only problem is fixing the left and right columns when inserting new items. I created a stored procedure for it and called it every time I inserted a new item, which was rare in my case, but it is really fast.
I got the idea from Joe Celko's book; the stored procedure and how I came up with it are explained here on DBA SE:
https://dba.stackexchange.com/q/89051/41481
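For a rough idea of what such an insert has to do, here is a minimal sketch. It is not the stored procedure from the link above; it is the common "make a gap of 2 at the parent's rgt" approach, written against SQLite through Python's sqlite3 module. The function name insert_child, the table name category, and the three-row starting tree are all assumptions for illustration.

```python
import sqlite3


def insert_child(conn, parent_id, name):
    """Insert `name` as the last child of `parent_id` and return its new id."""
    with conn:  # commit on success, roll back if any statement fails
        (parent_rgt,) = conn.execute(
            "SELECT rgt FROM category WHERE category_id = ?", (parent_id,)
        ).fetchone()
        # Open a gap of two numbers at the parent's right edge.
        conn.execute(
            "UPDATE category SET rgt = rgt + 2 WHERE rgt >= ?", (parent_rgt,)
        )
        conn.execute(
            "UPDATE category SET lft = lft + 2 WHERE lft > ?", (parent_rgt,)
        )
        # The new leaf occupies the gap.
        cur = conn.execute(
            "INSERT INTO category (name, parent, lft, rgt) VALUES (?, ?, ?, ?)",
            (name, parent_id, parent_rgt, parent_rgt + 1),
        )
        return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT,"
    " parent INTEGER, lft INTEGER, rgt INTEGER)"
)
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ELECTRONICS", None, 1, 6),
        (2, "TELEVISIONS", 1, 2, 3),
        (3, "PORTABLE ELECTRONICS", 1, 4, 5),
    ],
)

new_id = insert_child(conn, 2, "LCD")
rows = conn.execute(
    "SELECT name, lft, rgt FROM category ORDER BY lft"
).fetchall()
```

Every row to the right of the insertion point gets rewritten, which is why this layout suits trees that are read far more often than they are changed.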
Solution 3
This is a very partial answer to your question, but I hope still useful.
Microsoft SQL Server 2008 implements two features that are extremely useful for managing hierarchical data:
- the HierarchyId data type.
- common table expressions, using the with keyword.
Have a look at "Model Your Data Hierarchies With SQL Server 2008" by Kent Tegels on MSDN for a start. See also my own question: Recursive same-table query in SQL Server 2008
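As an illustration of the common table expression approach, here is a minimal sketch of a recursive CTE that walks an adjacency list downward from one node. It runs on SQLite through Python's sqlite3 module, not on SQL Server, so treat the dialect as an assumption: SQLite spells it WITH RECURSIVE, while SQL Server uses plain WITH. The table and sample rows are borrowed from Solution 2, and HierarchyId is not shown because SQLite has no equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT,"
    " parent INTEGER)"
)
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?)",
    [
        (1, "ELECTRONICS", None),
        (2, "TELEVISIONS", 1),
        (6, "PORTABLE ELECTRONICS", 1),
        (7, "MP3 PLAYERS", 6),
        (8, "FLASH", 7),
        (9, "CD PLAYERS", 6),
        (10, "2 WAY RADIOS", 6),
    ],
)

SUBTREE_SQL = """
WITH RECURSIVE subtree(category_id, name, depth) AS (
    -- Anchor member: the node the walk starts from.
    SELECT category_id, name, 0 FROM category WHERE category_id = ?
    UNION ALL
    -- Recursive member: children of rows found so far.
    SELECT c.category_id, c.name, s.depth + 1
    FROM category AS c
    JOIN subtree AS s ON c.parent = s.category_id
)
SELECT name, depth FROM subtree ORDER BY depth, category_id
"""

subtree = conn.execute(SUBTREE_SQL, (6,)).fetchall()
```

The whole subtree comes back from one statement, which avoids issuing one query per level from application code.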
Solution 4
This design was not mentioned yet:
Multiple lineage columns
Though it has limitations, if you can bear them, it's very simple and very efficient. Features:
- Columns: one for each lineage level, referring to all the parents up to the root; levels below the current item's level are set to 0 (or NULL)
- There is a fixed limit to how deep the hierarchy can be
- Cheap ancestors, descendants, level
- Cheap insert, delete, move of the leaves
- Expensive insert, delete, move of the internal nodes
Here follows an example - taxonomic tree of birds so the hierarchy is Class/Order/Family/Genus/Species - species is the lowest level, 1 row = 1 taxon (which corresponds to species in the case of the leaf nodes):
CREATE TABLE `taxons` (
`TaxonId` smallint(6) NOT NULL default '0',
`ClassId` smallint(6) default NULL,
`OrderId` smallint(6) default NULL,
`FamilyId` smallint(6) default NULL,
`GenusId` smallint(6) default NULL,
`Name` varchar(150) NOT NULL default ''
);
and the example of the data:
+---------+---------+---------+----------+---------+-------------------------------+
| TaxonId | ClassId | OrderId | FamilyId | GenusId | Name                          |
+---------+---------+---------+----------+---------+-------------------------------+
|     254 |       0 |       0 |        0 |       0 | Aves                          |
|     255 |     254 |       0 |        0 |       0 | Gaviiformes                   |
|     256 |     254 |     255 |        0 |       0 | Gaviidae                      |
|     257 |     254 |     255 |      256 |       0 | Gavia                         |
|     258 |     254 |     255 |      256 |     257 | Gavia stellata                |
|     259 |     254 |     255 |      256 |     257 | Gavia arctica                 |
|     260 |     254 |     255 |      256 |     257 | Gavia immer                   |
|     261 |     254 |     255 |      256 |     257 | Gavia adamsii                 |
|     262 |     254 |       0 |        0 |       0 | Podicipediformes              |
|     263 |     254 |     262 |        0 |       0 | Podicipedidae                 |
|     264 |     254 |     262 |      263 |       0 | Tachybaptus                   |
+---------+---------+---------+----------+---------+-------------------------------+
This is great because this way you accomplish all the needed operations in a very easy way, as long as the internal categories don't change their level in the tree.
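Here is a minimal sketch of how the cheap operations look against the sample data above. It uses SQLite through Python's sqlite3 module instead of MySQL, so the column types are simplified, and the specific queries are my illustration of the idea rather than code from a real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxons (TaxonId INTEGER PRIMARY KEY, ClassId INTEGER,"
    " OrderId INTEGER, FamilyId INTEGER, GenusId INTEGER, Name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO taxons VALUES (?, ?, ?, ?, ?, ?)",
    [
        (254, 0, 0, 0, 0, "Aves"),
        (255, 254, 0, 0, 0, "Gaviiformes"),
        (256, 254, 255, 0, 0, "Gaviidae"),
        (257, 254, 255, 256, 0, "Gavia"),
        (258, 254, 255, 256, 257, "Gavia stellata"),
        (259, 254, 255, 256, 257, "Gavia arctica"),
        (260, 254, 255, 256, 257, "Gavia immer"),
        (261, 254, 255, 256, 257, "Gavia adamsii"),
        (262, 254, 0, 0, 0, "Podicipediformes"),
        (263, 254, 262, 0, 0, "Podicipedidae"),
        (264, 254, 262, 263, 0, "Tachybaptus"),
    ],
)

# Descendants of an order: filter on the column for that level.
descendants = [
    row[0]
    for row in conn.execute(
        "SELECT Name FROM taxons WHERE OrderId = ? ORDER BY TaxonId", (255,)
    )
]

# Ancestors of a taxon: its own lineage columns already hold their ids.
# Ordering by TaxonId is root-first only because ids in this sample
# happen to grow with depth.
ancestors = [
    row[0]
    for row in conn.execute(
        "SELECT a.Name FROM taxons t JOIN taxons a"
        " ON a.TaxonId IN (t.ClassId, t.OrderId, t.FamilyId, t.GenusId)"
        " WHERE t.TaxonId = ? ORDER BY a.TaxonId",
        (258,),
    )
]

# Level: the number of filled-in lineage columns (0 for the root).
level = conn.execute(
    "SELECT (ClassId != 0) + (OrderId != 0) + (FamilyId != 0) + (GenusId != 0)"
    " FROM taxons WHERE TaxonId = ?",
    (258,),
).fetchone()[0]
```

Note that the descendants query needs to know which level the starting node sits at, so that it can pick the matching column (OrderId for an order, FamilyId for a family, and so on).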
Solution 5
If your database supports arrays, you can also implement a lineage column or materialized path as an array of parent ids.
Specifically with Postgres you can then use the set operators to query the hierarchy, and get excellent performance with GIN indices. This makes finding parents, children, and depth pretty trivial in a single query. Updates are pretty manageable as well.
I have a full write up of using arrays for materialized paths if you're curious.
orangepips
Updated on July 08, 2022
Comments
-
orangepips almost 2 years
Good Overviews
Generally speaking, you're making a decision between fast read times (for example, nested set) or fast write times (adjacency list). Usually, you end up with a combination of the options below that best fit your needs. The following provides some in-depth reading:
- One more Nested Intervals vs. Adjacency List comparison: the best comparison of Adjacency List, Materialized Path, Nested Set, and Nested Interval I've found.
- Models for hierarchical data: slides with good explanations of tradeoffs and example usage
- Representing hierarchies in MySQL: very good overview of Nested Set in particular
- Hierarchical data in RDBMSs: a most comprehensive and well-organized set of links I've seen, but not much in the way of explanation
Options
Ones I am aware of and general features:
- Adjacency List
  - Columns: ID, ParentID
  - Easy to implement.
  - Cheap node moves, inserts, and deletes.
  - Expensive to find the level, ancestry & descendants, path
  - Avoid N+1 via Common Table Expressions in databases that support them
- Nested Set
  - Columns: Left, Right
  - Cheap ancestry, descendants
  - Very expensive O(n/2) moves, inserts, deletes due to volatile encoding
- Bridge Table (a.k.a. Closure Table /w triggers)
  - Uses separate join table with ancestor, descendant, depth (optional)
  - Cheap ancestry and descendants
  - Write cost O(log n) (size of the subtree) for inserts, updates, deletes
  - Normalized encoding: good for RDBMS statistics & query planner in joins
  - Requires multiple rows per node
- Lineage Column (a.k.a. Materialized Path, Path Enumeration)
  - Column: lineage (e.g. /parent/child/grandchild/etc...)
  - Cheap descendants via prefix query (e.g. LEFT(lineage, #) = '/enumerated/path')
  - Write cost O(log n) (size of the subtree) for inserts, updates, deletes
  - Non-relational: relies on Array datatype or serialized string format
- Nested Intervals
  - Like nested set, but with real/float/decimal so that the encoding isn't volatile (inexpensive move/insert/delete)
  - Has real/float/decimal representation/precision issues
  - Matrix encoding variant adds ancestor encoding (materialized path) for "free", but with the added trickiness of linear algebra.
- Flat Table
  - A modified Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
  - Cheap to iterate/paginate over
  - Expensive move and delete
  - Good Use: threaded discussion - forums / blog comments
- Multiple lineage columns
  - Columns: one for each lineage level, refers to all the parents up to the root, levels down from the item's level are set to NULL
  - Cheap ancestors, descendants, level
  - Cheap insert, delete, move of the leaves
  - Expensive insert, delete, move of the internal nodes
  - Hard limit to how deep the hierarchy can be
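Since the bridge (closure) table is the option least obvious from its bullet points, here is a minimal sketch of it. It uses SQLite through Python's sqlite3 module, and it maintains the closure rows from an application-side helper instead of triggers; the table names node and closure, the helper add_node, and the sample categories are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE closure (
        ancestor   INTEGER NOT NULL REFERENCES node(id),
        descendant INTEGER NOT NULL REFERENCES node(id),
        depth      INTEGER NOT NULL,
        PRIMARY KEY (ancestor, descendant)
    );
    """
)


def add_node(node_id, name, parent_id=None):
    """Insert a node plus one closure row per ancestor (and one for itself)."""
    with conn:
        conn.execute("INSERT INTO node VALUES (?, ?)", (node_id, name))
        # Every node is its own ancestor at depth 0.
        conn.execute("INSERT INTO closure VALUES (?, ?, 0)", (node_id, node_id))
        if parent_id is not None:
            # Copy the parent's ancestor rows, one level deeper.
            conn.execute(
                "INSERT INTO closure (ancestor, descendant, depth)"
                " SELECT ancestor, ?, depth + 1 FROM closure"
                " WHERE descendant = ?",
                (node_id, parent_id),
            )


add_node(1, "ELECTRONICS")
add_node(6, "PORTABLE ELECTRONICS", parent_id=1)
add_node(7, "MP3 PLAYERS", parent_id=6)
add_node(8, "FLASH", parent_id=7)

# Ancestors of FLASH, root first.
ancestors = [
    row[0]
    for row in conn.execute(
        "SELECT n.name FROM closure c JOIN node n ON n.id = c.ancestor"
        " WHERE c.descendant = ? AND c.depth > 0 ORDER BY c.depth DESC",
        (8,),
    )
]

# Descendants of PORTABLE ELECTRONICS, nearest first.
descendants = [
    row[0]
    for row in conn.execute(
        "SELECT n.name FROM closure c JOIN node n ON n.id = c.descendant"
        " WHERE c.ancestor = ? AND c.depth > 0 ORDER BY c.depth",
        (6,),
    )
]

closure_rows = conn.execute("SELECT COUNT(*) FROM closure").fetchone()[0]
```

Four nodes in a single chain already produce ten closure rows, which is the "requires multiple rows per node" cost from the list above; in exchange, both ancestor and descendant lookups are plain joins.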
Database Specific Notes
MySQL
Oracle
- Use CONNECT BY to traverse Adjacency Lists
PostgreSQL
- ltree datatype for Materialized Path
SQL Server
- General summary
- 2008 offers HierarchyId data type that appears to help with the Lineage Column approach and expand the depth that can be represented.
-
Gili over 11 years
According to slideshare.net/billkarwin/sql-antipatterns-strike-back page 77, Closure Tables are superior to Adjacency List, Path Enumeration and Nested Sets in terms of ease of use (and I'm guessing performance as well).
-
Lothar over 8 years
I miss a very simple version here: a simple BLOB. If your hierarchy only has a few dozen items, a serialized tree of IDs might be the best option.
-
orangepips over 8 years
@Lothar: the question is a community wiki, so feel free to have at it. My thought in that regard is that I would only do it with databases that support some sort of blob structuring, such as XML with a stable query language such as XPath. Otherwise I don't see a good way of querying aside from retrieve, deserialize, and munge in code, not SQL. And if you really have a problem where you need a lot of arbitrary elements, you might be better off using a node database like Neo4J, which I've used and liked, albeit never taken through to production.
-
Vadim Loboda almost 7 years
For MS SQL Server: Combination of Id-ParentId and HierarchyId Approaches to Hierarchical Data
-
kͩeͣmͮpͥ ͩ over 6 years
That MSDN link for "General Summary" no longer shows the article. It was in the September 2008 edition of MSDN Magazine, which you can download as a CHM file, or see via the web archive at: web.archive.org/web/20080913041559/http://msdn.microsoft.com:80/…
-
karns about 2 years
It would be fantastic for someone who understands this thoroughly to provide some concrete, practical examples. For instance, what method would work best for warehouse locations? A warehouse can have multiple sections, which can have multiple rows, which can have multiple levels, etc.
-
orangepips over 13 years
Interesting, the HierarchyId, didn't know about that one: msdn.microsoft.com/en-us/library/bb677290.aspx
-
CesarGon over 13 years
Indeed. I work with a lot of recursively hierarchical data, and I find common table expressions extremely useful. See msdn.microsoft.com/en-us/library/ms186243.aspx for an intro.
-
orangepips over 9 years
This answer would be immensely more useful if the use cases demonstrated, or better yet contrasted, how to query a graph database with SPARQL, for instance, instead of SQL in an RDBMS.
-
djhallx over 9 years
SPARQL is relevant to RDF databases, which are a subclass of the larger domain of graph databases. I work with InfiniteGraph, which is not an RDF database and does not currently support SPARQL. InfiniteGraph supports several different query mechanisms: (1) a graph navigation API for setting up views, filters, path qualifiers and result handlers, (2) a complex graph path pattern matching language, and (3) Gremlin.
-
orangepips over 8 years
+1 this is a legit approach. From my own experience, the key is deciding if you are OK with dirty reads when large update operations occur. If not, it becomes a matter of preventing people from querying tables directly and always going through an API - DB sprocs / functions or code.
-
Thomas about 8 years
This is an interesting solution; however, I am not sure querying the parent column really offers any major advantage when attempting to find children -- that's why we have left and right columns in the first place.
-
azerafati about 8 years
@Thomas, there is a difference between children and descendants. left and right are used to find the descendants.
-
David Mann almost 5 years
What is an MLMer?
-
Jeff Moden almost 5 years
MLM = "Multi-Level Marketing". Amway, Shaklee, ACN, etc.