Are duplicate keys allowed in the definition of binary search trees?

133,333

Solution 1

Many algorithms will specify that duplicates are excluded. For example, the example algorithms in the MIT Algorithms book usually present examples without duplicates. It is fairly trivial to implement duplicates (either as a list at the node, or in one particular direction.)

Most (that I've seen) specify left children as <= and right children as >. Practically speaking, a BST which allows either of the right or left children to be equal to the root node, will require extra computational steps to finish a search where duplicate nodes are allowed.

It is best to utilize a list at the node to store duplicates, as inserting an '=' value to one side of a node requires rewriting the tree on that side to place the node as the child, or the node is placed as a grand-child, at some point below, which eliminates some of the search efficiency.

You have to remember, most of the classroom examples are simplified to portray and deliver the concept. They aren't worth squat in many real-world situations. But the statement, "every element has a key and no two elements have the same key", is not violated by the use of a list at the element node.

So go with what your data structures book said!

Edit:

Universal Definition of a Binary Search Tree involves storing and search for a key based on traversing a data structure in one of two directions. In the pragmatic sense, that means if the value is <>, you traverse the data structure in one of two 'directions'. So, in that sense, duplicate values don't make any sense at all.

This is different from BSP, or binary search partition, but not all that different. The algorithm to search has one of two directions for 'travel', or it is done (successfully or not.) So I apologize that my original answer didn't address the concept of a 'universal definition', as duplicates are really a distinct topic (something you deal with after a successful search, not as part of the binary search.)

Solution 2

If your binary search tree is a red black tree, or you intend to any kind of "tree rotation" operations, duplicate nodes will cause problems. Imagine your tree rule is this:

left < root <= right

Now imagine a simple tree whose root is 5, left child is nil, and right child is 5. If you do a left rotation on the root you end up with a 5 in the left child and a 5 in the root with the right child being nil. Now something in the left tree is equal to the root, but your rule above assumed left < root.

I spent hours trying to figure out why my red/black trees would occasionally traverse out of order, the problem was what I described above. Hopefully somebody reads this and saves themselves hours of debugging in the future!

Solution 3

All three definitions are acceptable and correct. They define different variations of a BST.

Your college data structure's book failed to clarify that its definition was not the only possible.

Certainly, allowing duplicates adds complexity. If you use the definition "left <= root < right" and you have a tree like:

      3
    /   \
  2       4

then adding a "3" duplicate key to this tree will result in:

      3
    /   \
  2       4
    \
     3

Note that the duplicates are not in contiguous levels.

This is a big issue when allowing duplicates in a BST representation as the one above: duplicates may be separated by any number of levels, so checking for duplicate's existence is not that simple as just checking for immediate childs of a node.

An option to avoid this issue is to not represent duplicates structurally (as separate nodes) but instead use a counter that counts the number of occurrences of the key. The previous example would then have a tree like:

      3(1)
    /     \
  2(1)     4(1)

and after insertion of the duplicate "3" key it will become:

      3(2)
    /     \
  2(1)     4(1)

This simplifies lookup, removal and insertion operations, at the expense of some extra bytes and counter operations.

Solution 4

In a BST, all values descending on the left side of a node are less than (or equal to, see later) the node itself. Similarly, all values descending on the right side of a node are greater than (or equal to) that node value(a).

Some BSTs may choose to allow duplicate values, hence the "or equal to" qualifiers above. The following example may clarify:

     14
    /  \
  13    22
 /     /  \
1    16    29
          /  \
        28    29

This shows a BST that allows duplicates(b) - you can see that to find a value, you start at the root node and go down the left or right subtree depending on whether your search value is less than or greater than the node value.

This can be done recursively with something like:

def hasVal (node, srchval):
    if node == NULL:
         return false
    if node.val == srchval:
        return true
    if node.val > srchval:
        return hasVal (node.left, srchval)
    return hasVal (node.right, srchval)

and calling it with:

foundIt = hasVal (rootNode, valToLookFor)

Duplicates add a little complexity since you may need to keep searching once you've found your value, for other nodes of the same value. Obviously that doesn't matter for hasVal since it doesn't matter how many there are, just whether at least one exists. It will however matter for things like countVal, since it needs to know how many there are.


(a) You could actually sort them in the opposite direction should you so wish provided you adjust how you search for a specific key. A BST need only maintain some sorted order, whether that's ascending or descending (or even some weird multi-layer-sort method like all odd numbers ascending, then all even numbers descending) is not relevant.


(b) Interestingly, if your sorting key uses the entire value stored at a node (so that nodes containing the same key have no other extra information to distinguish them), there can be performance gains from adding a count to each node, rather than allowing duplicate nodes.

The main benefit is that adding or removing a duplicate will simply modify the count rather than inserting or deleting a new node (an action that may require re-balancing the tree).

So, to add an item, you first check if it already exists. If so, just increment the count and exit. If not, you need to insert a new node with a count of one then rebalance.

To remove an item, you find it then decrement the count - only if the resultant count is zero do you then remove the actual node from the tree and rebalance.

Searches are also quicker given there are fewer nodes but that may not be a large impact.

For example, the following two trees (non-counting on the left, and counting on the right) would be equivalent (in the counting tree, i.c means c copies of item i):

     __14__                    ___22.2___
    /      \                  /          \
  14        22             7.1            29.1
 /  \      /  \           /   \          /    \
1    14  22    29      1.1     14.3  28.1      30.1
 \            /  \
  7         28    30

Removing the leaf-node 22 from the left tree would involve rebalancing (since it now has a height differential of two) the resulting 22-29-28-30 subtree such as below (this is one option, there are others that also satisfy the "height differential must be zero or one" rule):

\                      \
 22                     29
   \                   /  \
    29      -->      28    30
   /  \             /
 28    30         22

Doing the same operation on the right tree is a simple modification of the root node from 22.2 to 22.1 (with no rebalancing required).

Solution 5

In the book "Introduction to algorithms", third edition, by Cormen, Leiserson, Rivest and Stein, a binary search tree (BST) is explicitly defined as allowing duplicates. This can be seen in figure 12.1 and the following (page 287):

"The keys in a binary search tree are always stored in such a way as to satisfy the binary-search-tree property: Let x be a node in a binary search tree. If y is a node in the left subtree of x, then y:key <= x:key. If y is a node in the right subtree of x, then y:key >= x:key."

In addition, a red-black tree is then defined on page 308 as:

"A red-black tree is a binary search tree with one extra bit of storage per node: its color"

Therefore, red-black trees defined in this book support duplicates.

Share:
133,333

Related videos on Youtube

Mike Sar
Author by

Mike Sar

Updated on December 14, 2020

Comments

  • Mike Sar
    Mike Sar over 3 years

    I'm trying to find the definition of a binary search tree and I keep finding different definitions everywhere.

    Some say that for any given subtree the left child key is less than or equal to the root.

    Some say that for any given subtree the right child key is greater than or equal to the root.

    And my old college data structures book says "every element has a key and no two elements have the same key."

    Is there a universal definition of a bst? Particularly in regards to what to do with trees with multiple instances of the same key.

    EDIT: Maybe I was unclear, the definitions I'm seeing are

    1) left <= root < right

    2) left < root <= right

    3) left < root < right, such that no duplicate keys exist.

  • SoapBox
    SoapBox over 15 years
    Should perhaps, but it is not required for the properties of a search tree.
  • nickf
    nickf over 15 years
    #1 condition of a BST: "each node (item in the tree) has a distinct value;" en.wikipedia.org/wiki/Binary_search_tree
  • SoapBox
    SoapBox over 15 years
    Wikipedia isn't the end all source of knowledge. Explain to me what is invalid about having duplicate keys in a BST other than wasting space... it maintains all important properties.
  • Mr Fooz
    Mr Fooz over 15 years
    +1: Different people use different definitions. If you implement a new BST, you need to make sure you're explicit about which assumptions you're making about duplicate entries.
  • Mitch Wheat
    Mitch Wheat over 15 years
    The usual convention is less than => left, greater than => right
  • Mike Sar
    Mike Sar over 15 years
    Seems like the consensus is (left <= root <= right) when allowing duplicates. But that some folks definition of a BST does not allow dups. Or maybe some people teach it that way to avoid the additional complexity.
  • Mitch Wheat
    Mitch Wheat over 15 years
    incorrect! it's EITHER left <= root < right OR left < root <= right, OR left > root >= right OR left >= root > right
  • Mike Sar
    Mike Sar over 15 years
    What?! Where are you getting this definition from?
  • Mitch Wheat
    Mitch Wheat over 15 years
    The less than or greater than relationship either points left OR right. Where you implement the equals is implementation dependent but it must only be on one side.
  • Mike Sar
    Mike Sar over 15 years
    Again, provide me a link to a definition of a BST that says the relationship can go either way.
  • paxdiablo
    paxdiablo over 15 years
    @Mitch, it doesn't HAVE to be so, it just makes the processing of the tree a lot easier. Having said that, I'd question the sanity of anyone who implemented it otherwise (arbitrarily left or right rather than same way always).
  • Pacerier
    Pacerier about 12 years
    What are the disadvantages of using a list at the node?
  • Rich
    Rich about 12 years
    Don't rotate when you have equal nodes! Traverse down to the next level and rotate that.
  • bneil
    bneil over 11 years
    For the duplicate case, can you just check if the right child is the same as the current node in the node.val == srchval: clause, and then if so go right?
  • paxdiablo
    paxdiablo almost 8 years
    Other solutions are to either change the tree rule to be left <= node <= right, or only insert before the first occurrence of a value.
  • Björn Lindqvist
    Björn Lindqvist over 7 years
    What problems can this cause in practice? Seem to me that if you are ok with left <= node <= right, then all the red-black tree operations will work out anyhow.
  • SimpleGuy
    SimpleGuy almost 7 years
    @Pacerier I think instead of maintaining a list, we can maintain a reference count at each node and update the count when duplicate occurs. Such an algorithm would be much easier and efficient in searching and storing. Also, it would require minimal changes to the existing algorithm which does not support duplicates.
  • Helin Wang
    Helin Wang almost 6 years
    Could you explain why left <= root <= right is not ideal?
  • Ashwin
    Ashwin almost 6 years
    Have a look at the accepted answer by @paxdiablo - You can see the duplicate value can exist with >=. Ideal depends on your requirements, but if you do have a lot of duplicate values, and you allow the duplicates to exist in the structure, your bst can end up being linear - ie O(n).
  • Oloff Biermann
    Oloff Biermann almost 5 years
    I'm very surprised that this was never even mentioned in the textbook I'm using. The prof didn't mention it either, nor the fact that duplicate keys are even an issue smh...
  • Andrew
    Andrew almost 4 years
    @bneil Way late, but: No, you cannot, because, due to the way self-balancing BST's rebalance to the medians in order to maintain reasonable heights/weights of subtrees (you don't want a doubly linked list), you can easily have situations where the left child, the current node, and the right child are all the same value, unless you were to explicitly ensure that e.g. with a >= comparison, only the leftmost node of a set of duplicates can be the parent (thus ensuring all are right children); this could lead to a disastrous tree if there are many of the same duplicates though.
  • Andrew
    Andrew almost 4 years
    @bneil Lazy Ren's answer explains it well: "...even if search() finds the node with the key, it must traverse down to the leaf node for the nodes with [the] duplicate key." As an example, take the tree in paxdiablo's answer here and swap out that 28 with another 29. You can imagine there might be more 29's in their children as well. duilio's answer has another great example.
  • Andrew
    Andrew almost 4 years
    @OloffBiermann Probably because most people just think, "We've covered Binary Search Trees, ./" without considering the significant implications of allowing duplicates. Perhaps they decided if you understand BST's then you can make your own modifications as needed. In their defense, the number of tree structures alone that are possible is humongous; there are sooooo many different implementation details about them. As just a starter, take a look here: en.wikipedia.org/wiki/List_of_data_structures#Trees
  • Andrew
    Andrew almost 4 years
    @BjörnLindqvist As mentioned in another answer, the problem with allowing <= <= is that you're basically destroying one of the main purposes of having a BST in the first place: to have quick operations on your sorted collection. Unless you do what Rich said or you just put all of the duplicates into the same node, you're then going to have to traverse down possibly to the very bottom of the tree to check for any duplicates.
  • Andrew
    Andrew almost 4 years
    @Rich One problem I have with your solution is it basically assumes that there won't be many, many duplicates of the same key. If there are, then your tree will end up extremely unbalanced. So this should only be considered, if ever, imo, if you're certain that the duplicates won't ever be a very common occurrence. It seems like for a general-purpose BST, duplicates using the same node is the way to go.
  • paxdiablo
    paxdiablo almost 4 years
    A binary search tree doesn't have to allow duplicates, that's just an option. It could also disallow odd numbers, or primes, or strings with too many vowels, or any other type of data. The only real requirement is that it's ordered in some way, and preferably self-balancing.
  • Andrew
    Andrew almost 4 years
    @SimpleGuy If you meant a reference list, I can agree with that. If you really meant a reference count (i.e. have multiple nodes but one stores the count), I think that's not going to work. Which node will maintain the count? What if the tree needs to rebalance that node to a lower level etc.?
  • Andrew
    Andrew almost 4 years
    This was a great answer. I'd mark it as the answer if I could. #4 is definitely the "right" way. (P.S. There's a 6th way not addressed here, but I responded to it with a comment below: stackoverflow.com/a/339597/1599699)
  • Björn Lindqvist
    Björn Lindqvist almost 4 years
    @Andrew Why? You'd designate the left or right branch for equality and you'll have just as fast lookup as before.
  • Andrew
    Andrew almost 4 years
    @BjörnLindqvist No you won't. Either you'll need to break that paradigm for the rebalance, or as I said you're opening yourself up to a potentially awful BST if you have many duplicates of the same key. If you break that paradigm for the rebalance, then you could have right descendants of the left child or left descendants of the right child that are equivalent after the rebalance.
  • Nuclearman
    Nuclearman over 3 years
    Really like the counter approach here, seems like a solid example.