How to find the max distance between a set of nodes on a tree?

15,851

Solution 1

Your problem is also known as finding the diameter of a tree: among all shortest paths between pairs of nodes, you're seeking the longest.

Denote the diameter of a tree S by d(S), and its height by h(S).

The two most distant nodes in a tree S with subtrees S1...Sd can be either under one of its subtrees, or they might span two subtrees. In the first case, when the two most distant nodes are under subtree Si, d(S) is just d(Si). In the second case, when the two most distance nodes span two subtrees, say Si and Sj, their distance is h(Si) + h(Sj) + 2 because the two nodes must be the two deepest nodes in each subtree, plus two more edges to join the two subtrees. In fact, in this second case, Si and Sj must be the heighest and second heighest subtree of S.

An O(n) algorithm would proceed as follows

Algorithm d(S)

1. recursively compute d(S1)...d(Sd) and h(S1)...h(Sd) of the subtrees of S.
2. denote by Si be the deepest subtree and Sj the second deepest subtree
3. return max(d(S1), ..., d(Sd), h(Si)+h(Sj)+2)

Analysis

Lines 2 and 3 each take O(d) time to compute. But each node is examined only once by these lines, so in recursion, this takes a total of O(n).

Solution 2

I have a simple O(n) greedy algorithm for this interesting problem.

Algorithm

  1. Select arbitrary vertice X as the root of the tree, and then find the vertice Y who has the maximum distance with the root X. The complexity of this step is O(n).
  2. Make vertice Y as the new root of the tree, then find the vertice Z who has the maximum distance with the root Y. And the distance between Y and Z is the maximum of the distances in the tree. The complexity of this step is also O(n).
  3. The total complexity of this greedy algorithm is O(n).

Proof

  • It is obviously that Y and Z form one diameter of the tree, and we call Y and Z a pair of corners of the tree.
  • Theorem: For each vertice P in the tree, either Y or Z would be the vertice who has the maximum distance with it.
  • The step 1 of the algorithm is based on Theorem, so we can easily get one corner (Y) of the tree.
  • The second corner Z is found based on the Theorem as well.

Extend

According to the Theorem in the proof, we can solve another more challenging problem: For each vertice in the tree, calculate who is fareast vertice to it.

  • We can find the two corners of the tree in O(n) complexity, and then we can use the Theorem again.
  • From corner Y and Z we do dfs respectively, and for each vertices p[i] we can get the distance to Y and Z(we call them disY[i] and disZ[i]), so the fareast distance for p[i] is max(disY[i], disZ[i]). Due to we just do dfs twice, so we can get the infomation in O(n) complexity.
  • This extended problem can also solved by complexible tree dynamic programming whose complexity is also O(n).

Solution 3

Assume the maximum-length path between two nodes goes through our root-node. Then one of the two nodes must belong to one child's subtree, and the other must belong to a different child's subtree. It's easy to see, then, that those two nodes are the lowest/deepest descendants of those two children, which means the distance between those two nodes is height(child1) + height(child2) + 2. Thus, the maximum-length path between two nodes that goes through our root is max-height-of-a-child + second-to-max-height-of-a-child + 2.

This gives us a simple O(n) algorithm to find the overall maximum-length path: just do the above for every non-leaf node. Since every path must be rooted at some non-leaf node, this guarantees we'll consider the correct path at some point.

Finding the height of a subtree is O(n), but, since you can build up the heights recursively, finding the height of every subtree is conveniently also O(n). In fact, you don't even need to find the heights as a separate step; you could find the max-length path and subtree-heights at the same time, meaning this algorithm only requires O(n) time and O(height-of-tree) space.

Solution 4

It's a recursive algorithm. Here's the pseudo code (untested ocaml code):

 type result = {n1 : node; n2 : node; d1 : int (* depth of node n1 *); d2 : int; distance: int}
(* a struct containing:
    - the couple of nodes (n1,n2),
    - the depth of the nodes, with depth(n1) >= depth(n2)
    - the distance between n1 & n2 *)


let find_max (n : node) : result =
 let max (s1 : result) (s2 : result) = if s1.distance < s2.distance then s2 else s1 in
 let cl : node list = Node.children n in
 if cl = []
 then { n1 = n; n2 = n; d1 = 0; d2 = 0; distance = 0 }
 else 
   let ml = List.map find_max cl in
   let nl = List.map (fun e -> e.n1, e.d1+1) ml in
   let (k1,d1)::(k2,d2)::nl = nl in
   let k1,d1,k2,d2 = if d1 > d2 then k1,d1,k2,d2 else k2,d2,k1,d1 in
   let s = {n1 = k1;n2 = k2; d1 = d1; d2 = d2; distance = d1+d2} in
   let m1 =  List.fold_left (fun r (e,d) -> 
                      if r.d1< d
                      then { r with n1 = e; d1 = d; distance = d+d2 }
                      else if r.d2 < d 
                               then { r with n2 = e; d2 = d; distance = d+d1 }
                               else r) s nl in
   max m1 (List.fold_left max (List.hd ml) (List.tl ml))

The m1 value is built by keeping the two deepest nodes of the nl list with distance being the sum of their depths.

List.map is a function applying a given function to all the elements of a list and returning the resulting list.

List.fold_left is a function applying recursively a given function to an accumulator and the elements of a list, each time using the result of the previous application as the new accumulator value. The result is the last accumulator value.

List.hd returns the first element of a list.

List.tl returns a list without the first element.

Share:
15,851
Daryl
Author by

Daryl

Updated on June 06, 2022

Comments

  • Daryl
    Daryl almost 2 years

    I have a set of n nodes on a (non-binary) tree. I want to find the maximum of the distances between any two of the nodes. (I define the distance between two nodes to be the sum of the distances between those nodes and their lowest common ancestor).

    I can easily solve this problem in O(n^2) by just computing the distance between each node to each other node and getting the maximum, however I'm hoping for something better as this is far too slow* for my application scenario.

    (Extra info: In my application scenario, these nodes are actually files and the tree is a directory structure. Therefore, the tree is quite shallow (depth < ~10), but it may have 300,000+ nodes. The sets of files can be anywhere between ~3-200 in size. Effectively, I'm trying to figure out how far spread out are my files in each set.)*

    Edit: Perhaps I can ask an equivalent question to prompt more answers: Consider a subset of the original tree that only contains the nodes in my set and the nodes necessary to connect them. Then the question becomes: How do I find the longest simple path in an undirected acyclical graph?

    *Edit 2: As didierc pointed out, I actually should be considering sets of folders not files. This makes my sets smaller and the exhaustive approach may be fast enough. Still, it would be beneficial to see a faster solution, and I'm curious to see if there is one.

  • moos
    moos over 11 years
    for some reason, it's not letting me comment @BlueRaja 's solution so i'm commenting here. it's possible that the two most distant nodes are under the same subtree, in which case your algorithm would incorrectly try to span two subtrees. For example, your algorithm gets the wrong diameter when the root node has two subtrees, one of which is a very deep balanced tree and the other of which is a single node.
  • moos
    moos over 11 years
    there's a subtle flaw in this algorithm. it's possible that the two most distant nodes are under the same subtree, in which case your algorithm would incorrectly still try to span two subtrees. For example, your algorithm gets the wrong diameter when the root node has two subtrees, one of which is a very deep balanced tree and the other of which is a single node.
  • BlueRaja - Danny Pflughoeft
    BlueRaja - Danny Pflughoeft over 11 years
    @moos: Incorrect; if both nodes are under the same subtree, then the root node will not be part of the longest path between two nodes (that is, it's not those nodes' greatest common ancestor). In that case, you will find the greatest common ancestor while recursing (remember, you need to keep track of the greatest distance found after running this algorithm over all non-leaf nodes). I've updated this answer to hopefully make that more clear.
  • BlueRaja - Danny Pflughoeft
    BlueRaja - Danny Pflughoeft over 11 years
    There is no flaw in my algorithm; in fact, our algorithms are exactly the same.
  • moos
    moos over 11 years
    there remains one more issue: +1 should be +2 (spanning across the root requires two edges, not just one)
  • BlueRaja - Danny Pflughoeft
    BlueRaja - Danny Pflughoeft over 11 years
    @moos: hehe yep, I just noticed that myself, and have already fixed it :)
  • Daryl
    Daryl over 11 years
    Wow, I really appreciate the simple elegance of this solution.
  • Daryl
    Daryl over 11 years
    Thanks for the information about the tree's diameter; I didn't know this. I chose your answer because it was complete, insightful, understandable and practical to implement.
  • moos
    moos over 11 years
    great solution. here's a proof that it works. The algorithm finds a pair of nodes x0,y0 such that max_x d(x,x0) = max_y d(x0,y) (that is, x0 and y0 are each others' farthest nodes). For any such pair, d(x0,y0) is the diameter. Proof: let x*,y* be two nodes s.t. d(x*,y*) is the diameter. there exist nodes r and s such so that the paths look like x0--r--s--y0 and x*--r--s--y*. Suppose d(x0,y0) < d(x*,y*), then either d(x0,y*) > d(x0,y0) or d(y*,x0) > d(y0,x0), contradicting the fact that x0 and y0 are each others' farthest points. therefore d(x0,y0)=d(x*,y*).
  • Niklas B.
    Niklas B. over 10 years
    There's another nice linear algorithm to find the diameter: choose any node and run DFS on the tree. Take the node that was furthest from the start node. Run a DFS from that furthest node. The maximum distance in the second DFS is the diameter.