Calculate minimal operations to make two tree structures identical

32,035

Solution 1

There is not only a Wikipedia article on graph isomorphism (as Space_C0wb0y points out) but also a dedicated article on the graph isomorphism problem. It has a section Solved special cases for which polynomial-time solutions are known. Trees is one of them and it cites the following two references:

Solution 2

You weren't clear if you were comparing abstract syntax trees for source code, XML documents interpreted as trees, or some other type of tree.

There's a number of papers that discuss comparing syntax trees and computing minimum distances by various means. The ideas should be relevant.

A good paper is Change Distilling, which tries to compare the source code for two abstract syntax trees and report a minimal difference. The paper talks about a specific method, and also briefly mentions (and provides references) to a variety of similar techniques.

Few of these algorithms are actually realized in available tools for comparing computer program source text. Our Smart Differencer is one of them; it is driven by an explicit language grammar for many languages.

Solution 3

Although this question is old, i'll add a couple more references and algorithms below:

  1. X-Diff: An Effective Change Detection Algorithm for XML Documents , Yuan Wang, David J. DeWitt, Jin-Yi Cai
  2. KF-Diff+: Highly Efficient Change Detection Algorithm for XML Documents
  3. diffX: An Algorithm to Detect Changes in Multi-Version XML Documents
  4. Change Detection in XML Trees: a Survey, Luuk Peters
  5. Similarity in Tree Data Structures

Furthermore, there are libraries and frameworks on GitHub (in javascript) which implement diffing of Tree-like structures for example applications dealing with JSON data or XML Trees (e.g for client-side MVC/MVVM):

  1. React.js
  2. JSON-Patch
  3. jsondiffpatch
  4. objectDiff

Solution 4

In case folks find this question and need something implemented for Node.js or the browser, I'm providing a link and code example for an implementation I've written that you can find on github here: (https://github.com/hoonto/jqgram.git) based on the existing PyGram Python code (https://github.com/Sycondaman/PyGram).

This is a tree edit distance approximation algorithm, but it is much, much faster than trying to find the true edit distance. The approximation performs in O(n log n) time and O(n) space whereas true edit distance is often O(n^3) or O(n^2) using known algorithms for true edit distance. See the academic paper from which the PQ-Gram algorithm comes: (http://www.vldb2005.org/program/paper/wed/p301-augsten.pdf)

So using jqgram:

Example:

var jq = require("jqgram").jqgram;
var root1 = {
    "thelabel": "a",
    "thekids": [
        { "thelabel": "b",
        "thekids": [
            { "thelabel": "c" },
            { "thelabel": "d" }
        ]},
        { "thelabel": "e" },
        { "thelabel": "f" }
    ]
}

var root2 = {
    "name": "a",
    "kiddos": [
        { "name": "b",
        "kiddos": [
            { "name": "c" },
            { "name": "d" },
            { "name": "y" }
        ]},
        { "name": "e" },
        { "name": "x" }
    ]
}

jq.distance({
    root: root1,
    lfn: function(node){ return node.thelabel; },
    cfn: function(node){ return node.thekids; }
},{
    root: root2,
    lfn: function(node){ return node.name; },
    cfn: function(node){ return node.kiddos; }
},{ p:2, q:3 },
function(result) {
    console.log(result.distance);
});

And that gives you a number between 0 and 1. The closer to zero, the more closely related the two trees look to jqgram. One approach could be to use jqgram to narrow in on several closely related trees from among many trees given its speed, then utilize true edit distance on the few trees remaining that you need to take a closer inspection of, and for that you can find python implementations for reference or port of the Zhang & Shasha algorithm for example.

Note that the lfn and cfn parameters specify how each tree should determine the node label names and the children array for each tree root independently so that you can do funky things like comparing an object to a browser DOM for example. All you need to do is provide those functions along with each root and jqgram will do the rest, calling your lfn and cfn provided functions to build out the trees. So in that sense it is (in my opinion anyway) much easier to use than PyGram. Plus, its Javascript, so use it client or server-side!

ALSO, to answer with respect to cycle detection, check out the clone method inside of jqgram, there is cycle detection there, but credit for that goes to the author of node-clone from which that piece was slightly modified and included.

Solution 5

This is called the tree to tree correction problem or the tree to tree editing problem. Most of the literature dealing with this explicitly relates to comparing XML trees for some reason, so searching for "XML diffing algorithm" yields a lot of results. In addition to Nikos's list of links, I found these:

I also strongly recommend reading Change Detection in XML Trees: a Survey but it is from 2005 so barely any of the tools it mentions exist anymore. Comparing XML Documents as Reference-aware Labeled Ordered Trees has the best intuitive description of some of the algorithms that I have found so far (start at section 2.1.2).

Unfortunately there doesn't seem to be much open source code available that does this and isn't ancient. Just a lot of overly-complex papers. :-/

Share:
32,035
Tomas Vana
Author by

Tomas Vana

Developer living in a lovely city of Vienna in Austria, Europe. At work doing mostly .NET and web development, intranet web applications and data warehousing solutions. For fun developing iPhone Apps, with the first one released recently at AppStore.

Updated on May 02, 2021

Comments

  • Tomas Vana
    Tomas Vana about 3 years

    This is more of a CS question, but an interesting one :

    Let's say we have 2 tree structures with more or less the same nodes reorganized. How would you find

    1. any
    2. in some sense minimal

    sequence of operations

    • MOVE(A, B) - moves node A under node B (with the whole subtree)
    • INSERT(N, B) - inserts a new node N under node B
    • DELETE (A) - deletes the node A (with the whole subtree)

    that transforms one tree to the other.

    There might obviously be cases where such transformation is not possible, trivial being root A with child B to root B with child A etc.). In such cases, the algorithm would simply deliver an result "not possible".

    Even more spectacular version is a generalization for networks, i.e. when we assume that a node can occur multiple times in the tree (effectively having multiple "parents"), while cycles are forbidden.

    Disclaimer : This is not a homework, actually it comes from a real business problem and I found it quite interesting wondering if somebody might know a solution.

  • Tomas Vana
    Tomas Vana almost 13 years
    Actually, in our case it's not source code, these are really trees. There is some semantic in those trees, but all in all not that important - they are directly manipulated by the users as a tree
  • coder14
    coder14 over 7 years
    Broken link: I just spent 20 minutes looking for the "Change Distilling" paper. Here's the updated link: merlin.uzh.ch/publication/show/2531 The software project itself has moved to bitbucket.org/sealuzh/tools-changedistiller/wiki/Home (which is how I got the correct link to the PDF)
  • john ktejik
    john ktejik almost 6 years
    does this allow multiple lfn ? I want to match more than the label, ie. also the stored value. node.value.
  • Timmmm
    Timmmm over 4 years
    Highly recommend reading the Change Detection in XML Trees: a Survey paper - it lists dozens of algorithms for XML diffing (which is just tree diffing).
  • Mengo
    Mengo almost 4 years
    I'm not be able to see this paper though, is the pdf link broken? Change Detection in XML Trees: a Survey
  • Timmmm
    Timmmm almost 4 years
    Works for me. Did you click the Download full-test PDF button? Maybe try Sci-hub if it's blocked for some reason.