Ukkonen's suffix tree algorithm in plain English

191,855

Solution 1

The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (i.e. does not contain any repeated characters), and then extending it to the full algorithm.

First, a few preliminary statements.

  1. What we are building, is basically like a search trie. So there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth

  2. But: Unlike in a search trie, the edge labels are not single characters. Instead, each edge is labeled using a pair of integers [from,to]. These are pointers into the text. In this sense, each edge carries a string label of arbitrary length, but takes only O(1) space (two pointers).

Basic principle

I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:

abc

The algorithm works in steps, from left to right. There is one step for every character of the string. Each step might involve more than one individual operation, but we will see (see the final observations at the end) that the total number of operations is O(n).

So, we start from the left, and first insert only the single character a by creating an edge from the root node (on the left) to a leaf, and labeling it as [0,#], which means the edge represents the substring starting at position 0 and ending at the current end. I use the symbol # to mean the current end, which is at position 1 (right after a).

So we have an initial tree, which looks like this:

And what it means is this:

Now we progress to position 2 (right after b). Our goal at each step is to insert all suffixes up to the current position. We do this by

  • expanding the existing a-edge to ab
  • inserting one new edge for b

In our representation this looks like

enter image description here

And what it means is:

We observe two things:

  • The edge representation for ab is the same as it used to be in the initial tree: [0,#]. Its meaning has automatically changed because we updated the current position # from 1 to 2.
  • Each edge consumes O(1) space, because it consists of only two pointers into the text, regardless of how many characters it represents.

Next we increment the position again and update the tree by appending a c to every existing edge and inserting one new edge for the new suffix c.

In our representation this looks like

And what it means is:

We observe:

  • The tree is the correct suffix tree up to the current position after each step
  • There are as many steps as there are characters in the text
  • The amount of work in each step is O(1), because all existing edges are updated automatically by incrementing #, and inserting the one new edge for the final character can be done in O(1) time. Hence for a string of length n, only O(n) time is required.

First extension: Simple repetitions

Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:

abcabxabcd

It starts with abc as in the previous example, then ab is repeated and followed by x, and then abc is repeated followed by d.

Steps 1 through 3: After the first 3 steps we have the tree from the previous example:

Step 4: We move # to position 4. This implicitly updates all existing edges to this:

and we need to insert the final suffix of the current step, a, at the root.

Before we do this, we introduce two more variables (in addition to #), which of course have been there all the time but we haven't used them so far:

  • The active point, which is a triple (active_node,active_edge,active_length)
  • The remainder, which is an integer indicating how many new suffixes we need to insert

The exact meaning of these two will become clear soon, but for now let's just say:

  • In the simple abc example, the active point was always (root,'\0x',0), i.e. active_node was the root node, active_edge was specified as the null character '\0x', and active_length was zero. The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge. We will see soon why a triple is necessary to represent this information.
  • The remainder was always set to 1 at the beginning of each step. The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).

Now this is going to change. When we insert the current final character a at the root, we notice that there is already an outgoing edge starting with a, specifically: abca. Here is what we do in such a case:

  • We do not insert a fresh edge [4,#] at the root node. Instead we simply notice that the suffix a is already in our tree. It ends in the middle of a longer edge, but we are not bothered by that. We just leave things the way they are.
  • We set the active point to (root,'a',1). That means the active point is now somewhere in the middle of outgoing edge of the root node that starts with a, specifically, after position 1 on that edge. We notice that the edge is specified simply by its first character a. That suffices because there can be only one edge starting with any particular character (confirm that this is true after reading through the entire description).
  • We also increment remainder, so at the beginning of the next step it will be 2.

Observation: When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the active point and remainder). The tree is then not an accurate representation of the suffix tree up to the current position any more, but it contains all suffixes (because the final suffix a is contained implicitly). Hence, apart from updating the variables (which are all of fixed length, so this is O(1)), there was no work done in this step.

Step 5: We update the current position # to 5. This automatically updates the tree to this:

And because remainder is 2, we need to insert two final suffixes of the current position: ab and b. This is basically because:

  • The a suffix from the previous step has never been properly inserted. So it has remained, and since we have progressed one step, it has now grown from a to ab.
  • And we need to insert the new final edge b.

In practice this means that we go to the active point (which points to behind the a on what is now the abcab edge), and insert the current final character b. But: Again, it turns out that b is also already present on that same edge.

So, again, we do not change the tree. We simply:

  • Update the active point to (root,'a',2) (same node and edge as before, but now we point to behind the b)
  • Increment the remainder to 3 because we still have not properly inserted the final edge from the previous step, and we don't insert the current final edge either.

To be clear: We had to insert ab and b in the current step, but because ab was already found, we updated the active point and did not even attempt to insert b. Why? Because if ab is in the tree, every suffix of it (including b) must be in the tree, too. Perhaps only implicitly, but it must be there, because of the way we have built the tree so far.

We proceed to step 6 by incrementing #. The tree is automatically updated to:

Because remainder is 3, we have to insert abx, bx and x. The active point tells us where ab ends, so we only need to jump there and insert the x. Indeed, x is not there yet, so we split the abcabx edge and insert an internal node:

The edge representations are still pointers into the text, so splitting and inserting an internal node can be done in O(1) time.

So we have dealt with abx and decrement remainder to 2. Now we need to insert the next remaining suffix, bx. But before we do that we need to update the active point. The rule for this, after splitting and inserting an edge, will be called Rule 1 below, and it applies whenever the active_node is root (we will learn rule 3 for other cases further below). Here is rule 1:

After an insertion from root,

  • active_node remains root
  • active_edge is set to the first character of the new suffix we need to insert, i.e. b
  • active_length is reduced by 1

Hence, the new active-point triple (root,'b',1) indicates that the next insert has to be made at the bcabx edge, behind 1 character, i.e. behind b. We can identify the insertion point in O(1) time and check whether x is already present or not. If it was present, we would end the current step and leave everything the way it is. But x is not present, so we insert it by splitting the edge:

Again, this took O(1) time and we update remainder to 1 and the active point to (root,'x',0) as rule 1 states.

But there is one more thing we need to do. We'll call this Rule 2:

If we split an edge and insert a new node, and if that is not the first node created during the current step, we connect the previously inserted node and the new node through a special pointer, a suffix link. We will later see why that is useful. Here is what we get, the suffix link is represented as a dotted edge:

We still need to insert the final suffix of the current step, x. Since the active_length component of the active node has fallen to 0, the final insert is made at the root directly. Since there is no outgoing edge at the root node starting with x, we insert a new edge:

As we can see, in the current step all remaining inserts were made.

We proceed to step 7 by setting #=7, which automatically appends the next character, a, to all leaf edges, as always. Then we attempt to insert the new final character to the active point (the root), and find that it is there already. So we end the current step without inserting anything and update the active point to (root,'a',1).

In step 8, #=8, we append b, and as seen before, this only means we update the active point to (root,'a',2) and increment remainder without doing anything else, because b is already present. However, we notice (in O(1) time) that the active point is now at the end of an edge. We reflect this by re-setting it to (node1,'\0x',0). Here, I use node1 to refer to the internal node the ab edge ends at.

Then, in step #=9, we need to insert 'c' and this will help us to understand the final trick:

Second extension: Using suffix links

As always, the # update appends c automatically to the leaf edges and we go to the active point to see if we can insert 'c'. It turns out 'c' exists already at that edge, so we set the active point to (node1,'c',1), increment remainder and do nothing else.

Now in step #=10, remainder is 4, and so we first need to insert abcd (which remains from 3 steps ago) by inserting d at the active point.

Attempting to insert d at the active point causes an edge split in O(1) time:

The active_node, from which the split was initiated, is marked in red above. Here is the final rule, Rule 3:

After splitting an edge from an active_node that is not the root node, we follow the suffix link going out of that node, if there is any, and reset the active_node to the node it points to. If there is no suffix link, we set the active_node to the root. active_edge and active_length remain unchanged.

So the active point is now (node2,'c',1), and node2 is marked in red below:

Since the insertion of abcd is complete, we decrement remainder to 3 and consider the next remaining suffix of the current step, bcd. Rule 3 has set the active point to just the right node and edge so inserting bcd can be done by simply inserting its final character d at the active point.

Doing this causes another edge split, and because of rule 2, we must create a suffix link from the previously inserted node to the new one:

We observe: Suffix links enable us to reset the active point so we can make the next remaining insert at O(1) effort. Look at the graph above to confirm that indeed node at label ab is linked to the node at b (its suffix), and the node at abc is linked to bc.

The current step is not finished yet. remainder is now 2, and we need to follow rule 3 to reset the active point again. Since the current active_node (red above) has no suffix link, we reset to root. The active point is now (root,'c',1).

Hence the next insert occurs at the one outgoing edge of the root node whose label starts with c: cabxabcd, behind the first character, i.e. behind c. This causes another split:

And since this involves the creation of a new internal node,we follow rule 2 and set a new suffix link from the previously created internal node:

(I am using Graphviz Dot for these little graphs. The new suffix link caused dot to re-arrange the existing edges, so check carefully to confirm that the only thing that was inserted above is a new suffix link.)

With this, remainder can be set to 1 and since the active_node is root, we use rule 1 to update the active point to (root,'d',0). This means the final insert of the current step is to insert a single d at root:

That was the final step and we are done. There are number of final observations, though:

  • In each step we move # forward by 1 position. This automatically updates all leaf nodes in O(1) time.

  • But it does not deal with a) any suffixes remaining from previous steps, and b) with the one final character of the current step.

  • remainder tells us how many additional inserts we need to make. These inserts correspond one-to-one to the final suffixes of the string that ends at the current position #. We consider one after the other and make the insert. Important: Each insert is done in O(1) time since the active point tells us exactly where to go, and we need to add only one single character at the active point. Why? Because the other characters are contained implicitly (otherwise the active point would not be where it is).

  • After each such insert, we decrement remainder and follow the suffix link if there is any. If not we go to root (rule 3). If we are at root already, we modify the active point using rule 1. In any case, it takes only O(1) time.

  • If, during one of these inserts, we find that the character we want to insert is already there, we don't do anything and end the current step, even if remainder>0. The reason is that any inserts that remain will be suffixes of the one we just tried to make. Hence they are all implicit in the current tree. The fact that remainder>0 makes sure we deal with the remaining suffixes later.

  • What if at the end of the algorithm remainder>0? This will be the case whenever the end of the text is a substring that occurred somewhere before. In that case we must append one extra character at the end of the string that has not occurred before. In the literature, usually the dollar sign $ is used as a symbol for that. Why does that matter? --> If later we use the completed suffix tree to search for suffixes, we must accept matches only if they end at a leaf. Otherwise we would get a lot of spurious matches, because there are many strings implicitly contained in the tree that are not actual suffixes of the main string. Forcing remainder to be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. However, if we want to use the tree to search for general substrings, not only suffixes of the main string, this final step is indeed not required, as suggested by the OP's comment below.

  • So what is the complexity of the entire algorithm? If the text is n characters in length, there are obviously n steps (or n+1 if we add the dollar sign). In each step we either do nothing (other than updating the variables), or we make remainder inserts, each taking O(1) time. Since remainder indicates how many times we have done nothing in previous steps, and is decremented for every insert that we make now, the total number of times we do something is exactly n (or n+1). Hence, the total complexity is O(n).

  • However, there is one small thing that I did not properly explain: It can happen that we follow a suffix link, update the active point, and then find that its active_length component does not work well with the new active_node. For example, consider a situation like this:

(The dashed lines indicate the rest of the tree. The dotted line is a suffix link.)

Now let the active point be (red,'d',3), so it points to the place behind the f on the defg edge. Now assume we made the necessary updates and now follow the suffix link to update the active point according to rule 3. The new active point is (green,'d',3). However, the d-edge going out of the green node is de, so it has only 2 characters. In order to find the correct active point, we obviously need to follow that edge to the blue node and reset to (blue,'f',1).

In a particularly bad case, the active_length could be as large as remainder, which can be as large as n. And it might very well happen that to find the correct active point, we need not only jump over one internal node, but perhaps many, up to n in the worst case. Does that mean the algorithm has a hidden O(n2) complexity, because in each step remainder is generally O(n), and the post-adjustments to the active node after following a suffix link could be O(n), too?

No. The reason is that if indeed we have to adjust the active point (e.g. from green to blue as above), that brings us to a new node that has its own suffix link, and active_length will be reduced. As we follow down the chain of suffix links we make the remaining inserts, active_length can only decrease, and the number of active-point adjustments we can make on the way can't be larger than active_length at any given time. Since active_length can never be larger than remainder, and remainder is O(n) not only in every single step, but the total sum of increments ever made to remainder over the course of the entire process is O(n) too, the number of active point adjustments is also bounded by O(n).

Solution 2

I tried to implement the Suffix Tree with the approach given in jogojapan's answer, but it didn't work for some cases due to wording used for the rules. Moreover, I've mentioned that nobody managed to implement an absolutely correct suffix tree using this approach. Below I will write an "overview" of jogojapan's answer with some modifications to the rules. I will also describe the case when we forget to create important suffix links.

Additional variables used

  1. active point - a triple (active_node; active_edge; active_length), showing from where we must start inserting a new suffix.
  2. remainder - shows the number of suffixes we must add explicitly. For instance, if our word is 'abcaabca', and remainder = 3, it means we must process 3 last suffixes: bca, ca and a.

Let's use a concept of an internal node - all the nodes, except the root and the leafs are internal nodes.

Observation 1

When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the active point and remainder).

Observation 2

If at some point active_length is greater or equal to the length of current edge (edge_length), we move our active point down until edge_length is strictly greater than active_length.

Now, let's redefine the rules:

Rule 1

If after an insertion from the active node = root, the active length is greater than 0, then:

  1. active node is not changed
  2. active length is decremented
  3. active edge is shifted right (to the first character of the next suffix we must insert)

Rule 2

If we create a new internal node OR make an inserter from an internal node, and this is not the first SUCH internal node at current step, then we link the previous SUCH node with THIS one through a suffix link.

This definition of the Rule 2 is different from jogojapan', as here we take into account not only the newly created internal nodes, but also the internal nodes, from which we make an insertion.

Rule 3

After an insert from the active node which is not the root node, we must follow the suffix link and set the active node to the node it points to. If there is no a suffix link, set the active node to the root node. Either way, active edge and active length stay unchanged.

In this definition of Rule 3 we also consider the inserts of leaf nodes (not only split-nodes).

And finally, Observation 3:

When the symbol we want to add to the tree is already on the edge, we, according to Observation 1, update only active point and remainder, leaving the tree unchanged. BUT if there is an internal node marked as needing suffix link, we must connect that node with our current active node through a suffix link.

Let's look at the example of a suffix tree for cdddcdc if we add a suffix link in such case and if we don't:

  1. If we DON'T connect the nodes through a suffix link:

    • before adding the last letter c:

    • after adding the last letter c:

  2. If we DO connect the nodes through a suffix link:

    • before adding the last letter c:

    • after adding the last letter c:

Seems like there is no significant difference: in the second case there are two more suffix links. But these suffix links are correct, and one of them - from the blue node to the red one - is very important for our approach with active point. The problem is that if we don't put a suffix link here, later, when we add some new letters to the tree, we might omit adding some nodes to the tree due to the Rule 3, because, according to it, if there's no a suffix link, then we must put the active_node to the root.

When we were adding the last letter to the tree, the red node had already existed before we made an insert from the blue node (the edge labled 'c'). As there was an insert from the blue node, we mark it as needing a suffix link. Then, relying on the active point approach, the active node was set to the red node. But we don't make an insert from the red node, as the letter 'c' is already on the edge. Does it mean that the blue node must be left without a suffix link? No, we must connect the blue node with the red one through a suffix link. Why is it correct? Because the active point approach guarantees that we get to a right place, i.e., to the next place where we must process an insert of a shorter suffix.

Finally, here are my implementations of the Suffix Tree:

  1. Java
  2. C++

Hope that this "overview" combined with jogojapan's detailed answer will help somebody to implement his own Suffix Tree.

Solution 3

Apologies if my answer seems redundant, but I implemented Ukkonen's algorithm recently, and found myself struggling with it for days; I had to read through multiple papers on the subject to understand the why and how of some core aspects of the algorithm.

I found the 'rules' approach of previous answers unhelpful for understanding the underlying reasons, so I've written everything below focusing solely on the pragmatics. If you've struggled with following other explanations, just like I did, perhaps my supplemental explanation will make it 'click' for you.

I published my C# implementation here: https://github.com/baratgabor/SuffixTree

Please note that I'm not an expert on this subject, so the following sections may contain inaccuracies (or worse). If you encounter any, feel free to edit.

Prerequisites

The starting point of the following explanation assumes you're familiar with the content and use of suffix trees, and the characteristics of Ukkonen's algorithm, e.g. how you're extending the suffix tree character by character, from start to end. Basically, I assume you've read some of the other explanations already.

(However, I did have to add some basic narrative for the flow, so the beginning might indeed feel redundant.)

The most interesting part is the explanation on the difference between using suffix links and rescanning from the root. This is what gave me a lot of bugs and headaches in my implementation.

Open-ended leaf nodes and their limitations

I'm sure you already know that the most fundamental 'trick' is to realize we can just leave the end of the suffixes 'open', i.e. referencing the current length of the string instead of setting the end to a static value. This way when we add additional characters, those characters will be implicitly added to all suffix labels, without having to visit and update all of them.

But this open ending of suffixes – for obvious reasons – works only for nodes that represent the end of the string, i.e. the leaf nodes in the tree structure. The branching operations we execute on the tree (the addition of new branch nodes and leaf nodes) won't propagate automatically everywhere they need to.

It's probably elementary, and wouldn't require mention, that repeated substrings don't appear explicitly in the tree, since the tree already contains these by virtue of them being repetitions; however, when the repetitive substring ends by encountering a non-repeating character, we need to create a branching at that point to represent the divergence from that point onwards.

For example in case of the string 'ABCXABCY' (see below), a branching to X and Y needs to be added to three different suffixes, ABC, BC and C; otherwise it wouldn't be a valid suffix tree, and we couldn't find all substrings of the string by matching characters from the root downwards.

Once again, to emphasize – any operation we execute on a suffix in the tree needs to be reflected by its consecutive suffixes as well (e.g. ABC > BC > C), otherwise they simply cease to be valid suffixes.

Repeating branching in suffixes

But even if we accept that we have to do these manual updates, how do we know how many suffixes need to be updated? Since, when we add the repeated character A (and the rest of the repeated characters in succession), we have no idea yet when/where do we need to split the suffix into two branches. The need to split is ascertained only when we encounter the first non-repeating character, in this case Y (instead of the X that already exists in the tree).

What we can do is to match the longest repeated string we can, and count how many of its suffixes we need to update later. This is what 'remainder' stands for.

The concept of 'remainder' and 'rescanning'

The variable remainder tells us how many repeated characters we added implicitly, without branching; i.e. how many suffixes we need to visit to repeat the branching operation once we found the first character that we cannot match. This essentially equals to how many characters 'deep' we are in the tree from its root.

So, staying with the previous example of the string ABCXABCY, we match the repeated ABC part 'implicitly', incrementing remainder each time, which results in remainder of 3. Then we encounter the non-repeating character 'Y'. Here we split the previously added ABCX into ABC->X and ABC->Y. Then we decrement remainder from 3 to 2, because we already took care of the ABC branching. Now we repeat the operation by matching the last 2 characters – BC – from the root to reach the point where we need to split, and we split BCX too into BC->X and BC->Y. Again, we decrement remainder to 1, and repeat the operation; until the remainder is 0. Lastly, we need to add the current character (Y) itself to the root as well.

This operation, following the consecutive suffixes from the root simply to reach the point where we need to do an operation is what's called 'rescanning' in Ukkonen's algorithm, and typically this is the most expensive part of the algorithm. Imagine a longer string where you need to 'rescan' long substrings, across many dozens of nodes (we'll discuss this later), potentially thousands of times.

As a solution, we introduce what we call 'suffix links'.

The concept of 'suffix links'

Suffix links basically point to the positions we'd normally have to 'rescan' to, so instead of the expensive rescan operation we can simply jump to the linked position, do our work, jump to the next linked position, and repeat – until there are no more positions to update.

Of course one big question is how to add these links. The existing answer is that we can add the links when we insert new branch nodes, utilizing the fact that, in each extension of the tree, the branch nodes are naturally created one after another in the exact order we'd need to link them together. Though, we have to link from the last created branch node (the longest suffix) to the previously created one, so we need to cache the last we create, link that to the next one we create, and cache the newly created one.

One consequence is that we actually often don't have suffix links to follow, because the given branch node was just created. In these cases we have to still fall back to the aforementioned 'rescanning' from root. This is why, after an insertion, you're instructed to either use the suffix link, or jump to root.

(Or alternatively, if you're storing parent pointers in the nodes, you can try to follow the parents, check if they have a link, and use that. I found that this is very rarely mentioned, but the suffix link usage is not set in stones. There are multiple possible approaches, and if you understand the underlying mechanism you can implement one that fits your needs the best.)

The concept of 'active point'

So far we discussed multiple efficient tools for building the tree, and vaguely referred to traversing over multiple edges and nodes, but haven't yet explored the corresponding consequences and complexities.

The previously explained concept of 'remainder' is useful for keeping track where we are in the tree, but we have to realize it doesn't store enough information.

Firstly, we always reside on a specific edge of a node, so we need to store the edge information. We shall call this 'active edge'.

Secondly, even after adding the edge information, we still have no way to identify a position that is farther down in the tree, and not directly connected to the root node. So we need to store the node as well. Let's call this 'active node'.

Lastly, we can notice that the 'remainder' is inadequate to identify a position on an edge that is not directly connected to root, because 'remainder' is the length of the entire route; and we probably don't want to bother with remembering and subtracting the length of the previous edges. So we need a representation that is essentially the remainder on the current edge. This is what we call 'active length'.

This leads to what we call 'active point' – a package of three variables that contain all the information we need to maintain about our position in the tree:

Active Point = (Active Node, Active Edge, Active Length)

You can observe on the following image how the matched route of ABCABD consists of 2 characters on the edge AB (from root), plus 4 characters on the edge CABDABCABD (from node 4) – resulting in a 'remainder' of 6 characters. So, our current position can be identified as Active Node 4, Active Edge C, Active Length 4.

Remainder and Active Point

Another important role of the 'active point' is that it provides an abstraction layer for our algorithm, meaning that parts of our algorithm can do their work on the 'active point', irrespective of whether that active point is in the root or anywhere else. This makes it easy to implement the use of suffix links in our algorithm in a clean and straight-forward way.

Differences of rescanning vs using suffix links

Now, the tricky part, something that – in my experience – can cause plenty of bugs and headaches, and is poorly explained in most sources, is the difference in processing the suffix link cases vs the rescan cases.

Consider the following example of the string 'AAAABAAAABAAC':

Remainder across multiple edges

You can observe above how the 'remainder' of 7 corresponds to the total sum of characters from root, while 'active length' of 4 corresponds to the sum of matched characters from the active edge of the active node.

Now, after executing a branching operation at the active point, our active node might or might not contain a suffix link.

If a suffix link is present: We only need to process the 'active length' portion. The 'remainder' is irrelevant, because the node where we jump to via the suffix link already encodes the correct 'remainder' implicitly, simply by virtue of being in the tree where it is.

If a suffix link is NOT present: We need to 'rescan' from zero/root, which means processing the whole suffix from the beginning. To this end we have to use the whole 'remainder' as the basis of rescanning.

Example comparison of processing with and without a suffix link

Consider what happens at the next step of the example above. Let's compare how to achieve the same result – i.e. moving to the next suffix to process – with and without a suffix link.

Using 'suffix link'

Reaching consecutive suffixes via suffix links

Notice that if we use a suffix link, we are automatically 'at the right place'. Which is often not strictly true due to the fact that the 'active length' can be 'incompatible' with the new position.

In the case above, since the 'active length' is 4, we're working with the suffix 'ABAA', starting at the linked Node 4. But after finding the edge that corresponds to the first character of the suffix ('A'), we notice that our 'active length' overflows this edge by 3 characters. So we jump over the full edge, to the next node, and decrement 'active length' by the characters we consumed with the jump.

Then, after we found the next edge 'B', corresponding to the decremented suffix 'BAA', we finally note that the edge length is larger than the remaining 'active length' of 3, which means we found the right place.

Please note that it seems this operation is usually not referred to as 'rescanning', even though to me it seems it's the direct equivalent of rescanning, just with a shortened length and a non-root starting point.

Using 'rescan'

Reaching consecutive suffixes via rescanning

Notice that if we use a traditional 'rescan' operation (here pretending we didn't have a suffix link), we start at the top of the tree, at root, and we have to work our way down again to the right place, following along the entire length of the current suffix.

The length of this suffix is the 'remainder' we discussed before. We have to consume the entirety of this remainder, until it reaches zero. This might (and often does) include jumping through multiple nodes, at each jump decreasing the remainder by the length of the edge we jumped through. Then finally, we reach an edge that is longer than our remaining 'remainder'; here we set the active edge to the given edge, set 'active length' to remaining 'remainder', and we're done.

Note, however, that the actual 'remainder' variable needs to be preserved, and only decremented after each node insertion. So what I described above assumed using a separate variable initialized to 'remainder'.

Notes on suffix links & rescans

1) Notice that both methods lead to the same result. Suffix link jumping is, however, significantly faster in most cases; that's the whole rationale behind suffix links.

2) The actual algorithmic implementations don't need to differ. As I mentioned above, even in the case of using the suffix link, the 'active length' is often not compatible with the linked position, since that branch of the tree might contain additional branching. So essentially you just have to use 'active length' instead of 'remainder', and execute the same rescanning logic until you find an edge that is shorter than your remaining suffix length.

3) One important remark pertaining to performance is that there is no need to check each and every character during rescanning. Due to the way a valid suffix tree is built, we can safely assume that the characters match. So you're mostly counting the lengths, and the only need for character equivalence checking arises when we jump to a new edge, since edges are identified by their first character (which is always unique in the context of a given node). This means that 'rescanning' logic is different than full string matching logic (i.e. searching for a substring in the tree).

4) The original suffix linking described here is just one of the possible approaches. For example NJ Larsson et al. names this approach as Node-Oriented Top-Down, and compares it to Node-Oriented Bottom-Up and two Edge-Oriented varieties. The different approaches have different typical and worst case performances, requirements, limitations, etc., but it generally seems that Edge-Oriented approaches are an overall improvement to the original.

Solution 4

@jogojapan you brought awesome explanation and visualisation. But as @makagonov mentioned it's missing some rules regarding setting suffix links. It's visible in nice way when going step by step on http://brenden.github.io/ukkonen-animation/ through word 'aabaaabb'. When you go from step 10 to step 11, there is no suffix link from node 5 to node 2 but active point suddenly moves there.

@makagonov since I live in Java world I also tried to follow your implementation to grasp ST building workflow but it was hard to me because of:

  • combining edges with nodes
  • using index pointers instead of references
  • breaks statements;
  • continue statements;

So I ended up with such implementation in Java which I hope reflects all steps in clearer way and will reduce learning time for other Java people:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ST {

  public class Node {
    private final int id;
    private final Map<Character, Edge> edges;
    private Node slink;

    public Node(final int id) {
        this.id = id;
        this.edges = new HashMap<>();
    }

    public void setSlink(final Node slink) {
        this.slink = slink;
    }

    public Map<Character, Edge> getEdges() {
        return this.edges;
    }

    public Node getSlink() {
        return this.slink;
    }

    public String toString(final String word) {
        return new StringBuilder()
                .append("{")
                .append("\"id\"")
                .append(":")
                .append(this.id)
                .append(",")
                .append("\"slink\"")
                .append(":")
                .append(this.slink != null ? this.slink.id : null)
                .append(",")
                .append("\"edges\"")
                .append(":")
                .append(edgesToString(word))
                .append("}")
                .toString();
    }

    private StringBuilder edgesToString(final String word) {
        final StringBuilder edgesStringBuilder = new StringBuilder();
        edgesStringBuilder.append("{");
        for(final Map.Entry<Character, Edge> entry : this.edges.entrySet()) {
            edgesStringBuilder.append("\"")
                    .append(entry.getKey())
                    .append("\"")
                    .append(":")
                    .append(entry.getValue().toString(word))
                    .append(",");
        }
        if(!this.edges.isEmpty()) {
            edgesStringBuilder.deleteCharAt(edgesStringBuilder.length() - 1);
        }
        edgesStringBuilder.append("}");
        return edgesStringBuilder;
    }

    public boolean contains(final String word, final String suffix) {
        return !suffix.isEmpty()
                && this.edges.containsKey(suffix.charAt(0))
                && this.edges.get(suffix.charAt(0)).contains(word, suffix);
    }
  }

  public class Edge {
    private final int from;
    private final int to;
    private final Node next;

    public Edge(final int from, final int to, final Node next) {
        this.from = from;
        this.to = to;
        this.next = next;
    }

    public int getFrom() {
        return this.from;
    }

    public int getTo() {
        return this.to;
    }

    public Node getNext() {
        return this.next;
    }

    public int getLength() {
        return this.to - this.from;
    }

    public String toString(final String word) {
        return new StringBuilder()
                .append("{")
                .append("\"content\"")
                .append(":")
                .append("\"")
                .append(word.substring(this.from, this.to))
                .append("\"")
                .append(",")
                .append("\"next\"")
                .append(":")
                .append(this.next != null ? this.next.toString(word) : null)
                .append("}")
                .toString();
    }

    public boolean contains(final String word, final String suffix) {
        if(this.next == null) {
            return word.substring(this.from, this.to).equals(suffix);
        }
        return suffix.startsWith(word.substring(this.from,
                this.to)) && this.next.contains(word, suffix.substring(this.to - this.from));
    }
  }

  public class ActivePoint {
    private final Node activeNode;
    private final Character activeEdgeFirstCharacter;
    private final int activeLength;

    public ActivePoint(final Node activeNode,
                       final Character activeEdgeFirstCharacter,
                       final int activeLength) {
        this.activeNode = activeNode;
        this.activeEdgeFirstCharacter = activeEdgeFirstCharacter;
        this.activeLength = activeLength;
    }

    private Edge getActiveEdge() {
        return this.activeNode.getEdges().get(this.activeEdgeFirstCharacter);
    }

    public boolean pointsToActiveNode() {
        return this.activeLength == 0;
    }

    public boolean activeNodeIs(final Node node) {
        return this.activeNode == node;
    }

    public boolean activeNodeHasEdgeStartingWith(final char character) {
        return this.activeNode.getEdges().containsKey(character);
    }

    public boolean activeNodeHasSlink() {
        return this.activeNode.getSlink() != null;
    }

    public boolean pointsToOnActiveEdge(final String word, final char character) {
        return word.charAt(this.getActiveEdge().getFrom() + this.activeLength) == character;
    }

    public boolean pointsToTheEndOfActiveEdge() {
        return this.getActiveEdge().getLength() == this.activeLength;
    }

    public boolean pointsAfterTheEndOfActiveEdge() {
        return this.getActiveEdge().getLength() < this.activeLength;
    }

    public ActivePoint moveToEdgeStartingWithAndByOne(final char character) {
        return new ActivePoint(this.activeNode, character, 1);
    }

    public ActivePoint moveToNextNodeOfActiveEdge() {
        return new ActivePoint(this.getActiveEdge().getNext(), null, 0);
    }

    public ActivePoint moveToSlink() {
        return new ActivePoint(this.activeNode.getSlink(),
                this.activeEdgeFirstCharacter,
                this.activeLength);
    }

    public ActivePoint moveTo(final Node node) {
        return new ActivePoint(node, this.activeEdgeFirstCharacter, this.activeLength);
    }

    public ActivePoint moveByOneCharacter() {
        return new ActivePoint(this.activeNode,
                this.activeEdgeFirstCharacter,
                this.activeLength + 1);
    }

    public ActivePoint moveToEdgeStartingWithAndByActiveLengthMinusOne(final Node node,
                                                                       final char character) {
        return new ActivePoint(node, character, this.activeLength - 1);
    }

    public ActivePoint moveToNextNodeOfActiveEdge(final String word, final int index) {
        return new ActivePoint(this.getActiveEdge().getNext(),
                word.charAt(index - this.activeLength + this.getActiveEdge().getLength()),
                this.activeLength - this.getActiveEdge().getLength());
    }

    public void addEdgeToActiveNode(final char character, final Edge edge) {
        this.activeNode.getEdges().put(character, edge);
    }

    public void splitActiveEdge(final String word,
                                final Node nodeToAdd,
                                final int index,
                                final char character) {
        final Edge activeEdgeToSplit = this.getActiveEdge();
        final Edge splittedEdge = new Edge(activeEdgeToSplit.getFrom(),
                activeEdgeToSplit.getFrom() + this.activeLength,
                nodeToAdd);
        nodeToAdd.getEdges().put(word.charAt(activeEdgeToSplit.getFrom() + this.activeLength),
                new Edge(activeEdgeToSplit.getFrom() + this.activeLength,
                        activeEdgeToSplit.getTo(),
                        activeEdgeToSplit.getNext()));
        nodeToAdd.getEdges().put(character, new Edge(index, word.length(), null));
        this.activeNode.getEdges().put(this.activeEdgeFirstCharacter, splittedEdge);
    }

    public Node setSlinkTo(final Node previouslyAddedNodeOrAddedEdgeNode,
                           final Node node) {
        if(previouslyAddedNodeOrAddedEdgeNode != null) {
            previouslyAddedNodeOrAddedEdgeNode.setSlink(node);
        }
        return node;
    }

    public Node setSlinkToActiveNode(final Node previouslyAddedNodeOrAddedEdgeNode) {
        return setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, this.activeNode);
    }
  }

  private static int idGenerator;

  private final String word;
  private final Node root;
  private ActivePoint activePoint;
  private int remainder;

  public ST(final String word) {
    this.word = word;
    this.root = new Node(idGenerator++);
    this.activePoint = new ActivePoint(this.root, null, 0);
    this.remainder = 0;
    build();
  }

  private void build() {
    for(int i = 0; i < this.word.length(); i++) {
        add(i, this.word.charAt(i));
    }
  }

  private void add(final int index, final char character) {
    this.remainder++;
    boolean characterFoundInTheTree = false;
    Node previouslyAddedNodeOrAddedEdgeNode = null;
    while(!characterFoundInTheTree && this.remainder > 0) {
        if(this.activePoint.pointsToActiveNode()) {
            if(this.activePoint.activeNodeHasEdgeStartingWith(character)) {
                activeNodeHasEdgeStartingWithCharacter(character, previouslyAddedNodeOrAddedEdgeNode);
                characterFoundInTheTree = true;
            }
            else {
                if(this.activePoint.activeNodeIs(this.root)) {
                    rootNodeHasNotEdgeStartingWithCharacter(index, character);
                }
                else {
                    previouslyAddedNodeOrAddedEdgeNode = internalNodeHasNotEdgeStartingWithCharacter(index,
                            character, previouslyAddedNodeOrAddedEdgeNode);
                }
            }
        }
        else {
            if(this.activePoint.pointsToOnActiveEdge(this.word, character)) {
                activeEdgeHasCharacter();
                characterFoundInTheTree = true;
            }
            else {
                if(this.activePoint.activeNodeIs(this.root)) {
                    previouslyAddedNodeOrAddedEdgeNode = edgeFromRootNodeHasNotCharacter(index,
                            character,
                            previouslyAddedNodeOrAddedEdgeNode);
                }
                else {
                    previouslyAddedNodeOrAddedEdgeNode = edgeFromInternalNodeHasNotCharacter(index,
                            character,
                            previouslyAddedNodeOrAddedEdgeNode);
                }
            }
        }
    }
  }

  private void activeNodeHasEdgeStartingWithCharacter(final char character,
                                                    final Node previouslyAddedNodeOrAddedEdgeNode) {
    this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
    this.activePoint = this.activePoint.moveToEdgeStartingWithAndByOne(character);
    if(this.activePoint.pointsToTheEndOfActiveEdge()) {
        this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
    }
  }

  private void rootNodeHasNotEdgeStartingWithCharacter(final int index, final char character) {
    this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
    this.activePoint = this.activePoint.moveTo(this.root);
    this.remainder--;
    assert this.remainder == 0;
  }

  private Node internalNodeHasNotEdgeStartingWithCharacter(final int index,
                                                         final char character,
                                                         Node previouslyAddedNodeOrAddedEdgeNode) {
    this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
    if(this.activePoint.activeNodeHasSlink()) {
        this.activePoint = this.activePoint.moveToSlink();
    }
    else {
        this.activePoint = this.activePoint.moveTo(this.root);
    }
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private void activeEdgeHasCharacter() {
    this.activePoint = this.activePoint.moveByOneCharacter();
    if(this.activePoint.pointsToTheEndOfActiveEdge()) {
        this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
    }
  }

  private Node edgeFromRootNodeHasNotCharacter(final int index,
                                             final char character,
                                             Node previouslyAddedNodeOrAddedEdgeNode) {
    final Node newNode = new Node(idGenerator++);
    this.activePoint.splitActiveEdge(this.word, newNode, index, character);
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
    this.activePoint = this.activePoint.moveToEdgeStartingWithAndByActiveLengthMinusOne(this.root,
            this.word.charAt(index - this.remainder + 2));
    this.activePoint = walkDown(index);
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private Node edgeFromInternalNodeHasNotCharacter(final int index,
                                                 final char character,
                                                 Node previouslyAddedNodeOrAddedEdgeNode) {
    final Node newNode = new Node(idGenerator++);
    this.activePoint.splitActiveEdge(this.word, newNode, index, character);
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
    if(this.activePoint.activeNodeHasSlink()) {
        this.activePoint = this.activePoint.moveToSlink();
    }
    else {
        this.activePoint = this.activePoint.moveTo(this.root);
    }
    this.activePoint = walkDown(index);
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private ActivePoint walkDown(final int index) {
    while(!this.activePoint.pointsToActiveNode()
            && (this.activePoint.pointsToTheEndOfActiveEdge() || this.activePoint.pointsAfterTheEndOfActiveEdge())) {
        if(this.activePoint.pointsAfterTheEndOfActiveEdge()) {
            this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge(this.word, index);
        }
        else {
            this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
        }
    }
    return this.activePoint;
  }

  public String toString(final String word) {
    return this.root.toString(word);
  }

  public boolean contains(final String suffix) {
    return this.root.contains(this.word, suffix);
  }

  public static void main(final String[] args) {
    final String[] words = {
            "abcabcabc$",
            "abc$",
            "abcabxabcd$",
            "abcabxabda$",
            "abcabxad$",
            "aabaaabb$",
            "aababcabcd$",
            "ababcabcd$",
            "abccba$",
            "mississipi$",
            "abacabadabacabae$",
            "abcabcd$",
            "00132220$"
    };
    Arrays.stream(words).forEach(word -> {
        System.out.println("Building suffix tree for word: " + word);
        final ST suffixTree = new ST(word);
        System.out.println("Suffix tree: " + suffixTree.toString(word));
        for(int i = 0; i < word.length() - 1; i++) {
            assert suffixTree.contains(word.substring(i)) : word.substring(i);
        }
    });
  }
}

Solution 5

Thanks for the well explained tutorial by @jogojapan, I implemented the algorithm in Python.

A couple of minor problems mentioned by @jogojapan turns out to be more sophisticated than I have expected, and need to be treated very carefully. It cost me several days to get my implementation robust enough (I suppose). Problems and solutions are listed below:

  1. End with Remainder > 0 It turns out this situation can also happen during the unfolding step, not just the end of the entire algorithm. When that happens, we can leave the remainder, actnode, actedge, and actlength unchanged, end the current unfolding step, and start another step by either keep folding or unfolding depending on if the next char in the original string is on the current path or not.

  2. Leap Over Nodes: When we follow a suffix link, update the active point, and then find that its active_length component does not work well with the new active_node. We have to move forward to the right place to split, or insert a leaf. This process might be not that straightforward because during the moving the actlength and actedge keep changing all the way, when you have to move back to the root node, the actedge and actlength could be wrong because of those moves. We need additional variable(s) to keep that information.

    enter image description here

The other two problems have somehow been pointed out by @managonov

  1. Split Could Degenerate When trying to split an edge, sometime you'll find the split operation is right on a node. That case we only need add a new leaf to that node, take it as a standard edge split operation, which means the suffix links if there's any, should be maintained correspondingly.

  2. Hidden Suffix Links There is another special case which is incurred by problem 1 and problem 2. Sometimes we need to hop over several nodes to the right point for split, we might surpass the right point if we move by comparing the remainder string and the path labels. That case the suffix link will be neglected unintentionally, if there should be any. This could be avoided by remembering the right point when moving forward. The suffix link should be maintained if the split node already exists, or even the problem 1 happens during a unfolding step.

Finally, my implementation in Python is as follows:

Tips: It includes a naive tree printing function in the code above, which is very important while debugging. It saved me a lot of time and is convenient for locating special cases.

Share:
191,855
Nathan Ridley
Author by

Nathan Ridley

Need a mentor? https://www.codementor.io/nathanridley I'm an independent developer from Brisbane, Australia. My interests and skillset in development cross the board, from UI and graphic design to HTML/CSS and back end development, including database design and development. I am also learning game and graphics development, formerly with C# and SharpDX and lately with C++ and OpenGL. The primary technologies I use for business/web application development are C#, .Net, ASP.Net MVC, Web API, HTML, CSS, JavaScript, SQL Server, MongoDB, Redis, node.js and anything useful that is related. GitHub Projects (coding) Dribbble Profile (design work)

Updated on July 08, 2022

Comments

  • Nathan Ridley
    Nathan Ridley almost 2 years

    I feel a bit thick at this point. I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I've found is Fast String Searching With Suffix Trees, but he glosses over various points and some aspects of the algorithm remain unclear.

    A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I'm sure.

    For reference, here's Ukkonen's paper on the algorithm: http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

    My basic understanding, so far:

    • I need to iterate through each prefix P of a given string T
    • I need to iterate through each suffix S in prefix P and add that to tree
    • To add suffix S to the tree, I need to iterate through each character in S, with the iterations consisting of either walking down an existing branch that starts with the same set of characters C in S and potentially splitting an edge into descendent nodes when I reach a differing character in the suffix, OR if there was no matching edge to walk down. When no matching edge is found to walk down for C, a new leaf edge is created for C.

    The basic algorithm appears to be O(n2), as is pointed out in most explanations, as we need to step through all of the prefixes, then we need to step through each of the suffixes for each prefix. Ukkonen's algorithm is apparently unique because of the suffix pointer technique he uses, though I think that is what I'm having trouble understanding.

    I'm also having trouble understanding:

    • exactly when and how the "active point" is assigned, used and changed
    • what is going on with the canonization aspect of the algorithm
    • Why the implementations I've seen need to "fix" bounding variables that they are using

    Here is the completed C# source code. It not only works correctly, but supports automatic canonization and renders a nicer looking text graph of the output. Source code and sample output is at:

    https://gist.github.com/2373868


    Update 2017-11-04

    After many years I've found a new use for suffix trees, and have implemented the algorithm in JavaScript. Gist is below. It should be bug-free. Dump it into a js file, npm install chalk from the same location, and then run with node.js to see some colourful output. There's a stripped down version in the same Gist, without any of the debugging code.

    https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6

    • Ivaylo Strandjev
      Ivaylo Strandjev about 12 years
      I have had quite a lot problems to implement this data structure myself. In the end I found this article and managed to implement it. A great plus for it is that you have a step-by-step example for a quite long string so you get to see what you should do. Please take a look at the article and I will be more then happy to give tips where needed. I am hesitant to gove yet another full-blown explanation as there are alredy quite a few round the internet.
    • jogojapan
      jogojapan about 12 years
      Did you take a look at the description given in Dan Gusfield's book? I found that to be helpful.
    • Nathan Ridley
      Nathan Ridley about 12 years
      I don't have a copy of that book unfortunately...
    • maruan
      maruan about 12 years
      About the implementation and tests at gist.github.com/2373868. The code works not properly on string 'dedododeeodo$', because in your output suffix tree we cannot find suffix 'do$'. The reason is that the linked node for active node node #3 is not null but node #1 (check the output for iteration 12).
    • Yuri Astrakhan
      Yuri Astrakhan over 11 years
      The gist does not specify the license - can I change your code and republish under MIT (obviously with attributions)?
    • Nathan Ridley
      Nathan Ridley over 11 years
      Yep, go for your life. Consider it public domain. As mentioned by another answer on this page, there's a bug that needs fixing anyway.
    • Jim Balter
      Jim Balter over 10 years
      "The basic algorithm appears to be O(n2)" -- Good grief. No, the whole point of suffix trees is that their construction is O(n). See suffixtree.org
    • AnyOneElse
      AnyOneElse over 10 years
      @Nathan Ridley: your link suffixtree.nathanridley.com seems to be dead. Shame, I would have loved to see it.
    • cos
      cos over 10 years
      maybe this implementation will help others, goto code.google.com/p/text-indexing
    • James Youngman
      James Youngman almost 8 years
      "Consider it public domain" is, perhaps surprisingly a very unhelpful answer. The reason is that it's actually impossible for you to place the work in the public domain. Hence your "consider it..." comment underlines the fact that the license is unclear and gives the reader reason to doubt that the status of the work is actually clear to you. If you would like people to be able to use your code, please specify a license for it, choose any license you like (but, unless you're a lawyer, choose a pre-existing license!)
    • ihadanny
      ihadanny over 7 years
      isn't the naive implementation $O(n^3)$? I think you forgot that for each suffix of a prefix you have to find it first in the trie - which is $O(n)$ as well...
  • jogojapan
    jogojapan about 12 years
    Sorry this ended up a little longer than I hoped. And I realize it explains an number of trivial things we all know, while the difficult parts might still not be perfectly clear. Let's edit it into shape together.
  • jogojapan
    jogojapan about 12 years
    And I should add that this is not based on the description found in Dan Gusfield's book. It's a new attempt at describing the algorithm by first considering a string with no repetitions and then discussing how repetitions are handled. I hoped that would be more intuitive.
  • Hari
    Hari about 12 years
    That is such an awesome answer by jogojapan. And I too faced a difficult time trying to understand and implement it. On a side note, one problem I face is validating my implementation of the algorithm - on two occasions I ran the code and found bugs through some test cases.
  • Nathan Ridley
    Nathan Ridley about 12 years
    Thanks @jogojapan, I was able to write a fully-working example thanks to your explanation. I've published the source so hopefully somebody else may find it of use: gist.github.com/2373868
  • Nathan Ridley
    Nathan Ridley about 12 years
    @jogojapan - one question - why do we care if the remainder is greater than zero at the end? I tested ABCABXABCA vs ABCABXABCA$ and the altered tree doesn't seem to have any benefit with regards to searchability.
  • Nathan Ridley
    Nathan Ridley about 12 years
    @jogojapan thanks, that makes sense now. The final bit I'm looking at, as I think I haven't accounted for it in my code, is the point you're addressing in your final supplementary diagram. My main problem is trying to come up with a character string that will cause this scenario to eventuate, such that I can test for it. Do you have any ideas?
  • jogojapan
    jogojapan about 12 years
    @NathanRidley Yes (by the way, that final bit is what Ukkonen calls canonicize). One way to trigger it is to make sure there is a substring that appears three times and ends in a string that appears one more time in yet a different context. E.g. abcdefabxybcdmnabcdex. The initial part of abcd is repeated in abxy (this creates an internal node after ab) and again in abcdex, and it ends in bcd, which appears not only in the bcdex context, but also in the bcdmn context. After abcdex is inserted, we follow the suffix link to insert bcdex, and there canonicize will kick in.
  • Nathan Ridley
    Nathan Ridley about 12 years
    Ah, great thanks! My initial code was horrible and had bugs; I'm writing a newer version at the moment and will post the updated code when it's done, so that hopefully others viewing your explanation will have some working code to follow along with.
  • Nathan Ridley
    Nathan Ridley about 12 years
    Ok my code has been completely rewritten and now works correctly for all cases, including automatic canonization, plus has a much nicer text graph output. gist.github.com/2373868
  • Nathan Ridley
    Nathan Ridley about 12 years
    One question about the logic surrounding walking down the tree when, after an iteration, the active length is equal to or greater than the length of the current edge. My understanding is that I should walk down the tree, updating the active node for each new node reached and the active edge for the current suffix, and reducing the active length each time until the active length becomes zero or less than the length of the active edge. Doing this, my code correctly handles "abcabxabcd", "mississippi" and "abcdefabxybcdmnabcdex", but fails with "abcadak". Output here: bit.ly/HVVVUO
  • jogojapan
    jogojapan about 12 years
    Your description of the logic (which is the canonize function) is accurate. The problem (or one problem) is that in iteration #6, after having dealt with a{k}, your code does not go to the root node. In fact the error is in my wording of Rule 3 above: The rule should apply after any insert from an internal node, not only when an edge was split. I will correct it soon... let's see if you find further problems first.
  • Nathan Ridley
    Nathan Ridley about 12 years
    Great, thanks, that did the trick! I have updated my gist code and sample output and added 5 more sample outputs as well for verification. gist.github.com/2373868
  • Nathan Ridley
    Nathan Ridley about 12 years
    @jogojapan - I think there might be something missing from your instructions. Taking your GraphViz samples, I have updated my code and created a web-based visualiser that generates Dot graphs for every step of the process. suffixtree.nathanridley.com - try the string "aabaaa" - note that the suffix a$ is missing from the final graph. Not quite sure why that might happen, any ideas?
  • jogojapan
    jogojapan over 11 years
    @MrWiredancer Thanks for your suggested edit for step 8. Unfortunately the edit was rejected as too minor by three people before I saw it. But you were right, the active node in step 8 was incorrect. I have corrected it now.
  • makagonov
    makagonov over 11 years
    @jogojapan Thanks a lot for such an understandable approach of Ukkonen's algorithm! Unfortunately, it failed in my case for string "aabaaabb" due to Rule 2, which states that nodes can be connected with a suffix link if they are created during the current step. Please, find a more detailed step-by-step description at link. My source-code in Java: link. If you've got an implementation of ST with your approach, please, could you share it? That would save much time for everyone reading your post. Thank you.
  • jogojapan
    jogojapan over 11 years
    @makagonov Did you look at Nathan Ridley's implementation?
  • makagonov
    makagonov over 11 years
    @jogojapan I've finally done it with your approach. The problem was that you didn't mention two more cases when we need to add a suffix link: 1) when we create a leaf and 2) when there is an edge existing. Anyway, adding suffix links ONLY to the nodes created during the current step is not enough. A detailed implementation of your approach in Java is here. A problem on Timus Online Judge with good test coverage passed all the tests.
  • makagonov
    makagonov over 11 years
    @jogojapan I've posted an answer below which is based on your approach. I made some modifications to the rules and added one important observation.
  • jogojapan
    jogojapan over 11 years
    Thanks very much and +1 for you effort. I am sure you are right.. although I don't have the time to think about the details right away. I'll check later and possibly modify my answer then as well.
  • Thibaut Colar
    Thibaut Colar about 11 years
    First off, thanks so much for this amazingly detailed explanation. Quick question: Is it possible to carry "payload" in a ukkonen tree, meaning say I want to store a number for each given prefix. I'm unusre since nodes can be implicit ... would it have to be carried on the edge then ? Thank you.
  • jogojapan
    jogojapan about 11 years
    @ThibautColar Yes, that's difficult. It works easily when the information is attached to a suffix of the string, or a substring that occurs more than once (because this means there must be a branching node in the suffix tree, so you can attach the info to that node). But general substrings / prefixes are only implicit. You could add it to the edge, but there may then more than one infos per edge, and the total size (number of nodes and edges) would, in general, no longer be O(n).
  • Grijesh Chauhan
    Grijesh Chauhan almost 11 years
    this is an awesome! I never saw such a great explanation, this is the best answer one can get.
  • zinking
    zinking over 10 years
    for Chinese translations visit this: ibaiyang.org/2013/01/06/suffix-tree-introduction
  • Filipe Gonçalves
    Filipe Gonçalves about 10 years
    Follow-up question from someone who can't comment: stackoverflow.com/questions/23303289/…
  • Adi Inbar
    Adi Inbar about 10 years
    @jogojapan Please see this question that was just posted seeking clarification about your answer.
  • dyesdyes
    dyesdyes almost 10 years
    Thanks so much, it really helped. Though, could you be more specific on Observation 3? For instance, giving the diagrams of the 2 steps that introduce the new suffix link. Is the node linked the active node? (as we don't actually insert the 2nd node)
  • tariq zafar
    tariq zafar over 9 years
    @makagonov Hey can you help me build suffix tree for your string "cdddcdc" I am bit confused doing so (the starting steps).
  • Prahalad Deshpande
    Prahalad Deshpande over 9 years
    I wish there was a facility in SO to upvote multiple times :) :). Solid answer to an equally solid question!!
  • Stan R.
    Stan R. about 9 years
    this is a great answer, on one of the more difficult data-structures to understand. this definitely helped me wrap my head around it. thank you.
  • sqd
    sqd about 9 years
    As for rule 3, a smart way is to set the suffix link of root to root itself, and (in default) set the suffix link of every node to root. Thus we can avoid the conditioning and just follow the suffix link.
  • SexyBeast
    SexyBeast about 9 years
    Foe big blobs of text, will standard search libraries like Sphinx or Lucene be faster at pattern searching than what can be done using just this algorithm and building a suffix tree and searching it for the pattern?
  • jogojapan
    jogojapan about 9 years
    @Cupidvogel It depends on what you mean by "pattern search". Sphinx and Lucene use inverted files, so they need to tokenize the text. You can only search for tokens or combinations of tokens. If you are dealing with natural language text and want search results ranking by relevance, then that is most likely what you need, and it will be very fast, and also much smaller (in RAM and on disk) than a suffix tree or suffix array.
  • SexyBeast
    SexyBeast about 9 years
    Thanks. So for what purposes are suffix tree bases solution like this are more suitable? By pattern search I simply mean substring search.
  • jogojapan
    jogojapan about 9 years
    Whenever tokenization isn't possible, e.g. when searching gene sequences rather than natural language, suffix trees (or better suffix arrays) are an option, esp. when the search pattern is very long. They are also good when it comes to identifying patterns like repeated substrings, or substrings that are squares or palindromes and such like. There are various specialized applications in bioinformatics. However, I don't think the Ukkonen algorithm desribed here is used very much in practice these days; I would guess that most people interested in it are interested for educational reasons.
  • SexyBeast
    SexyBeast about 9 years
    Yep. In spite of your more than awesome explanation, I am having great difficulty in understanding how it all works.. :) So say I have to add a search functionality for a text book, say big sized book at that. You should be able to search text inside it just like you can in a PDF document. Here Ukonen's algo won't be a fit one you say?
  • jogojapan
    jogojapan about 9 years
    A single book, even if big, is still relatively little content. Also if this is part of a single application run by one user on a single device for individual searches, a search index (Lucene-style or suffix tree) would seem to be overkill. I'd use a direct string search algorithm based on KMP. It's quite likely that simply using a string search function from the standard library of your programming language, or linking to something like "grep" would work perfectly.
  • Egor Okhterov
    Egor Okhterov over 8 years
    It would be just awesome to have similarly structured answer to the Dijkstra's shortest-path algorithm.
  • Nathan Ridley
    Nathan Ridley over 8 years
    @Pixar It would be awesome to have a website that is dedicated to detailed, plain-english explanations of algorithms like this.
  • elif
    elif over 8 years
    Great explanation! Just one thing I'd like to add. I am still trying to understand, but as far as I understood, this explanation skips one edge case: "abca". The case where the last digit was repeated in the string before. In that case we need to append an end character to the string, sth like '$'. Check the 5th paragraph in cise.ufl.edu/~sahni/dsaaj/enrich/c16/suffix.htm
  • rampion
    rampion over 8 years
    A note on this: "However, if we want to use the tree to search for general substrings, not only suffixes of the main string, this final step is indeed not required, as suggested by the OP's comment below." In general, you can use a finalized suffix tree to report the number of occurrences of a particular substring by just following the path from the root to a particular node or edge defined by that substring, then counting the number of leaves in that subtree. But that property doesn't hold for non-finalized suffixtrees.
  • Tony
    Tony about 8 years
    Are you sure? You says 'we might omit adding some nodes to the tree due to the Rule 3', but your example can't show that. Without your new link, I can still get right tree, but just with one more step. Can you come up with another example show that? Also @jogojapan?
  • wilson100
    wilson100 about 8 years
    Really appreciate. This is the most understandable explanation of Ukkonen's algorithm I can found on web. I try to implement based on this but found out there are some corner cases missing, because the suffix link does not set up correctly. like "cdddcdc"
  • Blair Houghton
    Blair Houghton almost 8 years
    One of the academic papers defines "proper" as meaning the "proper suffix" of a string doesn't contain its first character. Sometimes you call a whole substring a "suffix", but when defining the algorithm the terms "string" and "substring" and "suffix" get thrown around liberally and sometimes you need to be very clear what you mean by "suffix", so the term "proper suffix" excludes calling the whole thing a suffix. So a suffix substring of a string can be any legitimate substring and can have a proper suffix that isn't the same suffix. Because logic.
  • NICK
    NICK almost 8 years
    Hello @makagonov, maybe this is a bad place to ask, but i ported your example code to Rust for my project. It running like a charm, however, your code didn't have a license attach to it, so I don't know if I could actually use it in my projects or not. Please clarify the permission of your code, thank you! My port is here: gist.github.com/raincious/cb0b6b48efc3c63d6532c329fe362112
  • rd22
    rd22 over 7 years
    This is a great explanation but I understood it more clearly and quickly from here youtu.be/F3nbY3hIDLQ?t=37m19s ,this is a lecture from MIT open courseware, he explains compressed trie first and then goes on explain suffix trees. I will say this is a must watch.
  • cbare
    cbare over 7 years
    Fantastic writing, @jogojapan. As far as I can tell, Gusfield's String Algorithms book takes the approach of starting with a fully correct but inefficient cubic-time algorithm and progressively improving its performance to linear. This answer starts with a linear algorithm that only handles trivial cases (strings with no repetition) then adds logic to handle progressively more difficult cases. Nice work!
  • Nathan Ridley
    Nathan Ridley over 6 years
    @jogojapan Came back to this after 5+ years and have added a JS implementation, in case you're ever looking for a working reference (linked in the question).
  • exebook
    exebook about 6 years
    Lost you at step four, where ac comes from?
  • jogojapan
    jogojapan about 6 years
    @exebook I don't see any ac in step 4... Can you explain what you are referring to?
  • exebook
    exebook about 6 years
    @jogojapan sorry for this confusion, I was scrolling for too long to get to the last comment and by that time mistook ac for ca.
  • jogojapan
    jogojapan about 6 years
    @exebook The # is a variable that in step 4 takes on the meaning of 4, because that is the current position. So edge [2,#] takes on the meaning of [2,4], and the string beginning at position 2 and ending at position 4 is ca (because c is the 3rd, and a the 4th character of the string).
  • asv
    asv almost 5 years
    When there is no a suffix link we set the active node to the root node, in this case the down walk on the tree could be more than the ramainder? Besides, if in a phase with ramonder r we insert k nodes for the firste node inserted we could walk down for r-1 for the second r-2 and so on. Then we have r-1+...r-k steps, how to achieve from that an overall O(n)?
  • IvanaGyro
    IvanaGyro almost 5 years
    Is there any case that proves this extra action is required? I have a guess. In rule 3, after the active node is set to the root, the tuple, (active_node; active_edge; active_length), should be updated to find the correct active_node. The function of this process is as the same as the function of the extra suffix link that this post mentioned.
  • IvanaGyro
    IvanaGyro almost 5 years
    aabaacaad is one of the cases that shows adding extra suffix link can reduce the times of updating the triple. The conclusion in the last two paragraphs of the post of jogojapan is wrong. If we do not add the suffix links this post mentioned, the average time complexity should be O(nlong(n)) or more. Because it takes extra time to walk the tree to find the correct active_node.
  • Shayan
    Shayan about 3 years
    Very good explanation; But, I've found the rules very unhelpful. As you've mentioned and gave an example that we may need to update the active node after going through a suffix link, it's actually more complicated than that. We need to generalize that idea. When ever going through a suffix link, or even when a node doesn't have a suffix link, we need to use dfs (as some calls it "walk down".) Consider the word "aabbabab$". At some point, you'll need to go from a node with prefix "bab" to a node with prefix "ab". The suffix link won't exist till that point. The only way is to use dfs.
  • Amar Prakash Pandey
    Amar Prakash Pandey over 2 years
    Thanks for this explanation!
  • Abhinav Atul
    Abhinav Atul over 2 years
    It seems activeNode means suffixNode that also contains the max proper suffix of the currentSuffix as prefix, for, e.g., in case of "abc" the activeNode represents the '\0' (null node) or the empty string, in case of "abca", this changes to the node that starts with 'a' as that contains the max suffix ("a") of the current suffix ("abca") as prefix Renaming activeNode to nodeWithMaxProperSuffixOfCurrentSuffixAsPrefix might be a good idea.
  • Jake
    Jake almost 2 years
    Why are you making it seem like there's a 3-tuple of variables plus a remainder variable, instead of saying there's 4 variables?
  • jogojapan
    jogojapan almost 2 years
    @Jake The triple (i.e. the active_point refers to a position inside the suffix tree, whereas the remainder refers to the input string. I just wanted to keep these two separate conceptually. Technically of course you do it as a quadruple, or as four separate variables, or anything else.