Shortest path to transform one word into another

algorithm shortest-path edit-distance hamming-distance

30,863

Solution 1

NEW ANSWER

Given the recent update, you could try A* with the Hamming distance as a heuristic. It's an admissible heuristic since it's not going to overestimate the distance

OLD ANSWER

You can modify the dynamic-program used to compute the Levenshtein distance to obtain the sequence of operations.

EDIT: If there are a constant number of strings, the problem is solvable in polynomial time. Else, it's NP-hard (it's all there in wikipedia) .. assuming your friend is talking about the problem being NP-hard.

EDIT: If your strings are of equal length, you can use Hamming distance.

Solution 2

With a dictionary, BFS is optimal, but the running time needed is proportional to its size (V+E). With n letters, the dictionary might have ~a^n entires, where a is alphabet size. If the dictionary contains all words but the one that should be on the end of chain, then you'll traverse all possible words but won't find anything. This is graph traversal, but the size might be exponentially large.

You may wonder if it is possible to do it faster - to browse the structure "intelligently" and do it in polynomial time. The answer is, I think, no.

The problem:

You're given a fast (linear) way to check if a word is in dictionary, two words u, v and are to check if there's a sequence u -> a₁ -> a₂ -> ... -> a_n -> v.

is NP-hard.

Proof: Take some 3SAT instance, like

(p or q or not r) and (p or not q or r)

You'll start with 0 000 00 and are to check if it is possible to go to 2 222 22.

The first character will be "are we finished", three next bits will control p,q,r and two next will control clauses.

Allowed words are:

Anything that starts with 0 and contains only 0's and 1's
Anything that starts with 2 and is legal. This means that it consists of 0's and 1's (except that the first character is 2, all clauses bits are rightfully set according to variables bits, and they're set to 1 (so this shows that the formula is satisfable).
Anything that starts with at least two 2's and then is composed of 0's and 1's (regular expression: 222* (0+1)*, like 22221101 but not 2212001

To produce 2 222 22 from 0 000 00, you have to do it in this way:

(1) Flip appropriate bits - e.g. 0 100 111 in four steps. This requires finding a 3SAT solution.

(2) Change the first bit to 2: 2 100 111. Here you'll be verified this is indeed a 3SAT solution.

(3) Change 2 100 111 -> 2 200 111 -> 2 220 111 -> 2 222 111 -> 2 222 211 -> 2 222 221 -> 2 222 222.

These rules enforce that you can't cheat (check). Going to 2 222 22 is possible only if the formula is satisfable, and checking that is NP-hard. I feel it might be even harder (#P or FNP probably) but NP-hardness is enough for that purpose I think.

Edit: You might be interested in disjoint set data structure. This will take your dictionary and group words that can be reached from each other. You can also store a path from every vertex to root or some other vertex. This will give you a path, not neccessarily the shortest one.

Solution 3

There are methods of varying efficiency for finding links - you can construct a complete graph for each word length, or you can construct a BK-Tree, for example, but your friend is right - BFS is the most efficient algorithm.

There is, however, a way to significantly improve your runtime: Instead of doing a single BFS from the source node, do two breadth first searches, starting at either end of the graph, and terminating when you find a common node in their frontier sets. The amount of work you have to do is roughly half what is required if you search from only one end.

Solution 4

You can make it a little quicker by removing the words that are not the right length, first. More of the limited dictionary will fit into the CPU's cache. Probably all of it.

Also, all of the strncmp comparisons (assuming you made everything lowercase) can be memcmp comparisons, or even unrolled comparisons, which can be a speedup.

You could use some preprocessor magic and hard-compile the task for that word-length, or roll a few optimized variations of the task for common word lengths. All of those extra comparisons can 'go away' for pure unrolled fun.

Solution 5

What you are looking for is called the Edit Distance. There are many different types.

From (http://en.wikipedia.org/wiki/Edit_distance): "In information theory and computer science, the edit distance between two strings of characters is the number of operations required to transform one of them into the other."

This article about Jazzy (the java spell check API) has a nice overview of these sorts of comparisons (it's a similar problem - providing suggested corrections) http://www.ibm.com/developerworks/java/library/j-jazzy/

View more solutions

30,863

Ash Van

Updated on July 09, 2022

Comments

Ash Van almost 2 years
For a Data Structures project, I must find the shortest path between two words (like "cat" and "dog"), changing only one letter at a time. We are given a Scrabble word list to use in finding our path. For example:
```
cat -> bat -> bet -> bot -> bog -> dog
```
I've solved the problem using a breadth first search, but am seeking something better (I represented the dictionary with a trie).

Please give me some ideas for a more efficient method (in terms of speed and memory). Something ridiculous and/or challenging is preferred.

I asked one of my friends (he's a junior) and he said that there is no efficient solution to this problem. He said I would learn why when I took the algorithms course. Any comments on that?

We must move from word to word. We cannot go cat -> dat -> dag -> dog. We also have to print out the traversal.
- samiretas over 14 years
  
  In your example, why is bet in there? You changed the same letter twice in a row, it should read: cat -> bat -> bot -> bog -> dog
- Rusty Rob over 11 years
  
  duplicate stackoverflow.com/questions/11811918/…
- V. Rubinetti over 9 years
  
  Dacman, would you mind sharing how much your performance improved using the heuristics vs. BFS?
- dhughes almost 5 years
  
  On a side note, the shortest path from cat to dog is cat >> cot >> cog >> dog :)
Zed over 14 years

Given the example that should be Hamming distance.
Nick Johnson over 14 years

You can't modify the Levenshtein function to do this, because you have a limited dictionary of valid words - and so the shortest valid path could be very much longer than the number of characters in the string.
Harsh Pathak over 14 years

Right, the word->word concept was only clear in the recent edit.
Ash Van over 14 years

Yes, sorry about that. What if I were to use a Branch and Bound algorithm, using the Hamming distance to bound?
Harsh Pathak over 14 years

I was thinking of A* with the Hamming distance as a heuristic. It's definitely admissible since it's not going to overestimate the distance.
Nick Johnson over 14 years

This depends on how you're measuring runtime. A* could get there faster - but in pathological cases, it could also take longer than the BFS (or dual BFS - see my answer) solution.
Davy8 over 14 years

shameless plug I actually asked a question a month or two ago about this problem except in Python, and I ended up using A* with Hamming distance as the heuristic (although I didn't know it was called Hamming Distance at the time). I got some really good optimizations (some python specific, some not). The question is here stackoverflow.com/questions/788084/… and in the question updates I linked to a blog-post I made (about halfway down... there were a lot of updates to the question) where I described the optimizations along with the code.
Joe Liversedge over 14 years

Great summary. If the original author is looking for something really creative, the edit distance can be used in conjunction with the word-reach graph as a fitness function for a genetic algorithm. The output being the path through the graph from one start word to the end word, so the best answer would be the shortest. (While cool, this would find the answer the fastest, but will not yield a definitive answer. Very TS.) I'd stick with the real world. Eliminate cycles, enumerate the paths and find the 'best' one using the above suggestions. This is tagged 'Java' so try JGraphT.
Ash Van over 14 years

I did this and it worked wonderfully well. Amazed my professor too!
Harsh Pathak over 14 years

Good to know! Algorithms is always fun :)
sidgeon smythe over 14 years

Cool, not often does one see NP-hardness proofs in Stackoverflow answers. :-) I too suspect this problem is harder than NP (PSPACE-complete?) if the dictionary is given simply as a membership oracle... but if the dictionary is actually given in the input, then the problem can trivially be done in polynomial time, as the dictionary size is part of the input (that's the flaw in your NP-hardness proof).
sidgeon smythe over 14 years

No it is not. Read the question carefully. There is a fixed given dictionary, so the edit distance has very little relevance.
V. Rubinetti over 9 years

Dacman, would you mind sharing how much your performance improved using the heuristics vs. BFS?
V. Rubinetti over 9 years

Note that this method only works for unweighted graphs, I believe. On weighted graphs (where some edges are "more cost" or longer than others), using a bi-directional search in the same manner doesn't guarantee the shortest path as been found. Check out this link and this topic. But in this case, the steps between one-letter-different words are all the same
kybernetikos about 9 years

Jacob, can you say more about why it's admissable. I had assumed it wasn't for the same reason that @Nick Jonhson gave - i.e. the hamming distance may say that two words have a hamming distance of 2, but the intermediates you need might not be in the dictionary, so the real distance is larger than 2. (e.g. abc -> dbe, with dictionary abc, xbc, xyc, xye, dye, dbe gives hamming distance 2 and minimum path length of 5)