String similarity score/hash


Solution 1

I believe what you're looking for is called a Locality Sensitive Hash. Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.

As others have mentioned, there are inherent problems with forcing a multi-dimensional space into a lower-dimensional mapping. It's analogous to creating a flat map of the Earth: you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you're using to determine whether strings are "alike".
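
A minimal sketch of one such hash, SimHash over character trigrams (md5, trigrams, and the 64-bit width are arbitrary illustrative choices, not something prescribed above). Similar strings tend to get fingerprints that differ in only a few bits, so closeness is measured by Hamming distance rather than by subtracting scores:

    # SimHash sketch: similar inputs tend to yield fingerprints with a small Hamming distance.
    import hashlib

    def simhash(text, n=3, bits=64):
        grams = [text[i:i + n] for i in range(max(1, len(text) - n + 1))]
        counts = [0] * bits
        for g in grams:
            h = int(hashlib.md5(g.encode()).hexdigest(), 16)
            for b in range(bits):
                counts[b] += 1 if (h >> b) & 1 else -1
        return sum(1 << b for b in range(bits) if counts[b] > 0)

    def hamming(x, y):
        return bin(x ^ y).count("1")

    print(hamming(simhash("Hello world"), simhash("Hello world!")))  # typically small
    print(hamming(simhash("Hello world"), simhash("Foo bar")))       # typically much larger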

Solution 2

Levenshtein distance or one of its derivatives is the algorithm you want. Match the given string against each string in the dictionary. (If you only need a fixed number of the most similar strings, you may want to use a min-heap.) If running Levenshtein distance against every string in the dictionary is too expensive, first use some rough algorithm to exclude clearly distant words from the candidate list, then run Levenshtein distance on the remaining candidates.
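
A minimal sketch of that brute-force step, assuming the dictionary is just a Python list (the tiny dictionary and names are illustrative only; heapq.nsmallest plays the role of the min-heap):

    # Score every dictionary word with Levenshtein distance and keep the k closest.
    import heapq

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def closest(query, dictionary, k=3):
        return heapq.nsmallest(k, dictionary, key=lambda w: levenshtein(query, w))

    print(closest("Hello world", ["Hello world!", "Hello earth", "FooBarbar", "Foo world!"]))
    # 'Hello world!' comes first (distance 1)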


One way to exclude distant words is to index n-grams. Preprocess the dictionary by splitting each word into a list of n-grams. For example, with n = 3:

(0) "Hello world" -> ["Hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]
(1) "FooBarbar" -> ["Foo", "ooB", "oBa", "Bar", "arb", "rba", "bar"]
(2) "Foo world!" -> ["Foo", "oo ", "o w", " wo", "wor", "orl", "rld", "ld!"]

Next, create an index of n-grams:

" wo" -> [0, 2]
"Bar" -> [1]
"Foo" -> [1, 2]
"Hel" -> [0]
"arb" -> [1]
"bar" -> [1]
"ell" -> [0]
"ld!" -> [2]
"llo" -> [0]
"lo " -> [0]
"o w" -> [0, 2]
"oBa" -> [1]
"oo " -> [2]
"ooB" -> [1]
"orl" -> [0, 2]
"rba" -> [1]
"rld" -> [0, 2]
"wor" -> [0, 2]

When you need to find the most similar strings to a given string, split the given string into n-grams and select only those words from the dictionary which share at least one n-gram with it. This reduces the number of candidates to a reasonable amount, and you can then run Levenshtein matching between the given string and each of the remaining candidates.
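
A minimal sketch of that pre-filter, using the toy dictionary from the example above (the final ranking would reuse a Levenshtein implementation such as the one sketched earlier):

    # Build an n-gram -> word-id index, then keep only words sharing at least one n-gram.
    from collections import defaultdict

    def ngrams(s, n=3):
        return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}

    dictionary = ["Hello world", "FooBarbar", "Foo world!"]

    index = defaultdict(set)
    for wid, word in enumerate(dictionary):
        for g in ngrams(word):
            index[g].add(wid)

    def candidates(query):
        ids = set()
        for g in ngrams(query):
            ids |= index.get(g, set())
        return [dictionary[i] for i in ids]

    print(candidates("Foo bar"))  # 'FooBarbar' and 'Foo world!' survive; 'Hello world' is filtered out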


If your strings are long enough, you can reduce the index size by using the min-hashing technique: calculate an ordinary hash for each n-gram and keep only the K smallest hashes, throwing the others away.
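
A minimal sketch of that trimming step (md5 and K = 4 are arbitrary choices for illustration):

    # Index only the K numerically smallest n-gram hashes of each word.
    import hashlib

    def min_hashes(word, n=3, k=4):
        grams = {word[i:i + n] for i in range(max(1, len(word) - n + 1))}
        hashes = sorted(int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams)
        return hashes[:k]

    print(len(min_hashes("Hello world")))  # at most 4 index entries instead of 9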

P.S. this presentation seems like a good introduction to your problem.

Solution 3

This isn't possible, in general, because the set of edit distances between strings forms a metric space, but not one with a fixed dimension. That means that you can't provide a mapping between strings and integers that preserves a distance measure between them.

For example, you cannot assign numbers to these three phrases:

  • one two
  • one six
  • two six

such that the differences between the numbers reflect the pairwise differences between all three phrases.
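
One way to make this concrete, reading similarity at the word level so that each pair of phrases above differs in exactly one word and is therefore at distance 1 from the other two:

    Suppose f maps each phrase to a number with |f(x) - f(y)| = d(x, y). Then
    |f(one two) - f(one six)| = |f(one six) - f(two six)| = |f(one two) - f(two six)| = 1.
    But for any three real numbers, the largest pairwise gap equals the sum of
    the other two, which would force 1 = 1 + 1, a contradiction.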

Solution 4

While the idea seems extremely sweet... I've never heard of this.

I've read many, many techniques, theses, and scientific papers on the subject of spell correction / typo correction, and the fastest proposals revolve around an index and the Levenshtein distance.

There are fairly elaborate techniques; the one I am currently working on combines:

  • A Bursted Trie, with level compactness
  • A Levenshtein Automaton

Even though this doesn't mean it is "impossible" to get a score, I somehow think there would not be so much recent research on string comparison if such a "scoring" method had proved efficient.

If you ever find such a method, I am extremely interested :)

Solution 5

In an unbounded problem, there is no solution which can convert any possible sequence of words, or any possible sequence of characters, to a single number which describes locality.

Imagine similarity at the character level

stops
spots

hello world
world hello

In both examples the messages are different, but the characters in the messages are identical, so the measure would need to encode a position value as well as a character value (char 0 == 'h', char 1 == 'e', ...).

Then compare the following similar messages

hello world
ello world

Although the two strings are similar, they could differ at the beginning, or at the end, which makes scaling by position problematic.

In the case of

spots
stops

The words differ only in the positions of their characters, so some representation of position is important.

If the following strings are similar

 yesssssssssssssss
 yessssssssssssss

Then you have a form of paradox. If you add two 's' characters to the second string, it should keep the same distance from the first string, yet it must map to a distinct value. This can be repeated with progressively longer strings, all of which need to be close to the strings just shorter and just longer than them. I can't see how to achieve this on a single number line.

In general this is treated as a multi-dimensional problem - breaking the string into a vector

[ 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd' ]

But the values of the vector cannot

  • be represented by a fixed-size number, or
  • give a good-quality difference measure.

If the number of words or the length of the strings were bounded, then a coding solution might be possible.

Bounded values

Using something like arithmetic coding, a sequence of words can be converted into a floating-point number which represents the sequence. However, this treats items earlier in the sequence as more significant than items later in the sequence.
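
A minimal sketch of that idea, assuming a small, fixed vocabulary (the vocabulary and names are illustrative only): each word narrows the interval further, so the first word dominates the resulting number.

    # Encode a word sequence as a number in [0, 1) by nested intervals.
    VOCAB = ["hello", "world", "foo", "bar"]  # assumed bounded vocabulary

    def encode(words):
        low, width = 0.0, 1.0
        for w in words:
            width /= len(VOCAB)
            low += VOCAB.index(w) * width
        return low

    print(encode(["hello", "world"]))  # 0.0625
    print(encode(["world", "hello"]))  # 0.25 (the first word dominates the score)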

Data mining solution

If you accept that the problem is high-dimensional, then you can store your strings in a metric tree (Wikipedia: metric tree). This limits your search space, whilst not giving you a "single number" solution.

I have code for this on GitHub: clustering.

Items which are close together should be stored in the same part of the tree, but there is really no guarantee. The radius of each subtree is used to prune the search space.
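
A minimal BK-tree sketch, one common form of metric tree over edit distance (the clustering code linked above may be organised quite differently):

    # BK-tree: children are keyed by their distance to the parent word, and the
    # triangle inequality prunes whole subtrees during a radius search.
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    class BKTree:
        def __init__(self, words):
            it = iter(words)
            self.root = (next(it), {})  # (word, children keyed by distance)
            for w in it:
                self.add(w)

        def add(self, word):
            node = self.root
            while True:
                d = levenshtein(word, node[0])
                if d in node[1]:
                    node = node[1][d]
                else:
                    node[1][d] = (word, {})
                    return

        def search(self, word, radius):
            found, stack = [], [self.root]
            while stack:
                w, children = stack.pop()
                d = levenshtein(word, w)
                if d <= radius:
                    found.append((d, w))
                # only subtrees whose key lies within [d - radius, d + radius] can match
                stack.extend(c for k, c in children.items() if d - radius <= k <= d + radius)
            return sorted(found)

    tree = BKTree(["Hello world", "Hello earth", "Foo bar", "FooBarbar"])
    print(tree.search("Hello worlds", radius=2))  # [(1, 'Hello world')]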

Edit distance or Levenshtein distance

This is used in an SQLite extension to perform similarity searches. It is not a single-number solution either: it works out how many edits change one string into another, and that edit count serves as a similarity score.

Author: Josef Sábl

Updated on April 17, 2020

Comments

  • Josef Sábl
    Josef Sábl about 4 years

    Is there a method to calculate something like general "similarity score" of a string? In a way that I am not comparing two strings together but rather I get some number (hash) for each string that can later tell me that two strings are or are not similar. Two similar strings should have similar (close) hashes.

    Let's consider these strings and scores as an example:

    Hello world                1000
    Hello world!               1010
    Hello earth                1125
    Foo bar                    3250
    FooBarbar                  3750
    Foo Bar!                   3300
    Foo world!                 2350
    

    You can see that Hello world! and Hello world are similar and their scores are close to each other.

    This way, finding the most similar strings to a given string would be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the difference.

  • Nikita Rybak
    Nikita Rybak over 13 years
    It is exactly what it is: a distance. It doesn't give you any characteristic of a single string; it can only be used to compare two particular strings.
  • Josef Sábl
    Josef Sábl over 13 years
    Not bad, but: a) my strings don't contain only words, b) I will get soundexes, ok, but how do I compare if they are similar :-)
  • Josef Sábl
    Josef Sábl over 13 years
    Nikita is right, that's the problem. Other than that, it would be just what I need.
  • Karl Knechtel
    Karl Knechtel over 13 years
    A one-dimensional "characteristic" won't work, because if distance(a,b) = 1 and distance(b,c) = 1, it does not follow that distance(a,c) = 2. What are you really trying to do?
  • DougW
    DougW over 12 years
    I'm going to get a little Information Theory-y here and argue that you have actually just done what you claim is impossible. Each of those strings can be represented as a binary number (ie integer), and you have just proven that you're able to identify structure in that number that describes what you call "difference". I think the question really being asked is, is there a simpler set of numbers we could map strings to that would, without loss, represent the complete set of possible relationships. This is essentially the Kolmogorov complexity of the search space.
  • DougW
    DougW over 12 years
    Clearly it's impossible to have a lower Kolmogorov complexity for arbitrary strings with arbitrary definitions of "alike". I would argue that it is, however, entirely possible that you could reduce that complexity by restricting the set or definition (ie, only English language strings). It's possible that the complexity of that question is vastly smaller than the complexity of the unbounded one and, possibly, mappable to a smaller integer space.
  • Nick Johnson
    Nick Johnson over 12 years
    @DougW I'm afraid I don't follow your argument. Kolmogorov complexity is irrelevant here; the issue is dimensionality. Can you assign a number to each of those phrases such that the distance between the numbers reflects the edit distance between the strings?
  • DougW
    DougW over 12 years
    Yeah it's a lot to type into a comment! My point is that "integers" in this case need not be restricted to single dimensional, linear comparisons as you're suggesting. The reductio ad absurdum of your statement is that you yourself have already given "numbers" which contain all the dimensionality necessary to make any comparison you want. However that is the extreme. What if we could map the ENTIRE space of string characteristics we care about to two bits, where each bit represented some relationship dimension? Probably we can't, but what about 3 bits? 4? Where do we draw the line?
  • DougW
    DougW over 12 years
    This is where the K complexity comes into play. If we consider "likeness" to be "the number of letter As", then the complexity of that space is small enough to map to an easily understood number. As we increase the complexity of our definition of "like", the K complexity of the problem space increases, and the representation becomes more complex as well. We may reduce this complexity by discarding unimportant information. It's very much like representing the Earth on a 2D map. We can map to lower dimensions at the expense of information and still approximate the info we care about.
  • Nick Johnson
    Nick Johnson over 12 years
    @DougW There's two problems with what you're suggesting. First, as I've already pointed out, string similarity is a metric space, and doesn't have a fixed dimension. If you try and categorise string differences into a fixed number of dimensions, there will always be a counter-example that doesn't work.
  • Nick Johnson
    Nick Johnson over 12 years
    In response to your other assertion, yes, a string is a number, but the operation the OP is specifically asking for is "This way, finding the most similar strings to a given string would be done by subtracting given strings score from other scores and then sorting their absolute value." - that is, integers under subtraction. That requires reducing each string to a single dimension, which isn't possible as I already demonstrated.
  • DougW
    DougW over 12 years
    Sure, I'm not arguing with your answer for unbounded, arbitrary string comparisons. My point is that you're making an assumption about what "alike" means, so your answer begs the question. One can easily come up with definitions (or a set of definitions) for "likeness" that CAN be mapped to single dimensions, or a small group of them. Would your answer change if I defined "likeness" as "the number of vowels"? The frequency of a given n-gram? We could come up with a small set of these questions that are each measurable and, together, create a signature that achieves the spirit of the question.
  • Nick Johnson
    Nick Johnson over 12 years
    @DougW Sure, I'm making an assumption, but it's an eminently reasonable one - that the OP meant to use some sort of edit distance measure. And you cannot come up with definitions that can be mapped to a fixed set of dimensions and still preserve edit distance - that's the whole point of saying it's a metric space.
  • DougW
    DougW over 12 years
    Alright, clearly that's your interpretation of the question and I respect that. Personally I don't see anything in his question that indicates he cares about "edit distance". He cares about similarity of strings (see his response to mbeckish in the comments), which I believe is a different question and can be at least approximated by hashing (see my own answer).
  • Alvin
    Alvin over 10 years
    I was looking for something like this, but now I doubt that it's possible, because Levenshtein distance is not like Euclidean distance, where dist(a,c) = dist(a,b) +/- dist(b,c).
  • rocketspacer
    rocketspacer about 8 years
    This works if you choose a base string and calculate the distance from it to all of your strings; check out my answer.
  • rocketspacer
    rocketspacer about 8 years
    calculate edit distance, choose: base="one two"; ED(base, "one two") = 0; ED(base, "one six") = 6; ED(base, "two six") = 8; the number doesn't represent the value of the string universally, but it can somewhat determine the order of differences in your given collection
  • Nick Johnson
    Nick Johnson about 8 years
    @nmutan You haven't defined ED("one six", "two six"). As soon as you do, you'll find that at least one pair has an integer difference that's inconsistent with its edit distance.
  • ArtOfWarfare
    ArtOfWarfare almost 7 years
    @KarlKnechtel - Even if it was one-dimensional, it could be that a and c are the same and so distance(a,c) would be 0 even though the distance from either of them to b is 1.
  • Javasick
    Javasick about 6 years
    that is brilliant!