How to compare almost similar Strings in Java? (String distance measure)


Solution 1

The Levenshtein distance is a measure of how similar two strings are. More precisely, it is the number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other.

The algorithm is available in pseudo-code on Wikipedia. Converting that to Java shouldn't be much of a problem, but it is not built into the base class library.
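As a rough illustration of what such a conversion could look like, here is a minimal sketch of the standard two-row dynamic-programming formulation. The class and method names (LevenshteinSketch, distance) are invented for this example and do not come from any library; the comparison is done on Java char values, so characters outside the Basic Multilingual Plane count as two units.

```java
// Illustrative sketch only; names are hypothetical, not a library API.
final class LevenshteinSketch {

    /** Minimum number of single-character insertions, deletions and substitutions turning a into b. */
    static int distance(String a, String b) {
        // previous[j] = distance between the first (i - 1) chars of a and the first j chars of b
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];

        // Building the first j chars of b from an empty string takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // reducing the first i chars of a to an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
        // Dropping "almost " costs 7 deletions.
        System.out.println(distance("The sentence is almost similar", "The sentence is similar")); // prints 7
    }
}
```

If a score between 0 and 1 is preferred, one common convention is to divide the distance by the length of the longer string and subtract the result from 1.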

Wikipedia has some more algorithms that measure similarity of strings.

Solution 2

The following Java libraries offer multiple comparison algorithms (Levenshtein, Jaro-Winkler, ...):

  1. Apache Commons Lang 3: https://commons.apache.org/proper/commons-lang/
  2. Simmetrics: http://sourceforge.net/projects/simmetrics/

Both libraries provide Javadoc (Apache Commons Lang Javadoc, Simmetrics Javadoc).

// Usage of Apache Commons Lang 3
// Note: later Commons Lang releases deprecate this method in favor of Apache Commons Text.
import org.apache.commons.lang3.StringUtils;

public double compareStrings(String stringA, String stringB) {
    return StringUtils.getJaroWinklerDistance(stringA, stringB);
}

// Usage of Simmetrics
import uk.ac.shef.wit.simmetrics.similaritymetrics.JaroWinkler;

public double compareStrings(String stringA, String stringB) {
    JaroWinkler algorithm = new JaroWinkler();
    return algorithm.getSimilarity(stringA, stringB);
}
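To give an idea of what a Jaro-Winkler score represents, here is a minimal plain-Java sketch of the textbook formula: the Jaro similarity, boosted by a common prefix of up to 4 characters with a scaling factor of 0.1. The class and method names are made up for this example. The libraries above may differ in details such as thresholds, scaling, or rounding, so their results will not necessarily match this sketch.

```java
// Illustrative sketch only; names are hypothetical, not a library API.
final class JaroWinklerSketch {

    /** Textbook Jaro similarity in the range [0, 1]. */
    static double jaro(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        // Characters only count as matching if they are at most this far apart.
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(b.length() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Walk both strings' matched characters in order and count those that disagree.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) {
                continue;
            }
            while (!bMatched[j]) {
                j++;
            }
            if (a.charAt(i) != b.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / a.length() + m / b.length() + (m - transpositions) / m) / 3.0;
    }

    /** Jaro similarity boosted for a shared prefix (at most 4 chars, scaling factor 0.1). */
    static double jaroWinkler(String a, String b) {
        double jaro = jaro(a, b);
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("MARTHA", "MARHTA")); // approximately 0.961
    }
}
```

Unlike Levenshtein, which counts edits, this yields a similarity where 1 means identical and 0 means no matching characters.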

Solution 3

Yeah, that's a good metric. You could use StringUtils.getLevenshteinDistance() from Apache Commons Lang.

Solution 4

You can find implementations of Levenshtein and other string similarity/distance measures on https://github.com/tdebatty/java-string-similarity

If your project uses Maven, installation is as simple as adding this dependency:

<dependency>
  <groupId>info.debatty</groupId>
  <artifactId>java-string-similarity</artifactId>
  <version>RELEASE</version>
</dependency>

Then, to use Levenshtein, for example:

import info.debatty.java.stringsimilarity.*;

public class MyApp {

  public static void main (String[] args) {
    Levenshtein l = new Levenshtein();

    // The two strings differ by a single substitution ('s' vs '$').
    System.out.println(l.distance("My string", "My $tring"));
  }
}

Solution 5

Shameless plug, but I wrote a library also:

https://github.com/vickumar1981/stringdistance

It has all these functions, plus a few for phonetic similarity (whether one word "sounds like" another). The phonetic checks return either true or false, unlike the other fuzzy similarity measures, which return a number between 0 and 1.

It also includes sequence alignment algorithms used on DNA, such as Smith-Waterman and Needleman-Wunsch, which are generalized versions of Levenshtein.
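To make the "generalized Levenshtein" relationship concrete, here is a minimal sketch of Needleman-Wunsch global alignment scoring with configurable match, mismatch, and gap scores. It is not the library's code, and the class and method names are invented for this example. With match = 0, mismatch = -1, and gap = -1, the best alignment score is the negated Levenshtein distance.

```java
// Illustrative sketch only; names are hypothetical, not the library's API.
final class NeedlemanWunschSketch {

    /** Best global alignment score of a and b under the given scoring scheme. */
    static int score(String a, String b, int match, int mismatch, int gap) {
        // best[i][j] = best score aligning the first i chars of a with the first j chars of b
        int[][] best = new int[a.length() + 1][b.length() + 1];

        // Aligning a prefix against nothing means one gap per character.
        for (int i = 1; i <= a.length(); i++) {
            best[i][0] = i * gap;
        }
        for (int j = 1; j <= b.length(); j++) {
            best[0][j] = j * gap;
        }

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diagonal = best[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                int gapInB = best[i - 1][j] + gap; // a's char aligned with a gap
                int gapInA = best[i][j - 1] + gap; // b's char aligned with a gap
                best[i][j] = Math.max(diagonal, Math.max(gapInB, gapInA));
            }
        }
        return best[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Levenshtein as a special case: unit cost for mismatches and gaps, nothing for matches.
        System.out.println(-score("kitten", "sitting", 0, -1, -1)); // prints 3
    }
}
```

Smith-Waterman uses a similar recurrence but looks for the best local alignment, so cell values are not allowed to drop below zero and the result is the maximum over the whole table.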

I plan, in the near future, on making this work with any array and not just strings (an array of characters).

Author by

hsmit

Updated on July 09, 2022

Comments

  • hsmit almost 2 years

    I would like to compare two strings and get a score for how much they look alike. For example, "The sentence is almost similar" and "The sentence is similar".

    I'm not familiar with existing methods in Java, but for PHP I know the levenshtein function.

    Are there better methods in Java?

  • hsmit over 14 years
    It is not available in Java Mobile Edition, is it? But thanks for your response!
  • jspcal over 14 years
    You can use it with ME, just add the jar.
  • Valentin Rocher over 14 years
    Hmmm, no, I'm not really so sure it's completely usable with J2ME; it has been compiled with J2SE.
  • jspcal over 14 years
    It doesn't use anything ME doesn't support. You can make and copy in the jar.
  • bluevoid about 10 years
    Super lib, easy to use and good results.
  • Ian Jones almost 10 years
    It's available in Apache commons-lang now: commons.apache.org/proper/commons-lang/apidocs/org/apache/…
  • peater almost 9 years
    A library based on this is now on GitHub: github.com/Simmetrics/simmetrics. It is also available on Maven Central.