Optimizing Jaro-Winkler algorithm


Solution 1

Yes, but you aren't going to enjoy it. Replace all those newly allocated StringBuffers with char arrays that are allocated once in the constructor and then reused on every call, using integer indices to keep track of what's in them.

This pending Commons-Lang patch will give you some of the flavor.
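To give a rough idea of what that looks like, here is a minimal sketch of the char-array approach. It is not the Commons-Lang patch. The class name FastJaro and the maxLen constructor argument are made up for illustration, and it assumes a single-threaded caller and inputs no longer than maxLen. It also uses 'm' rather than 'm - 1' as the inner loop bound, as suggested in Solution 2.

```java
final class FastJaro {
    // Scratch buffers allocated once and reused on every call.
    // maxLen is a hypothetical upper bound on input length; longer inputs throw.
    private final char[] copy;
    private final char[] common1;
    private final char[] common2;

    FastJaro(int maxLen) {
        copy = new char[maxLen];
        common1 = new char[maxLen];
        common2 = new char[maxLen];
    }

    float getSimilarity(String s1, String s2) {
        int minLen = Math.min(s1.length(), s2.length());
        int halflen = minLen / 2 + minLen % 2; // half the shorter length, rounded up

        // The int counts replace StringBuffer.length().
        int c1 = commonChars(s1, s2, halflen, common1);
        int c2 = commonChars(s2, s1, halflen, common2);
        if (c1 == 0 || c1 != c2) {
            return 0.0f;
        }

        int mismatches = 0;
        for (int i = 0; i < c1; i++) {
            if (common1[i] != common2[i]) {
                mismatches++;
            }
        }
        int transpositions = mismatches / 2;

        return (c1 / (float) s1.length()
                + c2 / (float) s2.length()
                + (c1 - transpositions) / (float) c1) / 3.0f;
    }

    // Writes the common characters into out and returns how many were written.
    private int commonChars(String s1, String s2, int distanceSep, char[] out) {
        int n = s1.length();
        int m = s2.length();
        s2.getChars(0, m, copy, 0); // working copy of s2 without allocating
        int count = 0;
        for (int i = 0; i < n; i++) {
            char ch = s1.charAt(i);
            int end = Math.min(i + distanceSep, m);
            for (int j = Math.max(0, i - distanceSep); j < end; j++) {
                if (copy[j] == ch) {
                    out[count++] = ch;
                    copy[j] = (char) 0; // mark as used, as in the original code
                    break;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        FastJaro jaro = new FastJaro(64);
        System.out.println(jaro.getSimilarity("MARTHA", "MARHTA"));
    }
}
```

Create one FastJaro up front and call getSimilarity in the loop. After construction, the sketch allocates nothing per comparison. Whether that is enough of a speed-up on a given device would need measuring.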

Solution 2

I know this question has probably been solved for some time, but I would like to comment on the algorithm itself. When comparing a string against itself, the result comes out below 1, off by an amount on the order of 1/|string|, because the last character of the second string is never examined. When comparing slightly different values, the scores also turn out lower than they should be.

The solution is to change 'm-1' to 'm' in the condition of the inner for-statement within the getCommonCharacters method. The code then works like a charm :)
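To make the effect visible, here is a small sketch that mirrors getCommonCharacters from the question but takes the loop's upper limit as an argument, so both bounds can be compared side by side. The class name CommonCharsDemo and the upperLimit parameter are made up for this illustration and are not part of the original code.

```java
final class CommonCharsDemo {
    // Same matching logic as getCommonCharacters in the question.
    // upperLimit is a hypothetical extra parameter: pass m - 1 for the
    // original behaviour, m for the corrected one.
    static String commonChars(String s1, String s2, int distanceSep, int upperLimit) {
        StringBuilder common = new StringBuilder();
        StringBuilder copy = new StringBuilder(s2);
        for (int i = 0; i < s1.length(); i++) {
            char ch = s1.charAt(i);
            int end = Math.min(i + distanceSep, upperLimit);
            for (int j = Math.max(0, i - distanceSep); j < end; j++) {
                if (copy.charAt(j) == ch) {
                    common.append(ch);
                    copy.setCharAt(j, (char) 0); // mark as used
                    break;
                }
            }
        }
        return common.toString();
    }

    public static void main(String[] args) {
        String s = "abc";
        int halflen = s.length() / 2 + s.length() % 2; // 2 for "abc"
        // Original bound: index m - 1 is never visited, so 'c' is missed.
        System.out.println(commonChars(s, s, halflen, s.length() - 1)); // prints "ab"
        // Corrected bound: all three characters are found.
        System.out.println(commonChars(s, s, halflen, s.length()));     // prints "abc"
    }
}
```

With only two of three characters in common, the Jaro formula gives (2/3 + 2/3 + 1) / 3, roughly 0.78, for "abc" against itself instead of 1.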

See http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance as well for some examples.


Pentium10
Author by

Pentium10


Updated on April 22, 2022

Comments

  • Pentium10
    Pentium10 about 2 years

    I have this code for the Jaro-Winkler algorithm, taken from this website. I need to run it 150,000 times to compute the distances between strings. It takes a long time, as I run it on an Android mobile device.

    Can it be optimized more?

    public class Jaro {
        /**
         * gets the similarity of the two strings using Jaro distance.
         *
         * @param string1 the first input string
         * @param string2 the second input string
         * @return a value between 0-1 of the similarity
         */
        public float getSimilarity(final String string1, final String string2) {
    
            //get half the length of the string rounded up - (this is the distance used for acceptable transpositions)
            final int halflen = ((Math.min(string1.length(), string2.length())) / 2) + ((Math.min(string1.length(), string2.length())) % 2);
    
            //get common characters
            final StringBuffer common1 = getCommonCharacters(string1, string2, halflen);
            final StringBuffer common2 = getCommonCharacters(string2, string1, halflen);
    
            //check for zero in common
            if (common1.length() == 0 || common2.length() == 0) {
                return 0.0f;
            }
    
            //check that the common strings have the same length, returning 0.0f if they do not
            if (common1.length() != common2.length()) {
                return 0.0f;
            }
    
            //get the number of transpositions
            int transpositions = 0;
            int n=common1.length();
            for (int i = 0; i < n; i++) {
                if (common1.charAt(i) != common2.charAt(i))
                    transpositions++;
            }
            transpositions /= 2.0f;
    
            //calculate jaro metric
            return (common1.length() / ((float) string1.length()) +
                    common2.length() / ((float) string2.length()) +
                    (common1.length() - transpositions) / ((float) common1.length())) / 3.0f;
        }
    
        /**
         * returns a string buffer of characters from string1 within string2 if they are within a given
         * distance separation from the position in string1.
         *
         * @param string1 the string whose characters are searched for
         * @param string2 the string that is searched
         * @param distanceSep the maximum distance from the position in string1
         * @return a string buffer of characters from string1 within string2 if they are within a given
         *         distance separation from the position in string1
         */
        private static StringBuffer getCommonCharacters(final String string1, final String string2, final int distanceSep) {
            //create a return buffer of characters
            final StringBuffer returnCommons = new StringBuffer();
            //create a copy of string2 for processing
            final StringBuffer copy = new StringBuffer(string2);
            //iterate over string1
            int n=string1.length();
            int m=string2.length();
            for (int i = 0; i < n; i++) {
                final char ch = string1.charAt(i);
                //set boolean for quick loop exit if found
                boolean foundIt = false;
                //compare char with range of characters to either side
    
                for (int j = Math.max(0, i - distanceSep); !foundIt && j < Math.min(i + distanceSep, m - 1); j++) {
                    //check if found
                    if (copy.charAt(j) == ch) {
                        foundIt = true;
                        //append character found
                        returnCommons.append(ch);
                        //alter copied string2 for processing
                        copy.setCharAt(j, (char)0);
                    }
                }
            }
            return returnCommons;
        }
    }
    

    Note that I create only one instance of the class for the whole process, so this runs only once:

    jaro= new Jaro();
    

    If you are going to test this and need examples to make sure you do not break the code, you will find them here, in another thread about Python optimization.