How to check if two words are anagrams

174,455

Solution 1

Fastest algorithm would be to map each of the 26 English characters to a unique prime number. Then calculate the product of the string. By the fundamental theorem of arithmetic, 2 strings are anagrams if and only if their products are the same.

Solution 2

Two words are anagrams of each other if they contain the same number of characters and the same characters. You should only need to sort the characters in lexicographic order, and determine if all the characters in one string are equal to and in the same order as all of the characters in the other string.

Here's a code example. Look into Arrays in the API to understand what's going on here.

public boolean isAnagram(String firstWord, String secondWord) {
     char[] word1 = firstWord.replaceAll("[\\s]", "").toCharArray();
     char[] word2 = secondWord.replaceAll("[\\s]", "").toCharArray();
     Arrays.sort(word1);
     Arrays.sort(word2);
     return Arrays.equals(word1, word2);
}

Solution 3

If you sort either array, the solution becomes O(n log n). but if you use a hashmap, it's O(n). tested and working.

char[] word1 = "test".toCharArray();
char[] word2 = "tes".toCharArray();

Map<Character, Integer> lettersInWord1 = new HashMap<Character, Integer>();

for (char c : word1) {
    int count = 1;
    if (lettersInWord1.containsKey(c)) {
        count = lettersInWord1.get(c) + 1;
    }
    lettersInWord1.put(c, count);
}

for (char c : word2) {
    int count = -1;
    if (lettersInWord1.containsKey(c)) {
        count = lettersInWord1.get(c) - 1;
    }
    lettersInWord1.put(c, count);
}

for (char c : lettersInWord1.keySet()) {
    if (lettersInWord1.get(c) != 0) {
        return false;
    }
}

return true;

Solution 4

Here's a simple fast O(n) solution without using sorting or multiple loops or hash maps. We increment the count of each character in the first array and decrement the count of each character in the second array. If the resulting counts array is full of zeros, the strings are anagrams. Can be expanded to include other characters by increasing the size of the counts array.

class AnagramsFaster{

    private static boolean compare(String a, String b){
        char[] aArr = a.toLowerCase().toCharArray(), bArr = b.toLowerCase().toCharArray();
        if (aArr.length != bArr.length)
            return false;
        int[] counts = new int[26]; // An array to hold the number of occurrences of each character
        for (int i = 0; i < aArr.length; i++){
            counts[aArr[i]-97]++;  // Increment the count of the character at i
            counts[bArr[i]-97]--;  // Decrement the count of the character at i
        }
        // If the strings are anagrams, the counts array will be full of zeros
        for (int i = 0; i<26; i++)
            if (counts[i] != 0)
                return false;
        return true;
    }

    public static void main(String[] args){
        System.out.println(compare(args[0], args[1]));
    }
}

Solution 5

Lots of people have presented solutions, but I just want to talk about the algorithmic complexity of some of the common approaches:

  • The simple "sort the characters using Arrays.sort()" approach is going to be O(N log N).

  • If you use radix sorting, that reduces to O(N) with O(M) space, where M is the number of distinct characters in the alphabet. (That is 26 in English ... but in theory we ought to consider multi-lingual anagrams.)

  • The "count the characters" using an array of counts is also O(N) ... and faster than radix sort because you don't need to reconstruct the sorted string. Space usage will be O(M).

  • A "count the characters" using a dictionary, hashmap, treemap, or equivalent will be slower that the array approach, unless the alphabet is huge.

  • The elegant "product-of-primes" approach is unfortunately O(N^2) in the worst case This is because for long-enough words or phrases, the product of the primes won't fit into a long. That means that you'd need to use BigInteger, and N times multiplying a BigInteger by a small constant is O(N^2).

    For a hypothetical large alphabet, the scaling factor is going to be large. The worst-case space usage to hold the product of the primes as a BigInteger is (I think) O(N*logM).

  • A hashcode based approach is usually O(N) if the words are not anagrams. If the hashcodes are equal, then you still need to do a proper anagram test. So this is not a complete solution.


I would also like to highlight that most of the posted answers assume that each code-point in the input string is represented as a single char value. This is not a valid assumption for code-points outside of the BMP (plane 0); e.g. if an input string contains emojis.

The solutions that make the invalid assumption will probably work most of the time anyway. A code-point outside of the BMP will represented in the string as two char values: a low surrogate and a high surrogate. If the strings contain only one such code-point, we can get away with treating the char values as if they were code-points. However, we can get into trouble when the strings being tested contain 2 or more code-points. Then the faulty algorithms will fail to distinguish some cases. For example, [SH1, SL1, SH2, SL2] versus [SH1, SL2, SH2, SL1] where the SH<n> and SL<2> denote high and low surrogates respectively. The net result will be false anagrams.

Alex Salauyou's answer gives a couple of solutions that will work for all valid Unicode code-points.

Share:
174,455
Chilli
Author by

Chilli

Updated on July 25, 2022

Comments

  • Chilli
    Chilli almost 2 years

    I have a program that shows you whether two words are anagrams of one another. There are a few examples that will not work properly and I would appreciate any help, although if it were not advanced that would be great, as I am a 1st year programmer. "schoolmaster" and "theclassroom" are anagrams of one another, however when I change "theclassroom" to "theclafsroom" it still says they are anagrams, what am I doing wrong?

    import java.util.ArrayList;
    public class AnagramCheck {
        public static void main(String args[]) {
            String phrase1 = "tbeclassroom";
            phrase1 = (phrase1.toLowerCase()).trim();
            char[] phrase1Arr = phrase1.toCharArray();
    
            String phrase2 = "schoolmaster";
            phrase2 = (phrase2.toLowerCase()).trim();
            ArrayList<Character> phrase2ArrList = convertStringToArraylist(phrase2);
    
            if (phrase1.length() != phrase2.length()) {
                System.out.print("There is no anagram present.");
            } else {
                boolean isFound = true;
                for (int i = 0; i < phrase1Arr.length; i++) {
                    for (int j = 0; j < phrase2ArrList.size(); j++) {
                        if (phrase1Arr[i] == phrase2ArrList.get(j)) {
                            System.out.print("There is a common element.\n");
                            isFound =;
                            phrase2ArrList.remove(j);
                        }
                    }
                    if (isFound == false) {
                        System.out.print("There are no anagrams present.");
                        return;
                    }
                }
                System.out.printf("%s is an anagram of %s", phrase1, phrase2);
            }
        }
    
        public static ArrayList<Character> convertStringToArraylist(String str) {
            ArrayList<Character> charList = new ArrayList<Character>();
            for (int i = 0; i < str.length(); i++) {
                charList.add(str.charAt(i));
            }
            return charList;
        }
    }
    
  • MPogoda
    MPogoda over 11 years
    @jb. I was just fixing logical issues
  • Chilli
    Chilli over 11 years
    Thanks so much for that! Solved all my problems straight away! :D
  • Chilli
    Chilli over 11 years
    Brilliant help! Thanks for that! :D
  • Admin
    Admin about 11 years
    AFAIK sorting twice is N(Log N)*N(Log N). This algorithm requires NN. Also, the second loop is not always executed which brings it average between N and NN. I did some rough testing with System.nanoTime() time diff, which shows that sorting is slower. Maybe I am overlooking something....
  • Admin
    Admin about 11 years
    correction: O(N LogN) + O(N Log N) vs O(N) < performance < O(N+N)
  • vonbrand
    vonbrand about 11 years
    It is $O(N \log N)$ versus $O(N^2)$, actually.
  • Admin
    Admin about 11 years
    Maybe you can help me out why this is the case and why performance is slower for sort. I looked at Skiena's algorithm design manual, and agree with NlogN. The manual states that O(f(n)) +O(g(n)) = O(max(f(n),g(n))). Hence, the result becomes O(NlogN) vs N? Maybe it is me, but I do not see why this algorithm is N^2 as I do not see any nested loops unless they are hidden somewhere in JAVA? Also, why does it have lower CPU time?
  • Dejell
    Dejell about 11 years
    Why does it answer the question?
  • assylias
    assylias about 11 years
    Fast probably but fastest probably not. In that case why not simply use two char[26]? And it has the obvious caveat that it only works for English.
  • SeriousBusiness
    SeriousBusiness about 11 years
    should work for all characters. Also it's O(n), which is faster than sorting.
  • le-doude
    le-doude almost 11 years
    Your algorithm runs in O(NlogN) (N is the lenght of both the strings together) there is a O(N) algorithm using O(N/2)~O(N) memory. See my answer
  • Koray Tugay
    Koray Tugay over 10 years
    I think you should make all the characters to lowercase as well.
  • Koray Tugay
    Koray Tugay over 10 years
    Two phrases are anagrams if they are permutations of each other, ignoring spaces and capitalization. cs.duke.edu/csed/algoprobs/anagram.html
  • tucuxi
    tucuxi over 10 years
    For a long phrase you would quickly overflow 64 bits, and BigDecimals are /slow/; an int[26] is a much safer (and faster) bet.
  • tucuxi
    tucuxi over 10 years
    You need to specify <Character, Integer> or it will not compile. Autoboxing can take care of the rest
  • Eric Woodruff
    Eric Woodruff over 10 years
    Why not first compare the lengths of each word before sorting them?
  • Makoto
    Makoto over 10 years
    @EricWoodruff: This probably wasn't intended as a 100% complete example; there are obvious optimizations in comparing lengths to begin with as well as forcing everything to be the same case. This was meant to be illustrative as opposed to cut-and-paste valid code.
  • Shubhankar Raj
    Shubhankar Raj about 10 years
    Can you kindly clarify the segment for(char c : word1){ lettersInWord1.get(c)++; } for(char c : word2) { lettersInWord1.get(c)--; }
  • maria nabil
    maria nabil about 10 years
    I'm adding up how many letters of each type are in the first word, "1 a, 2 b's, 6 d's" etc. Then I subtract the count of how many letters of each type are in the second word. If the letters are anagrams, they have the same number of letters of the same type. So if the map has all zero's, they are anagrams.
  • anshulkatta
    anshulkatta almost 10 years
    for every put in the hashmap , hashcode and equals are called...is this really good for big strings...i think sort will be much faster than this in that case
  • anshulkatta
    anshulkatta almost 10 years
    but the map is empty , you have not put the characters in
  • Zoran Horvat
    Zoran Horvat almost 10 years
    Solution with hash set takes O(n) time and O(m) additional space, where m is total number of distinct characters in two strings. Basically, m should not exceed 100 or so if strings are plain text. This means that hash set solution reduces asymptotic execution time from O(nlogn) to O(n) and reduces added space from O(n) to practically O(1). This should have significant impact in large strings.
  • greybeard
    greybeard almost 10 years
    Is "ac" an anagram of "bb"? Also, how many hash codes are there, and how many combinations of characters?
  • Ripu Daman
    Ripu Daman almost 10 years
    I got your point. I did the little modification. It is generating the hash codes at runtime for any combination of string of unicode characters.
  • greybeard
    greybeard almost 10 years
    While your overload of equals does implement an equivalence relation, it throws on null == obj instead of returning false. Worse, it does not implement equality wrt anagrams. Take strings over an alphabet with just 10 characters of numerical value 1 to 10 (to facilitate "multiplicative hashing"). Let your hashcode be one decimal digit, and consider strings of two characters, only: how many equivalence classes are there? Once you have more classes than codes, no amount of cleverness in encoding will save your day.
  • Stephen C
    Stephen C over 9 years
    This approach can only do the "not an anagram" test fast. If two words / phrases have the same hashcode, you still have to do a longer test to see if they are algorithms.
  • Ripu Daman
    Ripu Daman over 9 years
    @greybeard I didn't implement checks for null here because that wasn't the intent here. I was mainly focusing on finding the anagram easily. However I didn't understand your other comment.
  • Ripu Daman
    Ripu Daman over 9 years
    @StephenC As per this approach, two strings will have same hash code only if they are anagram.
  • Stephen C
    Stephen C over 9 years
    @ripudam - That is incorrect. In fact, it is simple to prove that it is incorrect. There 2^32 possible values for an int hashcode. The proof follows directly from that.
  • Ripu Daman
    Ripu Daman over 9 years
    @StephenC This is a great point but that doesn't show that this program will not work for. However, to be on the safer side, I added the length equality check. Can you provide the inputs for which this program wouldn't work?
  • greybeard
    greybeard over 9 years
    This is about checking strings (or even Strings) for being anagrams, not about providing counter examples for implementations of approaches proven wrong. In Unicode, any letter from the latin alphabet has an encoding 32 bigger than that of its capitalisation. Repeat each 2^26 times, check with AnagramTest.isAnagram. With that out of the way, do somebody a favour: how many equivalence classes are there for strings of length two from an alphabet of size ten? There can be no injective function from a set into a smaller one, let alone from a (countably) infinite set into any finite set.
  • Stephen C
    Stephen C over 9 years
    @ripudam - If you are interested, you can generate an example for yourself. Start with a word (say "anagramatic") and iterate over all other alphabetic strings of the same length ... until you find one that has the same hashcode. Just a few billion iterations should be enough to find one. Once you have found one, the chances are that that string won't be an anagram of "anagramatic". But why bother ...
  • SuRa
    SuRa over 9 years
    I am getting Invalid argument to operation ++/-- for the line lettersInWord1.get(c)++; lettersInWord1.get(c)--;
  • seriously divergent
    seriously divergent about 9 years
    Can't we just use the character values instead of prime numbers, I think it will also work, isnt it?
  • chakming
    chakming almost 9 years
    @user2900314 I've add an answer to the 'AD'/'AC'-'BB' issue stackoverflow.com/questions/15045640/…
  • Ar5hv1r
    Ar5hv1r almost 9 years
    Both String.indexOf() and String.replaceFirst() are O(n) operations where n is the length of the string. You can't just "ignore" certain function calls; this is an O(n^2) solution.
  • okello
    okello almost 9 years
    You prompted me to do some research on this. And you are right. My solution has a time complexity of O(n^2). Whereas the String.indexOf(..) might be fast in practice, the String.replaceFirst() seems slower, especially in loops. It can be sped up, though, by using regex directly where a pre-compiled regex pattern is used in the loop. The String.replace*(..) seems to use regex under the hood and actually compiles a new regex pattern on every call. Thanks a lot for pointing this out.
  • Codebender
    Codebender almost 9 years
    This wont work for strings "E" = 5 and "CD" = 3,4. Or for that matter any Pythagorean triplet.
  • chakming
    chakming almost 9 years
    @Codebender Length is checked before comparison
  • John
    John over 8 years
    Found a youtube video explaining the exact same code. youtube.com/watch?v=jjMIy3_ix-Q
  • Admin
    Admin over 8 years
    @user2900314 Key observation is that it must work for prime numbers. Anagrams will have the same "prime factorization".
  • chakming
    chakming over 8 years
    plus, int value of alphabetical char is hard to cause such possible cases
  • Makoto
    Makoto over 8 years
    @John: Thanks for spotting that. I hadn't noticed or paid it much attention before, but that's good to know. Seems like it carries the same issues as pointed out prior, anyway.
  • mur7ay
    mur7ay over 8 years
    This should be the best answer.
  • Alex Salauyou
    Alex Salauyou over 8 years
    How will it work? The first .get(c)++ will throw an NPE
  • maria nabil
    maria nabil over 8 years
    "this is untested, might be some syntax errors" I'll update it to be correct syntax
  • Paul Hankin
    Paul Hankin about 8 years
    I disagree with comments that this is O(N log N). Java's sort (for primitive types) is a quicksort variant, and quicksort runs in worst case O(N) time if the objects to be sorted are drawn from a finite set (such as values of type char).
  • Valentin Montmirail
    Valentin Montmirail over 7 years
    That's a beautiful use of Prime Numbers ;) congrats !
  • Rajavel D
    Rajavel D over 7 years
    Is this code work properly poll & pool. It will return as true but its not.
  • Bill the Lizard
    Bill the Lizard about 7 years
    This is clever, but not the fastest. Multiplication is not O(n).
  • miracle173
    miracle173 about 7 years
    this is an awful post that shouldn't be upvoted (or even downvoted) as long it contains the claim it is the fastes algorithm. the question is a duplicated and there the accepted solutione is a fast one. But the idea with the primes is nice. A similar idea can be found in this video that solves a puzzle presented here: the instant insanity puzzle
  • localhost
    localhost almost 7 years
    Last iteration can be simplified: iterating over lettersInWord1.values() and check for each value for zero. Also, second iteration can be optimized by quick return (false) if !contains.
  • Amardeep Kumar
    Amardeep Kumar almost 7 years
    We can just use the ASCII value summation. If it matches then it is Anagram. check here: stackoverflow.com/a/45236219/1425181
  • Tarek
    Tarek over 6 years
    just a quick little improvement, I'd add the following test, that checks if the two strings are equal (ignoring case), as that would make them, naturally, anagrams: if (a.equalsIgnoreCase(b)) return true.
  • Shanu Gupta
    Shanu Gupta about 6 years
    Perfect O(n) and O(1) solution.
  • Karl
    Karl over 5 years
    If we compare the length of the strings at the start to make sure they are equal, can't we assume that the last iteration is not needed as long as we check for when the count in the second iteration goes below 0?
  • jkonst
    jkonst over 5 years
    The cost of Arrays.sort is O(nlogn) > O(n)
  • Vhndaree
    Vhndaree over 5 years
    any sum of the square of 3 numbers will never equal to the sum of the square of other 3 numbers. is it just your assumption or proved somewhere.
  • HASSAN MD TAREQ
    HASSAN MD TAREQ almost 5 years
    How about using 2 Hashsets?
  • pkuderov
    pkuderov about 4 years
    The proposed method is wrong as it compares the sum of character codes of the strings. If two strings are anagrams, then they have the same sum, But the opposite is not true in general.
  • xrisk
    xrisk almost 4 years
    Do not do this please. This is a party trick, not actual code to be used.
  • kcsquared
    kcsquared over 2 years
    This still doesn't work; "add" and "ebb" both have 29,409 as the sum of squares of characters.
  • phatfingers
    phatfingers over 2 years
    In Java, the "^" operator means XOR, so I don't think this code does what is intended.
  • chux - Reinstate Monica
    chux - Reinstate Monica over 2 years
    Code can fail when 1) encoding is not ASCII, 2) single letter occurrence exceeds INT_MAX, 3) Other locales with more than 26 letters or multiple case mappings to A-Z. Other than that, looks good.
  • chux - Reinstate Monica
    chux - Reinstate Monica over 2 years
    First 26 primes [2...97] need 7 bits, so a 64-bit integer product well handles up to a 9 A-Z letter word. Past that, extended math incurs its own cost past O(n).
  • chux - Reinstate Monica
    chux - Reinstate Monica over 2 years
    @AmardeepKumar ASCII value summation is insufficient. "03", "12" has same ASCII sum yet are not anagrams.