How to check if two words are anagrams

java string algorithm anagram

174,455

Solution 1

Fastest algorithm would be to map each of the 26 English characters to a unique prime number. Then calculate the product of the string. By the fundamental theorem of arithmetic, 2 strings are anagrams if and only if their products are the same.

Solution 2

Two words are anagrams of each other if they contain the same number of characters and the same characters. You should only need to sort the characters in lexicographic order, and determine if all the characters in one string are equal to and in the same order as all of the characters in the other string.

Here's a code example. Look into Arrays in the API to understand what's going on here.

public boolean isAnagram(String firstWord, String secondWord) {
     char[] word1 = firstWord.replaceAll("[\\s]", "").toCharArray();
     char[] word2 = secondWord.replaceAll("[\\s]", "").toCharArray();
     Arrays.sort(word1);
     Arrays.sort(word2);
     return Arrays.equals(word1, word2);
}

Solution 3

If you sort either array, the solution becomes O(n log n). but if you use a hashmap, it's O(n). tested and working.

char[] word1 = "test".toCharArray();
char[] word2 = "tes".toCharArray();

Map<Character, Integer> lettersInWord1 = new HashMap<Character, Integer>();

for (char c : word1) {
    int count = 1;
    if (lettersInWord1.containsKey(c)) {
        count = lettersInWord1.get(c) + 1;
    }
    lettersInWord1.put(c, count);
}

for (char c : word2) {
    int count = -1;
    if (lettersInWord1.containsKey(c)) {
        count = lettersInWord1.get(c) - 1;
    }
    lettersInWord1.put(c, count);
}

for (char c : lettersInWord1.keySet()) {
    if (lettersInWord1.get(c) != 0) {
        return false;
    }
}

return true;

Solution 4

Here's a simple fast O(n) solution without using sorting or multiple loops or hash maps. We increment the count of each character in the first array and decrement the count of each character in the second array. If the resulting counts array is full of zeros, the strings are anagrams. Can be expanded to include other characters by increasing the size of the counts array.

class AnagramsFaster{

    private static boolean compare(String a, String b){
        char[] aArr = a.toLowerCase().toCharArray(), bArr = b.toLowerCase().toCharArray();
        if (aArr.length != bArr.length)
            return false;
        int[] counts = new int[26]; // An array to hold the number of occurrences of each character
        for (int i = 0; i < aArr.length; i++){
            counts[aArr[i]-97]++;  // Increment the count of the character at i
            counts[bArr[i]-97]--;  // Decrement the count of the character at i
        }
        // If the strings are anagrams, the counts array will be full of zeros
        for (int i = 0; i<26; i++)
            if (counts[i] != 0)
                return false;
        return true;
    }

    public static void main(String[] args){
        System.out.println(compare(args[0], args[1]));
    }
}

Solution 5

Lots of people have presented solutions, but I just want to talk about the algorithmic complexity of some of the common approaches:

The simple "sort the characters using Arrays.sort()" approach is going to be O(N log N).
If you use radix sorting, that reduces to O(N) with O(M) space, where M is the number of distinct characters in the alphabet. (That is 26 in English ... but in theory we ought to consider multi-lingual anagrams.)
The "count the characters" using an array of counts is also O(N) ... and faster than radix sort because you don't need to reconstruct the sorted string. Space usage will be O(M).
A "count the characters" using a dictionary, hashmap, treemap, or equivalent will be slower that the array approach, unless the alphabet is huge.
The elegant "product-of-primes" approach is unfortunately O(N^2) in the worst case This is because for long-enough words or phrases, the product of the primes won't fit into a long. That means that you'd need to use BigInteger, and N times multiplying a BigInteger by a small constant is O(N^2).

For a hypothetical large alphabet, the scaling factor is going to be large. The worst-case space usage to hold the product of the primes as a BigInteger is (I think) O(N*logM).
A hashcode based approach is usually O(N) if the words are not anagrams. If the hashcodes are equal, then you still need to do a proper anagram test. So this is not a complete solution.

I would also like to highlight that most of the posted answers assume that each code-point in the input string is represented as a single char value. This is not a valid assumption for code-points outside of the BMP (plane 0); e.g. if an input string contains emojis.

The solutions that make the invalid assumption will probably work most of the time anyway. A code-point outside of the BMP will represented in the string as two char values: a low surrogate and a high surrogate. If the strings contain only one such code-point, we can get away with treating the char values as if they were code-points. However, we can get into trouble when the strings being tested contain 2 or more code-points. Then the faulty algorithms will fail to distinguish some cases. For example, [SH1, SL1, SH2, SL2] versus [SH1, SL2, SH2, SL1] where the SH<n> and SL<2> denote high and low surrogates respectively. The net result will be false anagrams.

Alex Salauyou's answer gives a couple of solutions that will work for all valid Unicode code-points.

View more solutions

174,455

Author by

Chilli

Updated on July 25, 2022

Comments

Chilli almost 2 years

I have a program that shows you whether two words are anagrams of one another. There are a few examples that will not work properly and I would appreciate any help, although if it were not advanced that would be great, as I am a 1st year programmer. "schoolmaster" and "theclassroom" are anagrams of one another, however when I change "theclassroom" to "theclafsroom" it still says they are anagrams, what am I doing wrong?

import java.util.ArrayList;
public class AnagramCheck {
    public static void main(String args[]) {
        String phrase1 = "tbeclassroom";
        phrase1 = (phrase1.toLowerCase()).trim();
        char[] phrase1Arr = phrase1.toCharArray();

        String phrase2 = "schoolmaster";
        phrase2 = (phrase2.toLowerCase()).trim();
        ArrayList<Character> phrase2ArrList = convertStringToArraylist(phrase2);

        if (phrase1.length() != phrase2.length()) {
            System.out.print("There is no anagram present.");
        } else {
            boolean isFound = true;
            for (int i = 0; i < phrase1Arr.length; i++) {
                for (int j = 0; j < phrase2ArrList.size(); j++) {
                    if (phrase1Arr[i] == phrase2ArrList.get(j)) {
                        System.out.print("There is a common element.\n");
                        isFound =;
                        phrase2ArrList.remove(j);
                    }
                }
                if (isFound == false) {
                    System.out.print("There are no anagrams present.");
                    return;
                }
            }
            System.out.printf("%s is an anagram of %s", phrase1, phrase2);
        }
    }

    public static ArrayList<Character> convertStringToArraylist(String str) {
        ArrayList<Character> charList = new ArrayList<Character>();
        for (int i = 0; i < str.length(); i++) {
            charList.add(str.charAt(i));
        }
        return charList;
    }
}

MPogoda over 11 years

@jb. I was just fixing logical issues
Chilli over 11 years

Thanks so much for that! Solved all my problems straight away! :D
Chilli over 11 years

Brilliant help! Thanks for that! :D
Admin about 11 years

AFAIK sorting twice is N(Log N)*N(Log N). This algorithm requires NN. Also, the second loop is not always executed which brings it average between N and NN. I did some rough testing with System.nanoTime() time diff, which shows that sorting is slower. Maybe I am overlooking something....
Admin about 11 years

correction: O(N LogN) + O(N Log N) vs O(N) < performance < O(N+N)
vonbrand about 11 years

It is $O(N \log N)$ versus $O(N^2)$, actually.
Admin about 11 years

Maybe you can help me out why this is the case and why performance is slower for sort. I looked at Skiena's algorithm design manual, and agree with NlogN. The manual states that O(f(n)) +O(g(n)) = O(max(f(n),g(n))). Hence, the result becomes O(NlogN) vs N? Maybe it is me, but I do not see why this algorithm is N^2 as I do not see any nested loops unless they are hidden somewhere in JAVA? Also, why does it have lower CPU time?
Dejell about 11 years

Why does it answer the question?
assylias about 11 years

Fast probably but fastest probably not. In that case why not simply use two char[26]? And it has the obvious caveat that it only works for English.
SeriousBusiness about 11 years

should work for all characters. Also it's O(n), which is faster than sorting.
le-doude almost 11 years

Your algorithm runs in O(NlogN) (N is the lenght of both the strings together) there is a O(N) algorithm using O(N/2)~O(N) memory. See my answer
Koray Tugay over 10 years

I think you should make all the characters to lowercase as well.
Koray Tugay over 10 years

Two phrases are anagrams if they are permutations of each other, ignoring spaces and capitalization. cs.duke.edu/csed/algoprobs/anagram.html
tucuxi over 10 years

For a long phrase you would quickly overflow 64 bits, and BigDecimals are /slow/; an int[26] is a much safer (and faster) bet.
tucuxi over 10 years

You need to specify <Character, Integer> or it will not compile. Autoboxing can take care of the rest
Eric Woodruff over 10 years

Why not first compare the lengths of each word before sorting them?
Makoto over 10 years

@EricWoodruff: This probably wasn't intended as a 100% complete example; there are obvious optimizations in comparing lengths to begin with as well as forcing everything to be the same case. This was meant to be illustrative as opposed to cut-and-paste valid code.
Shubhankar Raj about 10 years

Can you kindly clarify the segment for(char c : word1){ lettersInWord1.get(c)++; } for(char c : word2) { lettersInWord1.get(c)--; }
maria nabil about 10 years

I'm adding up how many letters of each type are in the first word, "1 a, 2 b's, 6 d's" etc. Then I subtract the count of how many letters of each type are in the second word. If the letters are anagrams, they have the same number of letters of the same type. So if the map has all zero's, they are anagrams.
anshulkatta almost 10 years

for every put in the hashmap , hashcode and equals are called...is this really good for big strings...i think sort will be much faster than this in that case
anshulkatta almost 10 years

but the map is empty , you have not put the characters in
Zoran Horvat almost 10 years

Solution with hash set takes O(n) time and O(m) additional space, where m is total number of distinct characters in two strings. Basically, m should not exceed 100 or so if strings are plain text. This means that hash set solution reduces asymptotic execution time from O(nlogn) to O(n) and reduces added space from O(n) to practically O(1). This should have significant impact in large strings.
greybeard almost 10 years

Is "ac" an anagram of "bb"? Also, how many hash codes are there, and how many combinations of characters?
Ripu Daman almost 10 years

I got your point. I did the little modification. It is generating the hash codes at runtime for any combination of string of unicode characters.
greybeard almost 10 years

While your overload of equals does implement an equivalence relation, it throws on null == obj instead of returning false. Worse, it does not implement equality wrt anagrams. Take strings over an alphabet with just 10 characters of numerical value 1 to 10 (to facilitate "multiplicative hashing"). Let your hashcode be one decimal digit, and consider strings of two characters, only: how many equivalence classes are there? Once you have more classes than codes, no amount of cleverness in encoding will save your day.
Stephen C over 9 years

This approach can only do the "not an anagram" test fast. If two words / phrases have the same hashcode, you still have to do a longer test to see if they are algorithms.
Ripu Daman over 9 years

@greybeard I didn't implement checks for null here because that wasn't the intent here. I was mainly focusing on finding the anagram easily. However I didn't understand your other comment.
Ripu Daman over 9 years

@StephenC As per this approach, two strings will have same hash code only if they are anagram.
Stephen C over 9 years

@ripudam - That is incorrect. In fact, it is simple to prove that it is incorrect. There 2^32 possible values for an int hashcode. The proof follows directly from that.
Ripu Daman over 9 years

@StephenC This is a great point but that doesn't show that this program will not work for. However, to be on the safer side, I added the length equality check. Can you provide the inputs for which this program wouldn't work?
greybeard over 9 years

This is about checking strings (or even Strings) for being anagrams, not about providing counter examples for implementations of approaches proven wrong. In Unicode, any letter from the latin alphabet has an encoding 32 bigger than that of its capitalisation. Repeat each 2^26 times, check with AnagramTest.isAnagram. With that out of the way, do somebody a favour: how many equivalence classes are there for strings of length two from an alphabet of size ten? There can be no injective function from a set into a smaller one, let alone from a (countably) infinite set into any finite set.
Stephen C over 9 years

@ripudam - If you are interested, you can generate an example for yourself. Start with a word (say "anagramatic") and iterate over all other alphabetic strings of the same length ... until you find one that has the same hashcode. Just a few billion iterations should be enough to find one. Once you have found one, the chances are that that string won't be an anagram of "anagramatic". But why bother ...
SuRa over 9 years

I am getting Invalid argument to operation ++/-- for the line lettersInWord1.get(c)++; lettersInWord1.get(c)--;
seriously divergent about 9 years

Can't we just use the character values instead of prime numbers, I think it will also work, isnt it?
chakming almost 9 years

@user2900314 I've add an answer to the 'AD'/'AC'-'BB' issue stackoverflow.com/questions/15045640/…
Ar5hv1r almost 9 years

Both String.indexOf() and String.replaceFirst() are O(n) operations where n is the length of the string. You can't just "ignore" certain function calls; this is an O(n^2) solution.
okello almost 9 years

You prompted me to do some research on this. And you are right. My solution has a time complexity of O(n^2). Whereas the String.indexOf(..) might be fast in practice, the String.replaceFirst() seems slower, especially in loops. It can be sped up, though, by using regex directly where a pre-compiled regex pattern is used in the loop. The String.replace*(..) seems to use regex under the hood and actually compiles a new regex pattern on every call. Thanks a lot for pointing this out.
Codebender almost 9 years

This wont work for strings "E" = 5 and "CD" = 3,4. Or for that matter any Pythagorean triplet.
chakming almost 9 years

@Codebender Length is checked before comparison
John over 8 years

Found a youtube video explaining the exact same code. youtube.com/watch?v=jjMIy3_ix-Q
Admin over 8 years

@user2900314 Key observation is that it must work for prime numbers. Anagrams will have the same "prime factorization".
chakming over 8 years

plus, int value of alphabetical char is hard to cause such possible cases
Makoto over 8 years

@John: Thanks for spotting that. I hadn't noticed or paid it much attention before, but that's good to know. Seems like it carries the same issues as pointed out prior, anyway.
mur7ay over 8 years

This should be the best answer.
Alex Salauyou over 8 years

How will it work? The first .get(c)++ will throw an NPE
maria nabil over 8 years

"this is untested, might be some syntax errors" I'll update it to be correct syntax
Paul Hankin about 8 years

I disagree with comments that this is O(N log N). Java's sort (for primitive types) is a quicksort variant, and quicksort runs in worst case O(N) time if the objects to be sorted are drawn from a finite set (such as values of type char).
Valentin Montmirail over 7 years

That's a beautiful use of Prime Numbers ;) congrats !
Rajavel D over 7 years

Is this code work properly poll & pool. It will return as true but its not.
Bill the Lizard about 7 years

This is clever, but not the fastest. Multiplication is not O(n).
miracle173 about 7 years

this is an awful post that shouldn't be upvoted (or even downvoted) as long it contains the claim it is the fastes algorithm. the question is a duplicated and there the accepted solutione is a fast one. But the idea with the primes is nice. A similar idea can be found in this video that solves a puzzle presented here: the instant insanity puzzle
localhost almost 7 years

Last iteration can be simplified: iterating over lettersInWord1.values() and check for each value for zero. Also, second iteration can be optimized by quick return (false) if !contains.
Amardeep Kumar almost 7 years

We can just use the ASCII value summation. If it matches then it is Anagram. check here: stackoverflow.com/a/45236219/1425181
Tarek over 6 years

just a quick little improvement, I'd add the following test, that checks if the two strings are equal (ignoring case), as that would make them, naturally, anagrams: if (a.equalsIgnoreCase(b)) return true.
Shanu Gupta about 6 years

Perfect O(n) and O(1) solution.
Karl over 5 years

If we compare the length of the strings at the start to make sure they are equal, can't we assume that the last iteration is not needed as long as we check for when the count in the second iteration goes below 0?
jkonst over 5 years

The cost of Arrays.sort is O(nlogn) > O(n)
Vhndaree over 5 years

any sum of the square of 3 numbers will never equal to the sum of the square of other 3 numbers. is it just your assumption or proved somewhere.
HASSAN MD TAREQ almost 5 years

How about using 2 Hashsets?
pkuderov about 4 years

The proposed method is wrong as it compares the sum of character codes of the strings. If two strings are anagrams, then they have the same sum, But the opposite is not true in general.
xrisk almost 4 years

Do not do this please. This is a party trick, not actual code to be used.
kcsquared over 2 years

This still doesn't work; "add" and "ebb" both have 29,409 as the sum of squares of characters.
phatfingers over 2 years

In Java, the "^" operator means XOR, so I don't think this code does what is intended.
chux - Reinstate Monica over 2 years

Code can fail when 1) encoding is not ASCII, 2) single letter occurrence exceeds INT_MAX, 3) Other locales with more than 26 letters or multiple case mappings to A-Z. Other than that, looks good.
chux - Reinstate Monica over 2 years

First 26 primes [2...97] need 7 bits, so a 64-bit integer product well handles up to a 9 A-Z letter word. Past that, extended math incurs its own cost past O(n).
chux - Reinstate Monica over 2 years

@AmardeepKumar ASCII value summation is insufficient. "03", "12" has same ASCII sum yet are not anagrams.