How to check if two words are anagrams
Solution 1
Fastest algorithm would be to map each of the 26 English characters to a unique prime number. Then calculate the product of the string. By the fundamental theorem of arithmetic, 2 strings are anagrams if and only if their products are the same.
Solution 2
Two words are anagrams of each other if they contain the same number of characters and the same characters. You should only need to sort the characters in lexicographic order, and determine if all the characters in one string are equal to and in the same order as all of the characters in the other string.
Here's a code example. Look into Arrays
in the API to understand what's going on here.
public boolean isAnagram(String firstWord, String secondWord) {
char[] word1 = firstWord.replaceAll("[\\s]", "").toCharArray();
char[] word2 = secondWord.replaceAll("[\\s]", "").toCharArray();
Arrays.sort(word1);
Arrays.sort(word2);
return Arrays.equals(word1, word2);
}
Solution 3
If you sort either array, the solution becomes O(n log n). but if you use a hashmap, it's O(n). tested and working.
char[] word1 = "test".toCharArray();
char[] word2 = "tes".toCharArray();
Map<Character, Integer> lettersInWord1 = new HashMap<Character, Integer>();
for (char c : word1) {
int count = 1;
if (lettersInWord1.containsKey(c)) {
count = lettersInWord1.get(c) + 1;
}
lettersInWord1.put(c, count);
}
for (char c : word2) {
int count = -1;
if (lettersInWord1.containsKey(c)) {
count = lettersInWord1.get(c) - 1;
}
lettersInWord1.put(c, count);
}
for (char c : lettersInWord1.keySet()) {
if (lettersInWord1.get(c) != 0) {
return false;
}
}
return true;
Solution 4
Here's a simple fast O(n) solution without using sorting or multiple loops or hash maps. We increment the count of each character in the first array and decrement the count of each character in the second array. If the resulting counts array is full of zeros, the strings are anagrams. Can be expanded to include other characters by increasing the size of the counts array.
class AnagramsFaster{
private static boolean compare(String a, String b){
char[] aArr = a.toLowerCase().toCharArray(), bArr = b.toLowerCase().toCharArray();
if (aArr.length != bArr.length)
return false;
int[] counts = new int[26]; // An array to hold the number of occurrences of each character
for (int i = 0; i < aArr.length; i++){
counts[aArr[i]-97]++; // Increment the count of the character at i
counts[bArr[i]-97]--; // Decrement the count of the character at i
}
// If the strings are anagrams, the counts array will be full of zeros
for (int i = 0; i<26; i++)
if (counts[i] != 0)
return false;
return true;
}
public static void main(String[] args){
System.out.println(compare(args[0], args[1]));
}
}
Solution 5
Lots of people have presented solutions, but I just want to talk about the algorithmic complexity of some of the common approaches:
-
The simple "sort the characters using
Arrays.sort()
" approach is going to beO(N log N)
. -
If you use radix sorting, that reduces to
O(N)
withO(M)
space, whereM
is the number of distinct characters in the alphabet. (That is 26 in English ... but in theory we ought to consider multi-lingual anagrams.) -
The "count the characters" using an array of counts is also
O(N)
... and faster than radix sort because you don't need to reconstruct the sorted string. Space usage will beO(M)
. -
A "count the characters" using a dictionary, hashmap, treemap, or equivalent will be slower that the array approach, unless the alphabet is huge.
-
The elegant "product-of-primes" approach is unfortunately
O(N^2)
in the worst case This is because for long-enough words or phrases, the product of the primes won't fit into along
. That means that you'd need to useBigInteger
, and N times multiplying aBigInteger
by a small constant isO(N^2)
.For a hypothetical large alphabet, the scaling factor is going to be large. The worst-case space usage to hold the product of the primes as a
BigInteger
is (I think)O(N*logM)
. -
A
hashcode
based approach is usuallyO(N)
if the words are not anagrams. If the hashcodes are equal, then you still need to do a proper anagram test. So this is not a complete solution.
I would also like to highlight that most of the posted answers assume that each code-point in the input string is represented as a single char
value. This is not a valid assumption for code-points outside of the BMP (plane 0); e.g. if an input string contains emojis.
The solutions that make the invalid assumption will probably work most of the time anyway. A code-point outside of the BMP will represented in the string as two char values: a low surrogate and a high surrogate. If the strings contain only one such code-point, we can get away with treating the char
values as if they were code-points. However, we can get into trouble when the strings being tested contain 2 or more code-points. Then the faulty algorithms will fail to distinguish some cases. For example, [SH1, SL1, SH2, SL2]
versus [SH1, SL2, SH2, SL1]
where the SH<n>
and SL<2>
denote high and low surrogates respectively. The net result will be false anagrams.
Alex Salauyou's answer gives a couple of solutions that will work for all valid Unicode code-points.
Chilli
Updated on July 25, 2022Comments
-
Chilli almost 2 years
I have a program that shows you whether two words are anagrams of one another. There are a few examples that will not work properly and I would appreciate any help, although if it were not advanced that would be great, as I am a 1st year programmer. "schoolmaster" and "theclassroom" are anagrams of one another, however when I change "theclassroom" to "theclafsroom" it still says they are anagrams, what am I doing wrong?
import java.util.ArrayList; public class AnagramCheck { public static void main(String args[]) { String phrase1 = "tbeclassroom"; phrase1 = (phrase1.toLowerCase()).trim(); char[] phrase1Arr = phrase1.toCharArray(); String phrase2 = "schoolmaster"; phrase2 = (phrase2.toLowerCase()).trim(); ArrayList<Character> phrase2ArrList = convertStringToArraylist(phrase2); if (phrase1.length() != phrase2.length()) { System.out.print("There is no anagram present."); } else { boolean isFound = true; for (int i = 0; i < phrase1Arr.length; i++) { for (int j = 0; j < phrase2ArrList.size(); j++) { if (phrase1Arr[i] == phrase2ArrList.get(j)) { System.out.print("There is a common element.\n"); isFound =; phrase2ArrList.remove(j); } } if (isFound == false) { System.out.print("There are no anagrams present."); return; } } System.out.printf("%s is an anagram of %s", phrase1, phrase2); } } public static ArrayList<Character> convertStringToArraylist(String str) { ArrayList<Character> charList = new ArrayList<Character>(); for (int i = 0; i < str.length(); i++) { charList.add(str.charAt(i)); } return charList; } }
-
MPogoda over 11 years@jb. I was just fixing logical issues
-
Chilli over 11 yearsThanks so much for that! Solved all my problems straight away! :D
-
Chilli over 11 yearsBrilliant help! Thanks for that! :D
-
Admin about 11 yearsAFAIK sorting twice is N(Log N)*N(Log N). This algorithm requires NN. Also, the second loop is not always executed which brings it average between N and NN. I did some rough testing with System.nanoTime() time diff, which shows that sorting is slower. Maybe I am overlooking something....
-
Admin about 11 yearscorrection: O(N LogN) + O(N Log N) vs O(N) < performance < O(N+N)
-
vonbrand about 11 yearsIt is $O(N \log N)$ versus $O(N^2)$, actually.
-
Admin about 11 yearsMaybe you can help me out why this is the case and why performance is slower for sort. I looked at Skiena's algorithm design manual, and agree with NlogN. The manual states that O(f(n)) +O(g(n)) = O(max(f(n),g(n))). Hence, the result becomes O(NlogN) vs N? Maybe it is me, but I do not see why this algorithm is N^2 as I do not see any nested loops unless they are hidden somewhere in JAVA? Also, why does it have lower CPU time?
-
Dejell about 11 yearsWhy does it answer the question?
-
assylias about 11 yearsFast probably but fastest probably not. In that case why not simply use two
char[26]
? And it has the obvious caveat that it only works for English. -
SeriousBusiness about 11 yearsshould work for all characters. Also it's O(n), which is faster than sorting.
-
le-doude almost 11 yearsYour algorithm runs in O(NlogN) (N is the lenght of both the strings together) there is a O(N) algorithm using O(N/2)~O(N) memory. See my answer
-
Koray Tugay over 10 yearsI think you should make all the characters to lowercase as well.
-
Koray Tugay over 10 yearsTwo phrases are anagrams if they are permutations of each other, ignoring spaces and capitalization. cs.duke.edu/csed/algoprobs/anagram.html
-
tucuxi over 10 yearsFor a long phrase you would quickly overflow 64 bits, and BigDecimals are /slow/; an int[26] is a much safer (and faster) bet.
-
tucuxi over 10 yearsYou need to specify <Character, Integer> or it will not compile. Autoboxing can take care of the rest
-
Eric Woodruff over 10 yearsWhy not first compare the lengths of each word before sorting them?
-
Makoto over 10 years@EricWoodruff: This probably wasn't intended as a 100% complete example; there are obvious optimizations in comparing lengths to begin with as well as forcing everything to be the same case. This was meant to be illustrative as opposed to cut-and-paste valid code.
-
Shubhankar Raj about 10 yearsCan you kindly clarify the segment for(char c : word1){ lettersInWord1.get(c)++; } for(char c : word2) { lettersInWord1.get(c)--; }
-
maria nabil about 10 yearsI'm adding up how many letters of each type are in the first word, "1 a, 2 b's, 6 d's" etc. Then I subtract the count of how many letters of each type are in the second word. If the letters are anagrams, they have the same number of letters of the same type. So if the map has all zero's, they are anagrams.
-
anshulkatta almost 10 yearsfor every put in the hashmap , hashcode and equals are called...is this really good for big strings...i think sort will be much faster than this in that case
-
anshulkatta almost 10 yearsbut the map is empty , you have not put the characters in
-
Zoran Horvat almost 10 yearsSolution with hash set takes O(n) time and O(m) additional space, where m is total number of distinct characters in two strings. Basically, m should not exceed 100 or so if strings are plain text. This means that hash set solution reduces asymptotic execution time from O(nlogn) to O(n) and reduces added space from O(n) to practically O(1). This should have significant impact in large strings.
-
greybeard almost 10 yearsIs "ac" an anagram of "bb"? Also, how many hash codes are there, and how many combinations of characters?
-
Ripu Daman almost 10 yearsI got your point. I did the little modification. It is generating the hash codes at runtime for any combination of string of unicode characters.
-
greybeard almost 10 yearsWhile your overload of
equals
does implement an equivalence relation, it throws onnull == obj
instead of returningfalse
. Worse, it does not implement equality wrt anagrams. Take strings over an alphabet with just 10 characters of numerical value 1 to 10 (to facilitate "multiplicative hashing"). Let your hashcode be one decimal digit, and consider strings of two characters, only: how many equivalence classes are there? Once you have more classes than codes, no amount of cleverness in encoding will save your day. -
Stephen C over 9 yearsThis approach can only do the "not an anagram" test fast. If two words / phrases have the same hashcode, you still have to do a longer test to see if they are algorithms.
-
Ripu Daman over 9 years@greybeard I didn't implement checks for null here because that wasn't the intent here. I was mainly focusing on finding the anagram easily. However I didn't understand your other comment.
-
Ripu Daman over 9 years@StephenC As per this approach, two strings will have same hash code only if they are anagram.
-
Stephen C over 9 years@ripudam - That is incorrect. In fact, it is simple to prove that it is incorrect. There 2^32 possible values for an
int
hashcode. The proof follows directly from that. -
Ripu Daman over 9 years@StephenC This is a great point but that doesn't show that this program will not work for. However, to be on the safer side, I added the length equality check. Can you provide the inputs for which this program wouldn't work?
-
greybeard over 9 yearsThis is about checking strings (or even
String
s) for being anagrams, not about providing counter examples for implementations of approaches proven wrong. In Unicode, any letter from the latin alphabet has an encoding 32 bigger than that of its capitalisation. Repeat each 2^26 times, check withAnagramTest.isAnagram
. With that out of the way, do somebody a favour: how many equivalence classes are there for strings of length two from an alphabet of size ten? There can be no injective function from a set into a smaller one, let alone from a (countably) infinite set into any finite set. -
Stephen C over 9 years@ripudam - If you are interested, you can generate an example for yourself. Start with a word (say "anagramatic") and iterate over all other alphabetic strings of the same length ... until you find one that has the same hashcode. Just a few billion iterations should be enough to find one. Once you have found one, the chances are that that string won't be an anagram of "anagramatic". But why bother ...
-
SuRa over 9 yearsI am getting Invalid argument to operation ++/-- for the line
lettersInWord1.get(c)++; lettersInWord1.get(c)--;
-
seriously divergent about 9 yearsCan't we just use the character values instead of prime numbers, I think it will also work, isnt it?
-
chakming almost 9 years@user2900314 I've add an answer to the 'AD'/'AC'-'BB' issue stackoverflow.com/questions/15045640/…
-
Ar5hv1r almost 9 yearsBoth
String.indexOf()
andString.replaceFirst()
are O(n) operations where n is the length of the string. You can't just "ignore" certain function calls; this is an O(n^2) solution. -
okello almost 9 yearsYou prompted me to do some research on this. And you are right. My solution has a time complexity of O(n^2). Whereas the
String.indexOf(..)
might be fast in practice, theString.replaceFirst()
seems slower, especially in loops. It can be sped up, though, by using regex directly where a pre-compiled regex pattern is used in the loop. TheString.replace*(..)
seems to use regex under the hood and actually compiles a new regex pattern on every call. Thanks a lot for pointing this out. -
Codebender almost 9 yearsThis wont work for strings
"E" = 5
and"CD" = 3,4
. Or for that matter any Pythagorean triplet. -
chakming almost 9 years@Codebender Length is checked before comparison
-
John over 8 yearsFound a youtube video explaining the exact same code. youtube.com/watch?v=jjMIy3_ix-Q
-
Admin over 8 years@user2900314 Key observation is that it must work for prime numbers. Anagrams will have the same "prime factorization".
-
chakming over 8 yearsplus, int value of alphabetical char is hard to cause such possible cases
-
Makoto over 8 years@John: Thanks for spotting that. I hadn't noticed or paid it much attention before, but that's good to know. Seems like it carries the same issues as pointed out prior, anyway.
-
mur7ay over 8 yearsThis should be the best answer.
-
Alex Salauyou over 8 yearsHow will it work? The first
.get(c)++
will throw an NPE -
maria nabil over 8 years"this is untested, might be some syntax errors" I'll update it to be correct syntax
-
Paul Hankin about 8 yearsI disagree with comments that this is O(N log N). Java's sort (for primitive types) is a quicksort variant, and quicksort runs in worst case O(N) time if the objects to be sorted are drawn from a finite set (such as values of type char).
-
Valentin Montmirail over 7 yearsThat's a beautiful use of Prime Numbers ;) congrats !
-
Rajavel D over 7 yearsIs this code work properly poll & pool. It will return as true but its not.
-
Bill the Lizard about 7 yearsThis is clever, but not the fastest. Multiplication is not O(n).
-
miracle173 about 7 yearsthis is an awful post that shouldn't be upvoted (or even downvoted) as long it contains the claim it is the fastes algorithm. the question is a duplicated and there the accepted solutione is a fast one. But the idea with the primes is nice. A similar idea can be found in this video that solves a puzzle presented here: the instant insanity puzzle
-
localhost almost 7 yearsLast iteration can be simplified: iterating over lettersInWord1.values() and check for each value for zero. Also, second iteration can be optimized by quick return (false) if !contains.
-
Amardeep Kumar almost 7 yearsWe can just use the ASCII value summation. If it matches then it is Anagram. check here: stackoverflow.com/a/45236219/1425181
-
Tarek over 6 yearsjust a quick little improvement, I'd add the following test, that checks if the two strings are equal (ignoring case), as that would make them, naturally, anagrams: if (a.equalsIgnoreCase(b)) return true.
-
Shanu Gupta about 6 yearsPerfect
O(n)
andO(1)
solution. -
Karl over 5 yearsIf we compare the length of the strings at the start to make sure they are equal, can't we assume that the last iteration is not needed as long as we check for when the count in the second iteration goes below 0?
-
jkonst over 5 yearsThe cost of Arrays.sort is O(nlogn) > O(n)
-
Vhndaree over 5 yearsany sum of the square of 3 numbers will never equal to the sum of the square of other 3 numbers. is it just your assumption or proved somewhere.
-
HASSAN MD TAREQ almost 5 yearsHow about using 2 Hashsets?
-
pkuderov about 4 yearsThe proposed method is wrong as it compares the sum of character codes of the strings. If two strings are anagrams, then they have the same sum, But the opposite is not true in general.
-
xrisk almost 4 yearsDo not do this please. This is a party trick, not actual code to be used.
-
kcsquared over 2 yearsThis still doesn't work; "add" and "ebb" both have 29,409 as the sum of squares of characters.
-
phatfingers over 2 yearsIn Java, the "^" operator means XOR, so I don't think this code does what is intended.
-
chux - Reinstate Monica over 2 yearsCode can fail when 1) encoding is not ASCII, 2) single letter occurrence exceeds
INT_MAX
, 3) Other locales with more than 26 letters or multiple case mappings to A-Z. Other than that, looks good. -
chux - Reinstate Monica over 2 yearsFirst 26 primes [2...97] need 7 bits, so a 64-bit integer product well handles up to a 9 A-Z letter word. Past that, extended math incurs its own cost past O(n).
-
chux - Reinstate Monica over 2 years@AmardeepKumar ASCII value summation is insufficient. "03", "12" has same ASCII sum yet are not anagrams.