How to compare Unicode characters that "look alike"?

Solution 1

In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider if it's appropriate for your use case — see Jukka K. Korpela's comment.

For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:

Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)

This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.

So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        char first = 'μ';
        char second = 'µ';

        // Technically you only need to normalize U+00B5 to obtain U+03BC, but
        // if you're unsure which character is which, you can safely normalize both
        string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
        string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);

        Console.WriteLine(first.Equals(second));                     // False
        Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
    }
}

For details on Unicode normalization and the different normalization forms refer to System.Text.NormalizationForm and the Unicode spec.
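
To see why the compatibility forms (KC/KD) are needed here rather than the canonical forms (C/D), the following minimal sketch compares the two. The helper names `CanonicalEquals` and `CompatEquals` are hypothetical and are not part of the original answer or of .NET; only `string.Normalize` and `NormalizationForm` come from the framework.

```csharp
using System;
using System.Text;

static class UnicodeCompare
{
    // Hypothetical helper: ordinal comparison after canonical composition (NFC).
    internal static bool CanonicalEquals(string a, string b) =>
        string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);

    // Hypothetical helper: ordinal comparison after compatibility composition (NFKC).
    internal static bool CompatEquals(string a, string b) =>
        string.Equals(
            a.Normalize(NormalizationForm.FormKC),
            b.Normalize(NormalizationForm.FormKC),
            StringComparison.Ordinal);
}

class Program
{
    static void Main()
    {
        string mu = "\u03BC";    // GREEK SMALL LETTER MU
        string micro = "\u00B5"; // MICRO SIGN

        // The micro sign only has a <compat> decomposition, so canonical
        // normalization leaves it alone and the strings stay different.
        Console.WriteLine(UnicodeCompare.CanonicalEquals(mu, micro)); // False
        Console.WriteLine(UnicodeCompare.CompatEquals(mu, micro));    // True

        // Canonical equivalence covers cases such as a precomposed "é"
        // versus "e" followed by U+0301 COMBINING ACUTE ACCENT.
        Console.WriteLine(UnicodeCompare.CanonicalEquals("\u00E9", "e\u0301")); // True
    }
}
```

Normalizing once and storing the result is likely cheaper than calling a helper like this for every comparison when many strings are involved.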

Solution 2

They really are different symbols, even though they look the same: the first is the actual Greek letter and has character code 956 (0x3BC), and the second is the micro sign and has character code 181 (0xB5).

So if you want to compare them and you need them to be equal, you need to handle it manually, or replace one char with another before comparison. Or use the following code:

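using System;
using System.Globalization;
using System.Text;

class Program
{
    public static void Main()
    {
        var s1 = "μ"; // U+03BC GREEK SMALL LETTER MU
        var s2 = "µ"; // U+00B5 MICRO SIGN

        Console.WriteLine(s1.Equals(s2));  // False
        Console.WriteLine(RemoveDiacritics(s1).Equals(RemoveDiacritics(s2))); // True
    }

    // FormKC maps the micro sign to mu. Note that FormKC also composes, so
    // precomposed letters such as é keep their accent; only combining marks
    // that remain separate after normalization are dropped.
    static string RemoveDiacritics(string text)
    {
        var normalizedString = text.Normalize(NormalizationForm.FormKC);
        var stringBuilder = new StringBuilder();

        foreach (var c in normalizedString)
        {
            var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
            if (unicodeCategory != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }

        return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
    }
}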

Solution 3

The two characters have different character codes:

Console.WriteLine((int)'μ');  //956
Console.WriteLine((int)'µ');  //181

Their details are:

Display     Friendly Code   Decimal Code    Hex Code    Description
====================================================================
μ           &mu;            &#956;          &#x3BC;     Lowercase Mu
µ           &micro;         &#181;          &#xB5;      micro sign Mu

Solution 4

For the specific example of μ (mu) and µ (micro sign), the latter has a compatibility decomposition to the former, so you can normalize the string to FormKC or FormKD to convert the micro signs to mus.

However, there are lots of sets of characters that look alike but aren't equivalent under any Unicode normalization form. For example, A (Latin), Α (Greek), and А (Cyrillic). The Unicode website has a confusables.txt file with a list of these, intended to help developers guard against homograph attacks. If necessary, you could parse this file and build a table for “visual normalization” of strings.
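
As a rough illustration of that last suggestion, here is a minimal sketch of parsing lines in the `source ; target ; type # comment` layout used by confusables.txt and applying the resulting table. The class and method names (`Confusables`, `Parse`, `Skeleton`) are hypothetical, and the two sample lines are written in the file's format for illustration, so check them against the actual file. The fuller "skeleton" procedure described in Unicode Technical Standard #39 also involves normalization, which this sketch omits.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class Confusables
{
    // Builds a table from source code point to replacement string.
    internal static Dictionary<int, string> Parse(IEnumerable<string> lines)
    {
        var map = new Dictionary<int, string>();
        foreach (var raw in lines)
        {
            // Drop the trailing comment, then skip blank lines.
            int hash = raw.IndexOf('#');
            string line = (hash >= 0 ? raw.Substring(0, hash) : raw).Trim();
            if (line.Length == 0) continue;

            string[] fields = line.Split(';');
            if (fields.Length < 2) continue;

            // Field 0 is one hex code point; field 1 is one or more.
            int source = int.Parse(fields[0].Trim(), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            var target = new StringBuilder();
            foreach (var hex in fields[1].Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
            {
                int cp = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
                target.Append(char.ConvertFromUtf32(cp));
            }
            map[source] = target.ToString();
        }
        return map;
    }

    // Replaces every mapped code point; unmapped text is copied unchanged.
    internal static string Skeleton(string text, Dictionary<int, string> map)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < text.Length; i++)
        {
            int start = i;
            int cp = text[i];
            if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                cp = char.ConvertToUtf32(text[i], text[i + 1]);
                i++; // the pair occupies two UTF-16 code units
            }

            if (map.TryGetValue(cp, out var mapped)) sb.Append(mapped);
            else sb.Append(text, start, i - start + 1);
        }
        return sb.ToString();
    }
}

class Program
{
    static void Main()
    {
        // Illustrative sample lines, not a verbatim excerpt of the file.
        string[] sample =
        {
            "0391 ;\t0041 ;\tMA\t# GREEK CAPITAL LETTER ALPHA -> LATIN CAPITAL LETTER A",
            "0410 ;\t0041 ;\tMA\t# CYRILLIC CAPITAL LETTER A -> LATIN CAPITAL LETTER A",
        };
        var map = Confusables.Parse(sample);

        string greek = Confusables.Skeleton("\u0391", map);
        string cyrillic = Confusables.Skeleton("\u0410", map);
        Console.WriteLine(greek == "A" && cyrillic == "A"); // True
    }
}
```

Two strings would then count as visually equivalent when their mapped forms are equal, which is a much looser test than any normalization form and should only be used where that is the intent.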

Solution 5

Look up both characters in a Unicode database and compare the entries.

One is the Micro Sign µ (U+00B5) and the other is the Greek Small Letter Mu μ (U+03BC).

Name            : MICRO SIGN
Block           : Latin-1 Supplement
Category        : Letter, Lowercase [Ll]
Combine         : 0
BIDI            : Left-to-Right [L]
Decomposition   : <compat> GREEK SMALL LETTER MU (U+03BC)
Mirror          : N
Index entries   : MICRO SIGN
Upper case      : U+039C
Title case      : U+039C
Version         : Unicode 1.1.0 (June, 1993)

Name            : GREEK SMALL LETTER MU
Block           : Greek and Coptic
Category        : Letter, Lowercase [Ll]
Combine         : 0
BIDI            : Left-to-Right [L]
Mirror          : N
Upper case      : U+039C
Title case      : U+039C
See Also        : micro sign U+00B5
Version         : Unicode 1.1.0 (June, 1993)
Author by

D J

Updated on June 03, 2022

Comments

  • D J
    D J about 2 years

    I ran into a surprising issue.

    I loaded a text file in my application and I have some logic which compares the value having µ.

    And I realized that even when the texts look the same, the comparison returns false.

     Console.WriteLine("μ".Equals("µ")); // returns false
     Console.WriteLine("µ".Equals("µ")); // returns true
    

    In the second line, the character µ is copy-pasted.

    However, these might not be the only characters that are like this.

    Is there any way in C# to compare the characters which look the same but are actually different?

  • user2864740
    user2864740 over 10 years
    Thanks for the Unicode spec link. First time I ever read up on it. Small note from it: "Normalization Forms KC and KD must not be blindly applied to arbitrary text .. It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate."
  • user2864740
    user2864740 over 10 years
    Definitely good to know when using Normalize. It seems surprising that they remain distinct.
  • BudBrot
    BudBrot over 10 years
    Strange, but it works... I mean, they are two different chars with different meanings, and converting them to upper case makes them equal? I don't see the logic, but nice solution +1
  • supercat
    supercat over 10 years
    @user2864740: If an uppercase Greek tau didn't remain distinct from a Roman letter T, it would be very difficult to have Greek and Roman text sort sensibly into alphabetic order. Further, if a typeface were to use a different visual style for Greek and Roman letters, it would be very distracting if the Greek letters whose shapes resembled Roman letters were rendered differently from those which didn't.
  • Andrew Leach
    Andrew Leach over 10 years
    This solution masks the problem, and could cause issues in a general case. This sort of test would find that "m".ToUpper().Equals("µ".ToUpper()); and "M".ToUpper().Equals("µ".ToUpper()); are also true. This may not be desirable.
  • dan04
    dan04 over 10 years
    More importantly, unifying the European alphabets would make ToUpper / ToLower difficult to implement. You'd need to have "B".ToLower() be b in English but β in Greek and в in Russian. As it is, only Turkish (dotless i) and a couple of other languages need casing rules different from the default.
  • Konrad Rudolph
    Konrad Rudolph over 10 years
    -1 – this is a terrible idea. Do not work with Unicode like this.
  • svenv
    svenv over 10 years
    Instead of ToUpper()-based tricks, why not use String.Equals("μ", "μ", StringComparison.CurrentCultureIgnoreCase)?
  • MartinHaTh
    MartinHaTh over 10 years
    Out of curiosity, what is the reasoning for having two µ symbols? You don't see a dedicated K with the name "Kilo sign" (or do you?).
  • BoltClock
    BoltClock over 10 years
    @MartinHaTh: According to Wikipedia, it's "for historical reasons".
  • VGR
    VGR over 10 years
    Unicode has a lot of compatibility characters brought over from older character sets (like ISO 8859-1), to make conversion from those character sets easier. Back when character sets were constrained to 8 bits, they would include a few glyphs (like some Greek letters) for the most common math and scientific uses. Glyph reuse based on appearance was common, so no specialized 'K' was added. But it was always a workaround; the correct symbol for "micro" is the actual Greek lowercase mu, the correct symbol for Ohm is the actual capital omega, and so on.
  • Tanner - reinstate LGBT people
    Tanner - reinstate LGBT people over 10 years
    Are the "micro sign" and the lowercase mu character canonically equivalent? Using canonical normalization would give you a more strict comparison.
  • Oliver Hallam
    Oliver Hallam over 10 years
    Although there is a specialized K for Kelvin (temperature)
  • hippietrail
    hippietrail over 10 years
    @TannerL.Swett: Actually I'm not even sure how to check that off the top of my head ...
  • paulm
    paulm over 10 years
    Nothing better than when something is done for hysterical raisins
  • D J
    D J over 10 years
    Actually, I was importing a file with physics formula. You are right about normalization. I have to go through it more deeply..
  • hippietrail
    hippietrail over 10 years
    What kind of file? Something hand-made in plain Unicode text by a person? Or something output by an app in a specific format?
  • Admin
    Admin over 10 years
    Is there a special K for cereal?
  • Greg
    Greg over 10 years
    There is one good reason to distinguish between "MICRO SIGN" and "GREEK SMALL LETTER MU" - to say that "uppercase" of micro sign is still micro sign. But capitalization changes micro to mega, happy engineering.
  • dbw
    dbw over 10 years
    @Greg Great one: capitalization of MICRO changes it to MEGA (924).
  • dbw
    dbw over 10 years
    @Pengu There is always logic behind what happens in a computer; nothing is unknown. The logic here is that both are converted to their defined uppercase letter, which points to 'M' (924, MEGA), as the symbols are known as mu and micro.
  • Chris W. Rea
    Chris W. Rea over 10 years
    A special case of micro-optimization.
  • Konerak
    Konerak over 10 years
    How did this get 37 upvotes? It does not answer the question ("How to compare unicode characters"), it just comments on why this particular example is not equal. At best, it should be a comment on the question. I understand comment formatting options do not allow to post it as nicely as answer formatting options do, but that should not be a valid reason to post as an answer.
  • Subin Jacob
    Subin Jacob over 10 years
    Actually, the question was a different one, asking why the μ and µ equality check returns false. This answer answers it. Later the OP asked another question (this question) about how to compare two characters that look alike. Both questions had best answers, and later one of the moderators merged both questions, selecting the best answer of the second one as best. Someone edited this question, so that it will summarize
  • Subin Jacob
    Subin Jacob over 10 years
    Actually, I didn't add any content after the merge
  • AncientSwordRage
    AncientSwordRage over 10 years
    @MartinHaTh imagine in 60 years people decide that the greek alphabet is not international enough, or a particular country decides that they don't want to use foreign characters for scientific notation (say they replace µ with 小) you can now change your encoding without breaking everything and having to remap for greek characters. In short Micro is encoded as Micro, whether it's displayed as a Mu or not.
  • supercat
    supercat almost 10 years
    @Pureferret: I wonder what would have happened if Unicode had defined characters for "decimal unity point" and "visual digit separator", and the visual appearance of those characters was controlled by the user's locale. Then a number which was formatted using those characters would display properly in many locales, and--more importantly--could be unambiguously converted back to a number in any locale, even if it was formatted in a different one.
  • supercat
    supercat almost 10 years
    @dan04: I wonder if anyone ever considered assigning unique code points to all four variations of the Turkish "i" and "I"? That would have eliminated any ambiguity in the behavior of toUpper/toLower.
  • Mr Lister
    Mr Lister over 6 years
    I will always wonder why Unicode doesn't have a MATHEMATICAL SYMBOL PI for historical reasons. This would have been a numerical symbol, but instead we have to use a Greek letter as a workaround.
  • Elmue
    Elmue over 3 years
    This answer is nonsense. If you have a list of hundreds of string this will be EXTREMELY slow.