How can non-ASCII characters be removed from a string?

Solution 1

This replaces every non-ASCII character with an empty string:

String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");
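
As a minimal sketch, the one-liner above can be wrapped in a small helper and run against the strings from the question. The class and method names (`AsciiFilter`, `stripNonAscii`, `keepPrintableAscii`) are made up for illustration; the second method uses the `[^\\x20-\\x7E]` pattern suggested in the comments for also dropping control characters.

```java
final class AsciiFilter {

    // Removes every char outside U+0000..U+007F; nothing is substituted in its place.
    static String stripNonAscii(String input) {
        return input.replaceAll("[^\\x00-\\x7F]", "");
    }

    // Stricter variant: keeps only printable ASCII (space through tilde),
    // so tabs, newlines and other control characters are removed as well.
    static String keepPrintableAscii(String input) {
        return input.replaceAll("[^\\x20-\\x7E]", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its encoding.
        System.out.println(stripNonAscii("A fun\u00e7\u00e3o"));   // A funo
        System.out.println(stripNonAscii("\u00c3ugent"));          // ugent
        System.out.println(keepPrintableAscii("tab\there"));       // tabhere
    }
}
```

Note that the characters are deleted, not replaced by a space, which is the behaviour the question asks for.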

Solution 2

FailedDev's answer is good, but it can be improved. If you want to preserve the ASCII equivalents, you need to normalize first:

String subjectString = "öäü";
subjectString = Normalizer.normalize(subjectString, Normalizer.Form.NFD);
String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");

=> will produce "oau"

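That way, characters like "öäü" will be mapped to "oau", which at least preserves some information. NFD decomposes each accented letter into its base letter followed by a combining mark, and the regex then removes only the mark. Without normalization, the resulting String will be empty.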
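
A runnable sketch of the normalize-then-strip approach, assuming a hypothetical helper `foldToAscii` that takes the normalization form as a parameter. Passing the form makes it easy to compare NFD with NFKD, which a comment recommends for ligatures.

```java
import java.text.Normalizer;

final class AsciiFolding {

    // Decomposes the input with the given form, then drops whatever is still non-ASCII.
    // With NFD, U+00F6 becomes 'o' followed by U+0308, and only the mark is removed.
    static String foldToAscii(String input, Normalizer.Form form) {
        String decomposed = Normalizer.normalize(input, form);
        return decomposed.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        String umlauts = "\u00f6\u00e4\u00fc"; // the sample string, written with escapes
        System.out.println(foldToAscii(umlauts, Normalizer.Form.NFD));      // oau

        // U+FB01 (the "fi" ligature) only has a compatibility decomposition.
        System.out.println(foldToAscii("\uFB01le", Normalizer.Form.NFD));   // le
        System.out.println(foldToAscii("\uFB01le", Normalizer.Form.NFKD));  // file
    }
}
```

Characters that have no decomposition at all, such as U+00DF (sharp s), are still removed outright by either form.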

Solution 3

This would be the Unicode solution:

String s = "A função, Ãugent";
String r = s.replaceAll("\\P{InBasic_Latin}", "");

\p{InBasic_Latin} is the Unicode block that contains all characters in the Unicode range U+0000..U+007F (see regular-expression.info)

\P{InBasic_Latin} is the negated \p{InBasic_Latin}
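
A small sketch of the block-based pattern, compiled once and reused. The names `BasicLatinFilter` and `keepBasicLatin` are illustrative only. The last line checks that the result matches the `[^\\p{ASCII}]` pattern mentioned in the comments for this input.

```java
import java.util.regex.Pattern;

final class BasicLatinFilter {

    // Matches any character outside the Basic Latin block (U+0000..U+007F).
    private static final Pattern NON_BASIC_LATIN = Pattern.compile("\\P{InBasic_Latin}");

    static String keepBasicLatin(String input) {
        return NON_BASIC_LATIN.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String s = "A fun\u00e7\u00e3o, \u00c3ugent";
        System.out.println(keepBasicLatin(s)); // A funo, ugent
        System.out.println(keepBasicLatin(s).equals(s.replaceAll("[^\\p{ASCII}]", ""))); // true
    }
}
```

Precompiling the pattern avoids recompiling the regex on every call, which `String.replaceAll` would otherwise do.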

Solution 4

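You can try something like this. In Latin-1 (ISO-8859-1), the accented letters start at code 192 (U+00C0), so you can skip every character at or above that value. Note that this check keeps the non-ASCII characters in the range 128 to 191.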

String name = "A função";

StringBuilder result = new StringBuilder();
for(char val : name.toCharArray()) {
    if(val < 192) result.append(val);
}
System.out.println("Result "+result.toString());
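
The comments question the 192 threshold and point to 128 as the end of the ASCII table. A sketch of the same loop with that cut-off, wrapped in a hypothetical `keepAscii` helper, could look like this:

```java
final class AsciiLoopFilter {

    // Keeps only chars below 128, i.e. the ASCII range, without assuming Latin-1.
    static String keepAscii(String input) {
        StringBuilder result = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 128) {
                result.append(c);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepAscii("A fun\u00e7\u00e3o")); // A funo
        // U+00A9 (copyright sign, code 169) passes a "< 192" check but not "< 128".
        System.out.println(keepAscii("\u00a92011"));         // 2011
    }
}
```

A plain loop like this also avoids the regex engine entirely, which is the performance point raised in the comments.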

Solution 5

[Updated solution]

Normalizer (canonical decomposition) can be combined with replaceAll to strip the diacritical marks, which leaves the corresponding base characters.

import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public final class NormalizeUtils {

    public static String normalizeASCII(final String string) {
        final String normalize = Normalizer.normalize(string, Form.NFD);

        return Pattern.compile("\\p{InCombiningDiacriticalMarks}+")
                      .matcher(normalize)
                      .replaceAll("");
    } ...
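
The listing above is cut off in the source. As a separate, self-contained sketch of the same idea (not the author's full class), the following hypothetical `DiacriticStripper` decomposes the input and removes only the combining marks:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class DiacriticStripper {

    private static final Pattern COMBINING_MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Decomposes with NFD, then deletes only the combining marks (U+0300..U+036F).
    // Any other non-ASCII character is left in place.
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("A fun\u00e7\u00e3o")); // A funcao
        // U+00DF has no decomposition and is not a combining mark, so it survives.
        System.out.println(stripDiacritics("Stra\u00dfe").equals("Stra\u00dfe")); // true
    }
}
```

Unlike the `[^\\x00-\\x7F]` approach, this does not guarantee pure ASCII output. If that is required, a final ASCII filter would still be needed.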

Author: rahulsri

Updated on November 28, 2021

Comments

  • rahulsri
    rahulsri over 2 years

    I have strings "A função", "Ãugent" in which I need to replace characters like ç, ã, and à with empty strings.

    How can I remove those non-ASCII characters from my string?

    I have attempted to implement this using the following function, but it is not working properly. One problem is that the unwanted characters are getting replaced by the space character.

    public static String matchAndReplaceNonEnglishChar(String tmpsrcdta) {
        String newsrcdta = null;
        char array[] = Arrays.stringToCharArray(tmpsrcdta);
        if (array == null)
            return newsrcdta;
    
        for (int i = 0; i < array.length; i++) {
            int nVal = (int) array[i];
            boolean bISO =
                    // Is character ISO control
                    Character.isISOControl(array[i]);
            boolean bIgnorable =
                    // Is Ignorable identifier
                    Character.isIdentifierIgnorable(array[i]);
            // Remove tab and other unwanted characters..
            if (nVal == 9 || bISO || bIgnorable)
                array[i] = ' ';
            else if (nVal > 255)
                array[i] = ' ';
        }
        newsrcdta = Arrays.charArrayToString(array);
    
        return newsrcdta;
    }
    
  • rahulsri
    rahulsri over 12 years
    Thanks for the response, but this "A" is still not replaced with an empty string.
  • FailedDev
    FailedDev over 12 years
    @rahulsri A is a perfectly valid ASCII character. Why should it be replaced?
  • stema
    stema over 12 years
    Why do you check against 192 and not 128 (the end of the ASCII table)? You are assuming a certain encoding (I think ISO-8859-1), but what if the encoding is ISO-8859-2/3/4/5/7...? There are letters in that area of the table.
  • rahulsri
    rahulsri over 12 years
    @Dev I think it is not visible, but this is a Latin character whose Unicode value is "\u00c3".
  • FailedDev
    FailedDev over 12 years
    @rahulsri Can you please edit your question to include the character that cannot be replaced?
  • FailedDev
    FailedDev over 12 years
    @rahulsri \u00c3 == Ã and yes, it is replaced. You have something wrong elsewhere.
  • Admin
    Admin over 12 years
    Yes, it depends on the number of characters we want to allow as well as on the encoding. This is just an example. We can add conditions based on the required characters and encoding.
  • Zouppen
    Zouppen over 11 years
    Most likely you want to strip non-printable and control characters, too. In that case you would use the following regexp: "[^\\x20-\\x7E]" Or simply: "[^ -~]"
  • sidgeon smythe
    sidgeon smythe over 10 years
    (Note to anyone confused like me: the uppercase \P is negation.)
  • stema
    stema over 9 years
    @user1187719, you could be more precise than "This does not work". This answer has already received some upvotes, so it cannot be completely useless. Of course, if you are on a Java version before Java 7, then I agree: Unicode in regex does not work there.
  • Saket
    Saket over 9 years
    Your answer is good, but it can be improved. Replacing the regex in your code with a for loop is much faster (20-40x). More here: stackoverflow.com/a/15191508/2511884
  • Michael Böckling
    Michael Böckling over 9 years
    Thanks for the hint. The extent of the difference in performance was unexpected.
  • Entropy
    Entropy over 9 years
    @stema - I ran it in Java 6, so your Java 7 theory holds water.
  • AL̲̳I
    AL̲̳I almost 8 years
    It removes the special characters; it does not replace them with their ASCII equivalents.
  • stema
    stema almost 8 years
    @Ali, yes, you understood my answer exactly. This is what was asked for 5 years ago. If it is not what you need, go with Michael Böckling's answer.
  • chesterm8
    chesterm8 over 7 years
    You probably want to use Normalizer.Form.NFKD rather than NFD. NFKD will convert things like ligatures into ASCII characters (e.g. "ﬁ" to "fi"); NFD will not do this.
  • dvlcube
    dvlcube over 6 years
    Normalizer.normalize("ãéío – o áá", Normalizer.Form.NFD).replaceAll("[^\\x00-\\x7F]", ""); yields "aeio o aa" but echo "ãéío – o áá" | iconv -f utf8 -t ascii//TRANSLIT yields "aeio - o aa". Is there a way to make java replace "–" with "-" like with iconv?
  • hem
    hem about 5 years
    Thanks a lot. You really saved my day.
  • M. Justin
    M. Justin over 3 years
    "[^\\p{ASCII}]" is an equivalent alternative to "[^\\x00-\\x7F]".