UTF-8 to ASCII conversion in Java

11,513

Solution 1

This is an XY problem.

The problem here is that your String was created from bytes using an incorrect charset that assumes one byte is one character, such as ISO 8859-1.

But the bytes are not ASCII and they are not ISO 8859-1. The bytes are a UTF-8 representation of text.

Do not replace any characters. Do not normalize the string. The only correct solution is to revert the incorrectly decoded String back to bytes, then correctly decode the bytes using UTF-8:

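// Undo the wrong decoding: ISO 8859-1 maps each char back to the single byte it came from
byte[] originalBytes = str.getBytes(StandardCharsets.ISO_8859_1);

// Decode those bytes with the charset they were actually written in
str = new String(originalBytes, StandardCharsets.UTF_8);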
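To see the whole round trip in one place, here is a minimal sketch that first reproduces the damage and then reverses it. The class name `MojibakeRepair` and the method `repair` are made up for illustration, and the sketch assumes the wrong charset really was ISO 8859-1. The literals use Unicode escapes (`\u00B5` is µ, `\u00C2` is Â) so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class MojibakeRepair {

    /** Recovers the raw bytes via ISO 8859-1, then decodes them as UTF-8. */
    static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String correct = "100\u00B5F"; // 100µF

        // Reproduce the bug: UTF-8 bytes decoded with a one-byte-per-char charset.
        String broken = new String(correct.getBytes(StandardCharsets.UTF_8),
                                   StandardCharsets.ISO_8859_1);

        System.out.println(broken.equals("100\u00C2\u00B5F")); // true: 100ÂµF
        System.out.println(repair(broken).equals(correct));    // true
    }
}
```

The repair only works because ISO 8859-1 maps every byte value to exactly one char and back, so no information was lost by the wrong decoding.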

Solution 2

There is no µ char in ASCII, so you can't write it in ASCII.
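A quick way to check this claim is to ask the US-ASCII encoder directly. The sketch below is only an illustration; the class `AsciiCheck` and its method names are invented here. It relies on the documented behaviour of `String.getBytes(Charset)`, which substitutes the charset's replacement byte (`?` for US-ASCII) for unmappable characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class AsciiCheck {

    /** True if every character of s can be represented in US-ASCII. */
    static boolean isAsciiEncodable(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    /** Encodes to US-ASCII and back; unmappable characters come back as '?'. */
    static String encodeLossy(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String text = "100\u00B5F"; // 100µF
        System.out.println(isAsciiEncodable(text)); // false
        System.out.println(encodeLossy(text));      // 100?F
    }
}
```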

Java Strings are sequences of Unicode characters (internally encoded in UTF-16), so the problem you have depends either on how you read this string or on how you write it.

Normally this kind of problem is solved by creating an OutputStreamWriter(OutputStream out, String charsetName) or InputStreamReader(InputStream in, String charsetName) with the correct character set.

So if for example you get your string from an UTF-8 encoded file, you should create a reader like this:

Reader in = new InputStreamReader(new FileInputStream("some_file.txt"), "UTF-8");

Or if you are writing to an ISO-Latin-1 file you should create the Writer like this:

Writer out = new OutputStreamWriter(new FileOutputStream("some_file.txt"), "ISO-8859-1");
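Putting the reader and the writer together, the following sketch writes a string to a temporary file with one charset and reads it back with another, which makes the effect of a mismatch visible. The class `CharsetFileIo` and the method `roundTrip` are hypothetical names; it uses the `Charset` overloads of the constructors with `StandardCharsets` instead of charset name strings, which is a substitution on my part that avoids the checked `UnsupportedEncodingException`.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of any library.
class CharsetFileIo {

    /** Writes text to a temp file using writeCharset, then reads it back using readCharset. */
    static String roundTrip(String text, Charset writeCharset, Charset readCharset) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                try (Writer out = new OutputStreamWriter(
                        new FileOutputStream(file.toFile()), writeCharset)) {
                    out.write(text);
                }
                StringBuilder result = new StringBuilder();
                try (Reader in = new InputStreamReader(
                        new FileInputStream(file.toFile()), readCharset)) {
                    char[] buffer = new char[1024];
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        result.append(buffer, 0, n);
                    }
                }
                return result.toString();
            } finally {
                Files.deleteIfExists(file); // clean up the temp file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "100\u00B5F"; // 100µF
        // Matching charsets: the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(text)); // true
        // Mismatched charsets: the reader yields 100ÂµF.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)
                .equals("100\u00C2\u00B5F")); // true
    }
}
```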

Similar things can happen with HTTP requests and responses, depending on how the body of each is interpreted by the application server or the browser. If that's your case, add some detail to your question.

Author by

Admin

Updated on June 29, 2022

Comments

  • Admin
    Admin almost 2 years

    I have a string that contains UTF-8 encoded text.

    String str = "100ÂµF";
    

    And my desired output for the above string is "100µF".

    I searched StackOverflow and found the code below:

    public static String decompose(String s) {
        return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+","");
    }
    

    But the output I got for the above string was "100AµF".

  • Tom Blodget
    Tom Blodget over 5 years
    This is the only answer with the correct analysis. However, given the sample data, it is not definitive that ISO 8859-1 should be used to undo the damage. My system has 8 character encodings that would correct this sample: windows-1250, windows-1252, windows-1254, windows-1258, iso-8859-1, iso-8859-3, iso-8859-9, and iso-8859-15. At most one of them could be correct. @dev22intellial, if you can't find the incorrect code, you could feed in a comprehensive test dataset (say, a file with bytes 0-255) and check whether it can be reversed by exactly one character encoding.
  • Remy Lebeau
    Remy Lebeau over 5 years
    Alternatively, assuming the String was created by simply extending raw bytes to 16-bit chars without regard to any charsets, then you could just allocate a byte[] array of the same length and then truncate each 16-bit char back to an 8-bit byte.
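The two suggestions in the comments above can be sketched roughly as follows. The class `RepairCandidates` and both method names are invented for illustration. `charsetsThatRepair` lists every charset installed on the running JVM that would turn the broken sample back into the expected text, so its result varies from system to system; `repairByTruncation` assumes each char really is just a raw byte widened to 16 bits, which for chars up to U+00FF is the same thing ISO 8859-1 does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
class RepairCandidates {

    /** Names of installed charsets whose re-encoding of broken decodes as UTF-8 to expected. */
    static List<String> charsetsThatRepair(String broken, String expected) {
        List<String> names = new ArrayList<>();
        for (Charset cs : Charset.availableCharsets().values()) {
            if (!cs.canEncode()) {
                continue; // decode-only charsets cannot be used to recover bytes
            }
            try {
                String repaired = new String(broken.getBytes(cs), StandardCharsets.UTF_8);
                if (repaired.equals(expected)) {
                    names.add(cs.name());
                }
            } catch (RuntimeException e) {
                // Defensive: skip any charset implementation that fails on this input.
            }
        }
        return names;
    }

    /** Keeps only the low 8 bits of each char, then decodes the bytes as UTF-8. */
    static String repairByTruncation(String broken) {
        byte[] bytes = new byte[broken.length()];
        for (int i = 0; i < broken.length(); i++) {
            bytes[i] = (byte) broken.charAt(i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String broken = "100\u00C2\u00B5F";  // 100ÂµF
        String expected = "100\u00B5F";      // 100µF
        System.out.println(charsetsThatRepair(broken, expected)); // system-dependent list
        System.out.println(repairByTruncation(broken).equals(expected)); // true
    }
}
```

If the first method returns more than one name, the sample alone cannot tell you which charset caused the damage, which is the point of testing with a dataset that covers all byte values.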