ISO-8859-1 encoding and binary data preservation

Solution 1

"\u00F6" is not a byte array. It's a string containing a single char. Execute the following test instead:

import java.util.Arrays;

public class RoundTripTest {
    public static void main(String[] args) throws Exception {
        byte[] b = new byte[] {(byte) 0x00, (byte) 0xf6};
        String s = new String(b, "ISO-8859-1"); // decoding
        byte[] b2 = s.getBytes("ISO-8859-1"); // encoding
        System.out.println("Are the bytes equal : " + Arrays.equals(b, b2)); // true
    }
}

To check that this is true for any byte, just extend the code and loop through all 256 byte values:

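import java.util.Arrays;

public class AllBytesRoundTripTest {
    public static void main(String[] args) throws Exception {
        // Fill the array with every possible byte value, 0x00 through 0xFF.
        byte[] b = new byte[256];
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) i;
        }
        String s = new String(b, "ISO-8859-1"); // decoding
        byte[] b2 = s.getBytes("ISO-8859-1"); // encoding
        System.out.println("Are the bytes equal : " + Arrays.equals(b, b2)); // true
    }
}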

ISO-8859-1 is a standard encoding. So the language used (Java, C# or whatever) doesn't matter.

Here's a Wikipedia reference that claims that every byte is covered:

In 1992, the IANA registered the character map ISO_8859-1:1987, more commonly known by its preferred MIME name of ISO-8859-1 (note the extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet. This map assigns the C0 and C1 control characters to the unassigned code values thus provides for 256 characters via every possible 8-bit value.

(emphasis mine)

Solution 2

For an encoding to retain the original binary data, it needs to map every unique byte sequence to a unique character sequence.

This rules out all multi-byte encodings (UTF-8/16/32, Shift-JIS, Big5 etc.) because not every byte sequence is valid in them, so invalid sequences decode to some replacement character (usually ? or �). Once the bytes have been decoded, there is no way to tell from the string what caused the replacement character.
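
As a minimal sketch of that loss, the following round-trips two bytes through UTF-8. The class name Utf8LossDemo and the helper roundTrip are made up for illustration. It relies on the documented behavior of new String(byte[], Charset), which substitutes the charset's replacement string for malformed input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8LossDemo {
    // Hypothetical helper: decode the bytes as UTF-8, then encode the string back to UTF-8.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0x41 is 'A'; 0xFF can never appear in well-formed UTF-8.
        byte[] original = {(byte) 0x41, (byte) 0xFF};
        byte[] result = roundTrip(original);
        System.out.println(Arrays.toString(original)); // [65, -1]
        System.out.println(Arrays.toString(result));   // [65, -17, -65, -67]
        System.out.println("Are the bytes equal : " + Arrays.equals(original, result)); // false
    }
}
```

The invalid byte comes back as the three-byte UTF-8 encoding of U+FFFD, so the original 0xFF is gone.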

Another option is to ignore the invalid bytes, but this means that infinitely many different byte sequences decode to the same string. You could instead replace invalid bytes with their hex encoding in the string, like "0xFF". But there is no way to tell whether the original bytes legitimately decoded to the text "0xFF", so that doesn't work either.
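
A rough sketch of the "ignore invalid bytes" strategy, using a CharsetDecoder configured with CodingErrorAction.IGNORE. The class and method names are hypothetical, and only the ignore case is shown, not the hex-escaping one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class IgnoreInvalidDemo {
    // Hypothetical helper: decode as UTF-8, silently dropping malformed bytes.
    static String decodeIgnoringInvalid(byte[] input) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.IGNORE)
                    .onUnmappableCharacter(CodingErrorAction.IGNORE)
                    .decode(ByteBuffer.wrap(input))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected when both error actions are IGNORE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] a = {(byte) 0x41};
        byte[] b = {(byte) 0x41, (byte) 0xFF};
        byte[] c = {(byte) 0xFF, (byte) 0x41, (byte) 0xFF, (byte) 0xFF};
        // Three different byte sequences, one decoded string.
        System.out.println(decodeIgnoringInvalid(a)); // A
        System.out.println(decodeIgnoringInvalid(b)); // A
        System.out.println(decodeIgnoringInvalid(c)); // A
    }
}
```

Since any number of 0xFF bytes can be sprinkled in without changing the result, the string alone cannot identify which input it came from.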

This leaves 8-bit encodings, where every sequence is just a single byte. The single byte is valid if there is a mapping for it. But many 8-bit encodings have holes and don't encode 256 different characters.
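
To see what a "hole" does in practice, here is a small sketch that counts how many single byte values fail a decode-then-encode round trip. US-ASCII is used as the stand-in for a single-byte charset with unmapped values only because every Java platform is guaranteed to ship it; other single-byte charsets (windows-1252, for example, if installed) could be passed the same way. The names SingleByteHolesDemo and countLossyBytes are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SingleByteHolesDemo {
    // Counts the byte values that do not survive decode-then-encode in the given charset.
    // Intended for single-byte charsets only.
    static int countLossyBytes(Charset charset) {
        int lossy = 0;
        for (int value = 0; value < 256; value++) {
            byte[] original = {(byte) value};
            byte[] result = new String(original, charset).getBytes(charset);
            if (result.length != 1 || result[0] != original[0]) {
                lossy++;
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        // US-ASCII has no mapping for 0x80..0xFF, so half of all byte values are lost.
        System.out.println("US-ASCII   : " + countLossyBytes(StandardCharsets.US_ASCII));   // 128
        // ISO-8859-1 maps all 256 byte values, so nothing is lost.
        System.out.println("ISO-8859-1 : " + countLossyBytes(StandardCharsets.ISO_8859_1)); // 0
    }
}
```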

To retain the original binary data, you need an 8-bit encoding that maps all 256 byte values to distinct characters. ISO-8859-1 is not unique in this. What is unique to it is that each decoded code point's value equals the value of the byte it was decoded from.

So if you have the decoded string and the encoded bytes, then it is always true that

(byte)str.charAt(i) == bytes[i] 

for arbitrary binary data where str is new String(bytes, "ISO-8859-1") and bytes is a byte[].
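
That identity can be checked over every byte value with a short sketch like the one below (Latin1IdentityDemo and codePointsMatchBytes are illustrative names, not from the original answer):

```java
import java.nio.charset.StandardCharsets;

final class Latin1IdentityDemo {
    // Returns true if, for every byte value, the decoded char equals the unsigned byte value.
    static boolean codePointsMatchBytes() {
        byte[] bytes = new byte[256];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) i;
        }
        String str = new String(bytes, StandardCharsets.ISO_8859_1);
        if (str.length() != bytes.length) {
            return false;
        }
        for (int i = 0; i < bytes.length; i++) {
            // Java bytes are signed, so mask to compare against the char's numeric value.
            if (str.charAt(i) != (bytes[i] & 0xFF)) {
                return false;
            }
            // The form used in the text: narrowing the char gives back the same byte.
            if ((byte) str.charAt(i) != bytes[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("Code points match bytes : " + codePointsMatchBytes()); // true
    }
}
```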


It also has nothing to do with Java. I have no idea what his comment means; these are properties of character encodings, not of programming languages.

Author: Mr_and_Mrs_D

Updated on June 23, 2022

Comments

  • Mr_and_Mrs_D
    Mr_and_Mrs_D almost 2 years

    I read in a comment to an answer by @Esailija to a question of mine that

    ISO-8859-1 is the only encoding to fully retain the original binary data, with exact byte<->codepoint matches

    I also read in this answer by @AaronDigulla that :

    In Java, ISO-8859-1 (a.k.a ISO-Latin1) is a 1:1 mapping

    I need some insight on this. This will fail (as illustrated here) :

    // \u00F6 is ö
    System.out.println(Arrays.toString("\u00F6".getBytes("utf-8")));
    // prints [-61, -74]
    System.out.println(Arrays.toString("\u00F6".getBytes("ISO-8859-1")));
    // prints [-10]
    

    Questions

    1. I admit I do not quite get it - why does it not get the bytes in the code above ?
    2. Most importantly, where is this (byte preserving behavior of ISO-8859-1) specified - links to source, or JLS would be nice. Is it the only encoding with this property ?
    3. Is it related to ISO-8859-1 being the default default ?

    See also this question for nice counter examples from other charsets.

  • Mr_and_Mrs_D
    Mr_and_Mrs_D about 11 years
    Thank you - this answers 1 - I really need some authoritative link to ascertain/deny 2) - or some explanation - namely the byte preserving behavior of ISO-8859-1 - and that it is the only one (in java and possibly other languages I guess like C#). For other charsets see : stackoverflow.com/questions/2544965/…
  • Mr_and_Mrs_D
    Mr_and_Mrs_D about 11 years
    Thanks - do you believe it is the only ENCODING with this property : byte[] b = newByteArray(); Arrays.equals(b, new String(b, ENCODING).getBytes(ENCODING)); // always true ?
  • JB Nizet
    JB Nizet about 11 years
    I don't know if it's the only one, no.
  • Esailija
    Esailija about 11 years
    @Mr_and_Mrs_D it is the only one where the decoded code point's value is also the byte's value it was decoded from (\u00F6 <-> 0xF6). That's what I meant. So you have the decoded string, and encoded bytes, then it is always (byte)str.charAt(i) == bytes[i] for arbitrary binary data where str is new String(bytes, "ISO-8859-1")
  • Esailija
    Esailija about 11 years
    @Mr_and_Mrs_D it is also a rare property but not unique to ISO-8859-1 for a lossless round-trip: bytes -> string -> bytes with arbitrary binary data.
  • Mr_and_Mrs_D
    Mr_and_Mrs_D about 11 years
    The comment was meant to put this all in code perspective (notice also "In Java, ISO-8859-1 (a.k.a ISO-Latin1) is a 1:1 mapping")- aka I don't know how all this would look in C - very informative answer @JBNizet ("But what it is unique in, is that the decoded code point's value is also the byte's value it was decoded from") +1 :)
  • Ludovic Kuty
    Ludovic Kuty about 5 years
    I overlooked that information from Wikipedia on their ISO-8859-1 page. Thanks for emphasizing it.
  • Ludovic Kuty
    Ludovic Kuty about 5 years
    FYI I detailed why new String(raf.readLine().getBytes("ISO-8859-1"), "UTF-8") does the job when wanting to read UTF-8 text with RandomAccessFile and get a String in this answer.
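
The idiom from the last comment can be sketched as follows. RandomAccessFile.readLine() is documented to turn each byte into one char with the high eight bits set to zero, so encoding its result as ISO-8859-1 recovers the raw bytes, which can then be decoded as UTF-8. The class name and the temp-file helper are assumptions made for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class RandomAccessUtf8Demo {
    // Re-interprets the byte-per-char string returned by readLine() as UTF-8 text.
    static String readUtf8Line(RandomAccessFile raf) throws IOException {
        String raw = raf.readLine();
        if (raw == null) {
            return null; // end of file
        }
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Hypothetical helper: writes the text as UTF-8 to a temp file and reads back its first line.
    static String firstLineOfTempFile(String text) {
        try {
            Path tmp = Files.createTempFile("latin1-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
                    return readUtf8Line(raf);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String line = firstLineOfTempFile("gr\u00F6\u00DFe\nsecond line\n");
        System.out.println(line.equals("gr\u00F6\u00DFe")); // true
    }
}
```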