Reading File from Windows and Linux yields different results (character encoding?)

14,525

Solution 1

� is the three-byte sequence 0xEF 0xBF 0xBD, the UTF-8 representation of the Unicode codepoint 0xFFFD. That codepoint is the replacement character, which decoders substitute for illegal UTF-8 sequences.

Apparently, some routine involved in your code path on Linux is mishandling the PNG header. The PNG header starts with the byte 0x89 (followed by 0x50, 0x4E, 0x47), and this is handled correctly on Windows, which is presumably treating the file as a sequence of CP1252 bytes. In CP1252, the 0x89 byte is displayed as ‰.

On Linux, however, this byte is being decoded by a UTF-8 routine (or by a library that thought it appropriate to process the file as a UTF-8 sequence). On its own, 0x89 is not a valid codepoint in the ASCII-7 range (see the UTF-8 encoding scheme), so it cannot be mapped to a valid codepoint in the 0x00-0x7F range. Nor can it begin a valid multi-byte UTF-8 sequence, for all multi-byte sequences start with at least two bits set to 1 (11......); and since this is the first byte of the file, it cannot be a continuation byte either. The UTF-8 decoder therefore replaces 0x89 with the replacement character, encoded as the bytes 0xEF 0xBF 0xBD (somewhat silly, considering that the file is not UTF-8 to begin with), which is displayed in ISO-8859-1 as ï¿½.
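This substitution is easy to reproduce in Java: handing the raw PNG header bytes to a UTF-8 decoder yields U+FFFD for the 0x89 byte. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        // First four bytes of every PNG file: 0x89 'P' 'N' 'G'
        byte[] header = {(byte) 0x89, 0x50, 0x4E, 0x47};

        // 0x89 is neither a valid single-byte character nor a valid lead
        // byte in UTF-8, so the decoder substitutes U+FFFD for it
        String decoded = new String(header, StandardCharsets.UTF_8);
        System.out.println((int) decoded.charAt(0));  // 65533, i.e. 0xFFFD
        System.out.println(decoded.substring(1));     // PNG
    }
}
```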

If you need to resolve this problem, you'll need to ensure the following in Linux:

  • Read the bytes in the PNG file using a suitable encoding for the file (i.e. not UTF-8); this is apparently necessary if you are reading the file as a sequence of characters*, and not necessary if you are reading bytes alone. You might be doing this correctly already, so it would be worthwhile to verify the subsequent step(s) as well.
  • When viewing the contents of the file, use an editor/viewer that does not internally re-decode the file into a sequence of UTF-8 bytes. Using a suitable font will also help, for you want to avoid the scenario where the glyph (for 0xFFFD it is the diamond-shaped replacement character �) cannot be represented, which might result in further changes (unlikely, but you never know how the editor/viewer has been written).
  • It is also a good idea to write the files out (if you are doing so) in a suitable encoding - ISO-8859-1, perhaps, instead of UTF-8. If you are processing and storing the file contents in memory as bytes rather than characters, then writing them to an output stream (without involving any String or character references) is sufficient.

* Apparently, the Java Runtime will perform decoding of the byte sequence to UTF-16 codepoints, if you convert a sequence of bytes to a character or a String object.
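To illustrate the first bullet: reading the file strictly as bytes leaves 0x89 untouched, because no charset decoder is ever involved. A sketch (an in-memory stream stands in here for a FileInputStream over your actual file):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadHeaderBytes {
    // Read up to n bytes, with no character decoding anywhere
    static byte[] readBytes(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int off = 0;
        while (off < n) {
            int r = in.read(buf, off, n - off);
            if (r < 0) break;   // end of stream
            off += r;
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for new FileInputStream("image.png") -- the path is hypothetical
        InputStream in = new ByteArrayInputStream(
                new byte[]{(byte) 0x89, 0x50, 0x4E, 0x47});
        byte[] header = readBytes(in, 4);
        // 0x89 survives intact because no decoder ever touched it
        System.out.printf("0x%02X%n", header[0] & 0xFF);   // 0x89
    }
}
```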

Solution 2

In Java, String ≠ byte[].

  • byte[] represents raw binary data.
  • String represents text, which has an associated charset/encoding to be able to tell which characters it represents.

Binary Data ≠ Text.

Text data inside a String uses Unicode/UTF-16 as its charset/encoding (or Unicode/mUTF-8 when serialized). Whenever you convert between a String and something that is not a String, you need to specify a charset/encoding for the non-String text data (even if you do so implicitly, via the platform's default charset).

A PNG file contains raw binary data that represents an image (and associated metadata), not text. Therefore, you should not treat it as text.

\x89PNG is not text; it is just a "magic" header for identifying PNG files. 0x89 isn't even a character, merely an arbitrary byte value, and its only sane representations for display are things like \x89, 0x89, ... Likewise, the PNG in there is really binary data; it could just as well have been 0xdeadbeef and nothing would change. The fact that PNG happens to be human-readable is just a convenience.
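Since the header is a magic number rather than text, the natural way to recognize a PNG is a byte comparison against the full 8-byte signature (89 50 4E 47 0D 0A 1A 0A), never a string comparison. A sketch:

```java
import java.util.Arrays;

public class PngMagic {
    // The complete 8-byte PNG file signature
    static final byte[] PNG_SIGNATURE = {
        (byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A
    };

    // Compare raw bytes; no charset is involved anywhere
    static boolean isPng(byte[] data) {
        return data.length >= 8
            && Arrays.equals(Arrays.copyOfRange(data, 0, 8), PNG_SIGNATURE);
    }

    public static void main(String[] args) {
        System.out.println(isPng(PNG_SIGNATURE));          // true
        System.out.println(isPng("not a png".getBytes())); // false
    }
}
```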

Your problem comes from the fact that your protocol mixes text and binary data, while Java (unlike some other languages, like C) treats binary data differently than text.

Java provides *InputStream for reading binary data, and *Reader for reading text. I see two ways to deal with input:

  • Treat everything as binary data. When you read a whole text line, convert it into a String, using the appropriate charset/encoding.
  • Layer an InputStreamReader on top of an InputStream: access the InputStream directly when you want binary data, and the InputStreamReader when you want text.

You may want buffering; in the second case, the correct place to put it is below the *Reader. If you used a BufferedReader, it would probably consume more input from the InputStream than it should. So you would have something like:

 ┌───────────────────┐
 │ InputStreamReader │
 └───────────────────┘
          ↓
┌─────────────────────┐
│ BufferedInputStream │
└─────────────────────┘
          ↓
   ┌─────────────┐
   │ InputStream │
   └─────────────┘

You would use the InputStreamReader to read text, then you would use the BufferedInputStream to read an appropriate amount of binary data from the same stream.
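A minimal sketch of that construction (with one caveat: InputStreamReader performs read-ahead of its own, so interleaving the two views naively can still lose bytes; an in-memory stream stands in here for your real input):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class LayeredViews {
    public static void main(String[] args) throws IOException {
        // Stand-in for new FileInputStream("message.mime"); the path is hypothetical
        InputStream binaryView = new BufferedInputStream(
                new ByteArrayInputStream("A rest-of-stream".getBytes(StandardCharsets.ISO_8859_1)));
        Reader textView = new InputStreamReader(binaryView, StandardCharsets.ISO_8859_1);

        // Use textView when the protocol says "text comes next" ...
        System.out.println((char) textView.read());   // A
        // ... and binaryView when it says "binary comes next". Beware:
        // the Reader may already have read ahead from the stream below it.
    }
}
```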

A problematic case is recognizing both "\r" (old Mac OS) and "\r\n" (DOS/Windows) as line terminators. In that case, you may end up reading one byte too many. You could take the approach of the deprecated DataInputStream.readLine() method: transparently wrap the internal InputStream in a PushbackInputStream and unread that byte.
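A sketch of that pushback technique: scan for a terminator, then unread the byte that turned out not to be part of it:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class CrLfDemo {
    public static void main(String[] args) throws IOException {
        // Line "a" terminated by a bare '\r', followed by more data ('X')
        PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(new byte[]{'a', '\r', 'X'}));

        int b;
        while ((b = in.read()) >= 0 && b != '\r' && b != '\n') {
            // consume line content
        }
        if (b == '\r') {
            // Peek at the next byte to check for "\r\n"
            int next = in.read();
            if (next != '\n' && next >= 0) {
                in.unread(next);   // not part of the terminator: push it back
            }
        }
        System.out.println((char) in.read());  // X -- the pushed-back byte
    }
}
```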

However, since you don't appear to have a Content-Length, I would recommend the first way, treating everything as binary, and convert to String only after reading a whole line. In this case, I would treat the MIME delimiter as binary data.
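A minimal sketch of that first approach: accumulate raw bytes up to a line terminator, and only then decide whether the line is text (convert it with an explicit charset) or binary (keep the bytes). The MIME payload below is a made-up stand-in, and only '\n' terminators are handled:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BinaryLineReader {
    // Read bytes up to and excluding '\n'; returns null at end of stream
    static byte[] readLineBytes(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) return null;
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        while (b >= 0 && b != '\n') {
            line.write(b);
            b = in.read();
        }
        return line.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical fragment: a text header line followed by binary data
        byte[] mime = "Content-Type: image/png\n\u0089PNG..."
                .getBytes(StandardCharsets.ISO_8859_1);
        InputStream in = new ByteArrayInputStream(mime);

        // Header lines are text: convert with an explicit charset
        String header = new String(readLineBytes(in), StandardCharsets.ISO_8859_1);
        System.out.println(header);   // Content-Type: image/png
        // The rest is binary: keep reading it as bytes, never as a String
    }
}
```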

Output:

Since you are dealing with binary data, you cannot just println() it. PrintStream has write() methods that can deal with binary data (e.g., for outputting to a binary file).
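The difference in one sketch (a ByteArrayOutputStream stands in for a real FileOutputStream): write() copies the bytes verbatim, whereas println(new String(bytes)) would push them through charset conversions first.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class BinaryOutput {
    public static void main(String[] args) {
        byte[] png = {(byte) 0x89, 0x50, 0x4E, 0x47};

        // Stand-in for new FileOutputStream("out.png")
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink);

        out.write(png, 0, png.length);  // write() copies bytes verbatim
        out.flush();

        // println(new String(png)) would instead decode and re-encode the bytes
        System.out.printf("0x%02X%n", sink.toByteArray()[0] & 0xFF);  // 0x89
    }
}
```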

Or maybe your data has to be transported on a channel that treats it as text. Base64 is designed for that exact situation (transporting binary data as ASCII text). Base64 encoded form uses only US_ASCII characters, so you should be able to use it with any charset/encoding that is a superset of US_ASCII (ISO-8859-*, UTF-8, CP-1252, ...). Since you are converting binary data to/from text, the only sane API for Base64 would be something like:

String Base64Encode(byte[] data);
byte[] Base64Decode(String encodedData);

which is basically what the internal java.util.prefs.Base64 uses.
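Since Java 8, the public java.util.Base64 class provides exactly this shape of API. A round-trip sketch over the PNG header bytes:

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        byte[] data = {(byte) 0x89, 0x50, 0x4E, 0x47};

        // bytes -> ASCII text
        String encoded = Base64.getEncoder().encodeToString(data);
        System.out.println(encoded);  // iVBORw==

        // ASCII text -> the original bytes, unchanged
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(Arrays.equals(data, decoded));  // true
    }
}
```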

Conclusion:

In Java, String ≠ byte[].

Binary Data ≠ Text.

Author: Maurice

Updated on June 20, 2022

Comments

  • Maurice
    Maurice almost 2 years

    Currently I'm trying to read a file in a mime format which has some binary string data of a png.

    In Windows, reading the file gives me the proper binary string, meaning I just copy the string over and change the extension to png and I see the picture.


    An example after reading the file in Windows is below:

        --fh-mms-multipart-next-part-1308191573195-0-53229
         Content-Type: image/png;name=app_icon.png
         Content-ID: "<app_icon>"
         content-location: app_icon.png
    
        ‰PNG
    

    etc...etc...

    An example after reading the file in Linux is below:

        --fh-mms-multipart-next-part-1308191573195-0-53229
         Content-Type: image/png;name=app_icon.png
         Content-ID: "<app_icon>"
         content-location: app_icon.png
    
         �PNG
    

    etc...etc...


    I am not able to convert the Linux version into a picture as it all becomes some funky symbols with a lot of upside down "?" and "1/2" symbols.

    Can anyone enlighten me on what is going on and maybe provide a solution? Been playing with the code for a week and more now.

  • Maurice
    Maurice almost 13 years
    Hi Vineet, Great writeup! The thing I want to do is to split the parts into String[] to manipulate the data because I need to encode the png binary into base64. Gonna test this out. Thanks!
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    I don't think you need to split the file. You can feed in the byte stream to a Base64 encoder like Apache Commons Codec, which will do the job for you. What would be necessary is to read the file in the appropriate encoding.
  • ninjalj
    ninjalj almost 13 years
There's no appropriate encoding. The file should be read as a sequence of bytes.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    @ninjalj, you're right. I was mistaken. I was referring to FileOutputStream and not FileInputStream. Edit: Mistaken again. I meant InputStreamReader and OutputStreamWriter. There is also the String object which can be in a different charset if read from a stream with a different encoding.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    Turns out I'm partially right. If the plain FileInputStream and FileOutputStream classes are used, then one doesn't need to be worried about encoding, except when converting the bytes to a String. If the InputStreamReader and OutputStreamWriter classes are used, then it is necessary to know the encoding.
  • Maurice
    Maurice almost 13 years
    I'm copying from a FileInputStream to a ByteArrayOutputStream and returning the byte[]. String testString = new String(bytes); System.out.println(testString); The testString returns me the ISO-8859-1 looking String. Where should I put the encoding to make it look like the CP1252 string? thanks.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    Are you referring to Base64 encoding, if so, that appears to be fine. Out of curiosity, where did you see this problem with the Unicode replacement characters? Was it gEdit by any chance?
  • Maurice
    Maurice almost 13 years
    Can't use the Commons codec unless they handle just a multi part mime for me. The thing is I'm getting a ByteArrayInputStream in the AttachmentPart.getContent() and having trouble extracting the png portion of the mime and converting it to base64. Before converting it to base64 I want to test if the binary data I received is valid. Hence all the problems :(
  • Maurice
    Maurice almost 13 years
    Saw it in the terminal of Putty on a Windows machine SSH into a Linux machine.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    CP1252 is a Windows codepage. I don't think it is a good idea to explicitly store it in that encoding. Besides, why aren't you feeding the bytearray directly to Base64 Codec from Commons Codec? It is much easier to deal with ASCII output (i.e., the base64 encoded string). Edit: Changed link to Base64 javadoc.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    I can barely imagine how many decoders are used when you view it using Putty. The terminal console font, or the file encoding, and several other variables would have played havoc.
  • Maurice
    Maurice almost 13 years
    Because they only want the Binary portion i.e. the image/png to be in Base64 Encoding. Others like text in the mime remain as text. The ByteArrayInputStream appears something like this: --boundary smil stuff --boundary Content-Type: image/png;name=app_icon.png Content-ID: "<app_icon>" content-location: app_icon.png All the binary stuff (This portions changes to base64 only) --boundary text stuff --boundary--
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    Then you could ignore the first 8 bytes of every PNG file, and you could still work off the byte-array. In my opinion, it is not worth the trouble of stuffing a string with that data, for all String objects are created as UTF-16 strings, unless an explicit charset is specified.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    One possible way to possibly determine if you are Base64 encoding the right input (in the right encoding), is to also decode it back to a binary file and match the content with the original binary data from the PNG file.
  • Maurice
    Maurice almost 13 years
    That's the problem. I tried extracting the chunk of binary out which gives me ISO-8859-1 characters. I can't encode these in base64 and send it over because this will lead to them decoding the base64 and getting the ISO-8859-1 characters. They want a format where after decoding they are able to see the image just by changing the file extension.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    I'm mistaken once again (having been doing this quite some time and I get a bit cocky). Your statement String testString = new String(bytes); will return a string with the same encoding as the platform's default encoding. So you should be safe if your platform encoding is ISO-8859-1.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    @Maurice, in that case, the receiver (whoever they are) ought to decode the base64 string into a byte array. If they've provided the original requirement of using base64, surely they should be handling this payload as well. My point of contention is that, if you need to verify that they can do this without a problem in your provided data, then you'll need to compare the decoded byte array yourself to see if the encoding has been handled correctly.
  • Maurice
    Maurice almost 13 years
    Will this work? String content = //ISO-8859-1 looking data after reading from ByteArrayInputStream. byte[] encoded = Base64.encodeBase64(content.getBytes()); Then I use the encoded string. Will the other end be able to decode and get the png?
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    Yes, since you are reading the file into a sequence of bytes (where no decoding/encoding operation is applied), then treating it as the input to a Base64 encoder, which returns a sequence of ASCII-7 (not ISO-8859-1 actually) characters that represents the original bytes. If you decode the Base64 sequence to a byte array, you must and should get back the original byte sequence, to enable obtaining the PNG.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    You're welcome. PS: Don't forget to up vote the stuff that is useful on this site.
  • Maurice
    Maurice almost 13 years
    Ack. Sad update. I tried the same piece of code in both platforms encoding the same binary string into base64 and both return a different base64 string.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    Are the MD5/SHA1 hash of the files on platforms the same or different? On Windows, you can use Cygwin to obtain md5sum and sha1sum utilities. That is a better indicator of whether your algorithm is working or not. An even better indicator is to apply your scheme to the entire file and compare the outputs across the platforms.
  • Maurice
    Maurice almost 13 years
    Nevermind. Now it works. I must have missed something. Cheers!
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    Ah well, that's good to know. I need to have some sleep now, and this problem although very interesting has drained me. Have a good day.
  • Maurice
    Maurice almost 13 years
    Ack! I overlooked something. I directly encoded the byte arrays read from the files so those work. But if convert the byte array to a string first then encode it, I get different results. Sucks.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    "But if convert the byte array to a string first", because that would depend on the platform encoding like I pointed above. Windows might be using cp1252, while Linux might be using iso-8859-1. Read more on this at this question. The simple answer is to do one of 1) specify the encoding explicitly and use InputStreamReader and OutputStreamWriter, or 2) work only off bytes, and never convert them to Strings (otherwise the bytes will be encoded to a different encoding).
  • Maurice
    Maurice almost 13 years
The Linux encoding is UTF-8. I have no idea why, after it reads a file in CP1252 format, it converts the string to an ISO-8859-1 format. On the Mac it reads the file as UTF-8.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
You can perform a decoding process from CP1252 to UTF-8 by using String str = new String(byteArray, "UTF-8"); That is option (1), mentioned above. You'll need to ensure that the internal encoding of the Strings you are using is consistent throughout the app, though.
  • Maurice
    Maurice almost 13 years
doesn't work. still getting ISO-8859-1 string. mind boggling. changed the putty output to windows 1252 and still nothing.
  • Vineet Reynolds
    Vineet Reynolds almost 13 years
    From the base64 encoding routine? That is understandable. Base64 does not need any UTF-8 characters in it, except for the ones in ASCII-7.
  • Maurice
    Maurice almost 13 years
Nah, not from the base64 routine. Currently reading a cp1252 png file on my Linux machine. When I do String str = new String(byteArray, "UTF-8"); I get the ISO-8859-1 string. Basically even if I do get it back to the UTF-8 format, it doesn't become a format that can change into a png file. It will be a png file with text as UTF-8.
  • Maurice
    Maurice almost 13 years
    Sigh if only they did not have to have an inner mime, things will be easier. LOL.
  • Maurice
    Maurice almost 13 years
    Something to note. Once you do String s = new String(bytes);, the size of the bytes increase! What is going on?
  • Lasse V. Karlsen
    Lasse V. Karlsen almost 13 years
    Hi guys, please don't use the commenting system as a chat room. It is for leaving a few comments and prods for more information to a question or answer, not for long debates. The reason behind this is that most of the time (and this is one of them), a lot if not all the comments belong as edits to the question/answer to make that more complete. If I have to read a half-page answer + 3 pages of comments, the focus on the comments is too big. Please edit in pertinent details into the answer instead. If you really need to chat, find/create a chat-room on the Chat site, link at the top of the page
  • dan04
    dan04 almost 13 years
    This should be the accepted answer. A "suitable encoding for the PNG file" is as nonsensical as "the color palette of a TXT file".