Writing UTF-8 without BOM

26,327

"A" written using UTF-8 without a BOM produces exactly the same file as "A" written using ASCII, ISO-8859-*, or any other ASCII-compatible encoding. That file contains a single byte with the decimal value 65.

Think of it this way:

  • "A".getBytes("UTF-8") returns a new byte[] { 65 }
  • "A".getBytes("ISO-8859-1") returns a new byte[] { 65 }
  • You write the results of those calls into a file
  • How is the consumer of the file supposed to distinguish the two?

There's nothing in that file that suggests that UTF-8 needs to be used to decode it.
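The comparison above can be checked directly. The following is a minimal sketch, not part of the original answer; the class name `SameBytesDemo` and the helper `sameBytes` are made up for illustration. It encodes a string with three ASCII-compatible charsets and compares the resulting bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: shows that pure-ASCII text encodes to the same bytes
// in UTF-8, ISO-8859-1 and US-ASCII.
class SameBytesDemo {

    // Returns true if the text encodes to identical bytes in all three charsets.
    static boolean sameBytes(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(utf8, latin1) && Arrays.equals(utf8, ascii);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_8)));      // [65]
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.ISO_8859_1))); // [65]
        System.out.println(sameBytes("A"));                                             // true
    }
}
```

Since the bytes are identical, whatever reads the file has no way to recover which of the three charsets was used to write it.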

Try writing "Käsekuchen" or something else that's not encodable in ASCII and see whether Notepad++ guesses the encoding correctly. That is exactly what it does: it makes an educated guess, because there is no metadata that tells it which encoding to use.
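To see why a non-ASCII word gives an editor something to work with, here is a rough sketch of one possible heuristic: checking whether the bytes are well-formed UTF-8. This is an assumption about how a detector might work, not a description of what Notepad++ actually does, and the names `EncodingGuessDemo` and `isValidUtf8` are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: one heuristic an editor could use when guessing an encoding.
class EncodingGuessDemo {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The word from the answer; the a-umlaut is written as \u00e4 so that the
        // encoding of this source file does not matter.
        String text = "K\u00e4sekuchen";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(utf8.length);         // 11: the umlaut becomes the two bytes C3 A4
        System.out.println(latin1.length);       // 10: the umlaut becomes the single byte E4
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false: E4 is not followed by continuation bytes
    }
}
```

The check only works in one direction. Bytes that are not valid UTF-8 rule UTF-8 out, but bytes that are valid UTF-8 are also valid ISO-8859-1 (every byte sequence is), so picking UTF-8 for them is still a guess, just a well-founded one.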

Author by

Mawia

Java Programmer

Updated on July 19, 2022

Comments

  • Mawia
    Mawia almost 2 years

    This code,

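    // Note: getBytes() without an argument uses the platform default charset, which is not necessarily UTF-8
    OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
    out.write("A".getBytes());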
    

    And this,

    OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
    out.write("A".getBytes(StandardCharsets.UTF_8));
    

    produce the same result (in my opinion), which is UTF-8 without BOM. However, Notepad++ is not showing any information about the encoding. I expect Notepad++ to show "Encode in UTF-8 without BOM" here, but nothing is selected in the "Encoding" menu.

    Now, this code writes the file in UTF-8 with a BOM.

     OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
     // 239, 187, 191 (0xEF 0xBB 0xBF) is the UTF-8 byte order mark
     byte[] bom = { (byte) 239, (byte) 187, (byte) 191 };
     out.write(bom);
     out.write("A".getBytes());
    

    In this case, Notepad++ displays the encoding as "Encode in UTF-8".

    Question: What is wrong with the first two snippets, which are supposed to write the file in UTF-8 without BOM? Is my Java code doing the right thing? If so, is there a problem with how Notepad++ detects the encoding?

    Is notepad++ only guessing around?

  • Mawia
    Mawia over 10 years
    Do you mean that Notepad++ is only guessing around?
  • Joachim Sauer
    Joachim Sauer over 10 years
    @Mawia: yes, exactly. "Plain text" has no metadata that would tell it the encoding (except if there is a BOM, of course), so it uses a set of heuristics to guess which encoding is most likely. And that's not really the fault of Notepad++: there's nothing much you can do other than guessing (you could ask the user every time, but that would get annoying quickly).
  • Mawia
    Mawia over 10 years
    OK, I think that makes sense, 'cause when I write it in UTF-16, Notepad++ shows it as "Encode in UCS-2 Big Endian". So, Notepad++ is simply guessing around, right?
  • Joachim Sauer
    Joachim Sauer over 10 years
    @Mawia: I already wrote in the answer that it guesses, and I confirmed it in my comment above. Are you waiting for a third confirmation? ;-) Some encodings have "more obvious" tells than others: UTF-16, for example, can often be detected if every second byte is 0 (for English-language text), while UTF-8 can be detected by some common byte sequences (and by other byte sequences that can never occur in it). Other encodings can be "detected" by statistical analysis of the byte values. But all of that is really just guessing.
  • HookUp
    HookUp about 10 years
    A better understanding here