Byte order mark screws up file reading in Java

java utf-8 byte-order-mark

85,987

Solution 1

EDIT: I've made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream

Here is a class I coded a while ago, I just edited the package name before pasting. Nothing special, it is quite similar to solutions posted in SUN's bug database. Incorporate it in your code and you're fine.

/* ____________________________________________________________________________
 * 
 * File:    UnicodeBOMInputStream.java
 * Author:  Gregory Pakosz.
 * Date:    02 - November - 2005    
 * ____________________________________________________________________________
 */
package com.stackoverflow.answer;

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/**
 * The <code>UnicodeBOMInputStream</code> class wraps any
 * <code>InputStream</code> and detects the presence of any Unicode BOM
 * (Byte Order Mark) at its beginning, as defined by
 * <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
 * 
 * <p>The
 * <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
 * defines 5 types of BOMs:<ul>
 * <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
 * <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
 * <li><pre>FE FF        = UTF-16, big-endian</pre></li>
 * <li><pre>FF FE        = UTF-16, little-endian</pre></li>
 * <li><pre>EF BB BF     = UTF-8</pre></li>
 * </ul></p>
 * 
 * <p>Use the {@link #getBOM()} method to know whether a BOM has been detected
 * or not.
 * </p>
 * <p>Use the {@link #skipBOM()} method to remove the detected BOM from the
 * wrapped <code>InputStream</code> object.</p>
 */
public class UnicodeBOMInputStream extends InputStream
{
  /**
   * Type safe enumeration class that describes the different types of Unicode
   * BOMs.
   */
  public static final class BOM
  {
    /**
     * NONE.
     */
    public static final BOM NONE = new BOM(new byte[]{},"NONE");

    /**
     * UTF-8 BOM (EF BB BF).
     */
    public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
                                                       (byte)0xBB,
                                                       (byte)0xBF},
                                            "UTF-8");

    /**
     * UTF-16, little-endian (FF FE).
     */
    public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE},
                                                "UTF-16 little-endian");

    /**
     * UTF-16, big-endian (FE FF).
     */
    public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-16 big-endian");

    /**
     * UTF-32, little-endian (FF FE 00 00).
     */
    public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE,
                                                            (byte)0x00,
                                                            (byte)0x00},
                                                "UTF-32 little-endian");

    /**
     * UTF-32, big-endian (00 00 FE FF).
     */
    public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
                                                            (byte)0x00,
                                                            (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-32 big-endian");

    /**
     * Returns a <code>String</code> representation of this <code>BOM</code>
     * value.
     */
    public final String toString()
    {
      return description;
    }

    /**
     * Returns the bytes corresponding to this <code>BOM</code> value.
     */
    public final byte[] getBytes()
    {
      final int     length = bytes.length;
      final byte[]  result = new byte[length];

      // Make a defensive copy
      System.arraycopy(bytes,0,result,0,length);

      return result;
    }

    private BOM(final byte bom[], final String description)
    {
      assert(bom != null)               : "invalid BOM: null is not allowed";
      assert(description != null)       : "invalid description: null is not allowed";
      assert(description.length() != 0) : "invalid description: empty string is not allowed";

      this.bytes          = bom;
      this.description  = description;
    }

            final byte    bytes[];
    private final String  description;

  } // BOM

  /**
   * Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
   * specified <code>InputStream</code>.
   * 
   * @param inputStream an <code>InputStream</code>.
   * 
   * @throws NullPointerException when <code>inputStream</code> is
   * <code>null</code>.
   * @throws IOException on reading from the specified <code>InputStream</code>
   * when trying to detect the Unicode BOM.
   */
  public UnicodeBOMInputStream(final InputStream inputStream) throws  NullPointerException,
                                                                      IOException

  {
    if (inputStream == null)
      throw new NullPointerException("invalid input stream: null is not allowed");

    in = new PushbackInputStream(inputStream,4);

    final byte  bom[] = new byte[4];
    final int   read  = in.read(bom);

    switch(read)
    {
      case 4:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE) &&
            (bom[2] == (byte)0x00) &&
            (bom[3] == (byte)0x00))
        {
          this.bom = BOM.UTF_32_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0x00) &&
            (bom[1] == (byte)0x00) &&
            (bom[2] == (byte)0xFE) &&
            (bom[3] == (byte)0xFF))
        {
          this.bom = BOM.UTF_32_BE;
          break;
        }

      case 3:
        if ((bom[0] == (byte)0xEF) &&
            (bom[1] == (byte)0xBB) &&
            (bom[2] == (byte)0xBF))
        {
          this.bom = BOM.UTF_8;
          break;
        }

      case 2:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE))
        {
          this.bom = BOM.UTF_16_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0xFE) &&
            (bom[1] == (byte)0xFF))
        {
          this.bom = BOM.UTF_16_BE;
          break;
        }

      default:
        this.bom = BOM.NONE;
        break;
    }

    if (read > 0)
      in.unread(bom,0,read);
  }

  /**
   * Returns the <code>BOM</code> that was detected in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return a <code>BOM</code> value.
   */
  public final BOM getBOM()
  {
    // BOM type is immutable.
    return bom;
  }

  /**
   * Skips the <code>BOM</code> that was found in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return this <code>UnicodeBOMInputStream</code>.
   * 
   * @throws IOException when trying to skip the BOM from the wrapped
   * <code>InputStream</code> object.
   */
  public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
  {
    if (!skipped)
    {
      in.skip(bom.bytes.length);
      skipped = true;
    }
    return this;
  }

  /**
   * {@inheritDoc}
   */
  public int read() throws IOException
  {
    return in.read();
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[]) throws  IOException,
                                          NullPointerException
  {
    return in.read(b,0,b.length);
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[],
                  final int off,
                  final int len) throws IOException,
                                        NullPointerException
  {
    return in.read(b,off,len);
  }

  /**
   * {@inheritDoc}
   */
  public long skip(final long n) throws IOException
  {
    return in.skip(n);
  }

  /**
   * {@inheritDoc}
   */
  public int available() throws IOException
  {
    return in.available();
  }

  /**
   * {@inheritDoc}
   */
  public void close() throws IOException
  {
    in.close();
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void mark(final int readlimit)
  {
    in.mark(readlimit);
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void reset() throws IOException
  {
    in.reset();
  }

  /**
   * {@inheritDoc}
   */
  public boolean markSupported() 
  {
    return in.markSupported();
  }

  private final PushbackInputStream in;
  private final BOM                 bom;
  private       boolean             skipped = false;

} // UnicodeBOMInputStream

And you're using it this way:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public final class UnicodeBOMInputStreamUsage
{
  public static void main(final String[] args) throws Exception
  {
    FileInputStream fis = new FileInputStream("test/offending_bom.txt");
    UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);

    System.out.println("detected BOM: " + ubis.getBOM());

    System.out.print("Reading the content of the file without skipping the BOM: ");
    InputStreamReader isr = new InputStreamReader(ubis);
    BufferedReader br = new BufferedReader(isr);

    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();

    fis = new FileInputStream("test/offending_bom.txt");
    ubis = new UnicodeBOMInputStream(fis);
    isr = new InputStreamReader(ubis);
    br = new BufferedReader(isr);

    ubis.skipBOM();

    System.out.print("Reading the content of the file after skipping the BOM: ");
    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();
  }

} // UnicodeBOMInputStreamUsage

Solution 2

The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (javadoc):

BOMInputStream bomIn = new BOMInputStream(in);
int firstNonBOMByte = bomIn.read(); // Skips BOM
if (bomIn.hasBOM()) {
    // has a UTF-8 BOM
}

If you also need to detect different encodings, it can also distinguish among various different byte-order marks, e.g. UTF-8 vs. UTF-16 big + little endian - details at the doc link above. You can then use the detected ByteOrderMark to choose a Charset to decode the stream. (There's probably a more streamlined way to do this if you need all of this functionality - maybe the UnicodeReader in BalusC's answer?). Note that, in general, there's not a very good way to detect what encoding some bytes are in, but if the stream starts with a BOM, apparently this can be helpful.

Edit: If you need to detect the BOM in UTF-16, UTF-32, etc, then the constructor should be:

new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
        ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)

Upvote @martin-charlesworth's comment :)

Solution 3

Solution 4

Google Data API has an UnicodeReader which automagically detects the encoding.

You can use it instead of InputStreamReader. Here's an -slightly compactized- extract of its source which is pretty straightforward:

public class UnicodeReader extends Reader {
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte bom[] = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks.
        if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else {
            encoding = defaultEncoding;
            unread = n;
        }

        // Unread bytes if necessary and skip BOM marks.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        } else if (unread < -1) {
            pushbackStream.unread(bom, 0, 0);
        }

        // Use given encoding.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}

Solution 5

The Apache Commons IO Library's BOMInputStream has already been mentioned by @rescdsk, but I did not see it mention how to get an InputStream without the BOM.

Here's how I did it in Scala.

 import java.io._
 val file = new File(path_to_xml_file_with_BOM)
 val fileInpStream = new FileInputStream(file)   
 val bomIn = new BOMInputStream(fileInpStream, 
         false); // false means don't include BOM

View more solutions

85,987

Tom

Updated on July 08, 2022

Comments

Tom almost 2 years

I'm trying to read CSV files using Java. Some of the files may have a byte order mark in the beginning, but not all. When present, the byte order gets read along with the rest of the first line, thus causing problems with string compares.

Is there an easy way to skip the byte order mark when it is present?
- Chris almost 12 years
  
  maybe: rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html
Gregory Pakosz over 14 years

Sorry for the long scrolling areas, too bad there is no attachment feature
Tom over 14 years

Thanks Gregory, that's just what I'm looking for.
Gregory Pakosz over 14 years

You're welcome. I remember I discovered this problem after editing XML configuration files with the most widespread XML editor in the world: Notepad.exe which inserts a BOM when saving back a file that contains Unicode characters :)
codeporn about 13 years

Great, your answer helped me a lot! +1
xtofl almost 13 years

Great decorator! It may be a good idea to delegate the BOM recognition to the BOM class, too, though. Chain of responsibility, someone?
Gregory Pakosz over 12 years

yeah well the great chain of design patterns... ;)
Alvin about 12 years

Gregory, thanks for your solution. I'm going to use it in one of my project :)
atamanroman almost 12 years

Just skips the BOM. Should be the perfect solution for 99% of the use cases.
Denys Kniazhev-Support Ukraine over 11 years

This should be in core Java API
Varun Bhatia about 11 years

Why not add javaCharset key as a member in UnicodeBOMInputStream whose value can be used to read file accordingly in InputStreamReader isr = new InputStreamReader(ubis, ubis.getCharsetKey()) where getCharsetKey return the Java.charset values as per the BOM found.
Kevin Meredith over 10 years

I used this answer successfully. However, I would respectfully add the boolean arg for specifying whether to include or exclude the BOM. Example: BOMInputStream bomIn = new BOMInputStream(in, false); // don't include the BOM
Admin over 10 years

Thanks! Saved my day!
Vahid Pazirandeh almost 10 years

Very nice Andrei. But could you explain why it works? How does the pattern 0xFEFF successfully match UTF-8 files which seem to have a different pattern and 3 bytes instead of 2? And how can that pattern match both endians of UTF16 and UTF32?
Admin almost 10 years

As you can see - I don't use byte stream but character stream opened with expected charset. So if the first character from this stream is BOM - I skip it. BOM can have different byte representation for each encoding, but this is one character. Please read this article, it helps me: joelonsoftware.com/articles/Unicode.html
Snow almost 10 years

Nice solution, just make sure to check if file is not empty to avoid IOException in skip method before reading. You may do that by calling if (reader.ready()){ reader.read(possibleBOM) ... }
Martin Charlesworth almost 10 years

I would also add that this only detects UTF-8 BOM. If you want to detect all the utf-X BOMs then you need to pass them in to the BOMInputStream constructor. BOMInputStream bomIn = new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);
Gregory Pakosz almost 9 years

10 years have passed and I'm still receiving karma for this :D I'm looking at you Java!
SOUser almost 8 years

It seems that the link says Google Data API is deprecated ? Where should one look for the Google Data API now ?
bvdb almost 8 years

I see you have covered 0xFE 0xFF, which is the Byte order Mark for UTF-16BE. But, what if the first 3 bytes are 0xEF 0xBB 0xEF ? (the byte order mark for UTF-8). You claim that this works for all UTF-8 formats. Which could be true (I haven't tested your code), but then how does it work ?
BalusC almost 8 years

@XichenLi: GData API is been deprecated for its intented purpose. I didn't intend to suggest to use GData API directly (OP isn't using any GData service), but I intend to take over the source code as example for your own implementation. That's also why I included it in my answer, ready for copypaste.
Vladimir Vagaytsev almost 8 years

Single arg constructor does it: public BOMInputStream(InputStream delegate) { this(delegate, false, ByteOrderMark.UTF_8); }. It excludes UTF-8 BOM by default.
Kevin Meredith almost 8 years

Good point, Vladimir. I see that in its docs - commons.apache.org/proper/commons-io/javadocs/api-2.2/org/…: Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.
Admin almost 8 years

See my answer to Vahid: I open not the byte stream but character stream and read one character from it. Never mind what utf encoding used for file - bom prefix can represented by different count of bytes, but in terms of characters it's just one character
mike rodent about 7 years

Using your code I managed to solve this BOM problem... but strangely your code as written didn't work for me: int read = in.read( bom ) in fact returned 4 for me, not 3, so everything went wrong, despite the fact that this was a UTF-8 BOM. I followed up with in.skip( 3 )... and was then able to SAX parse my file. Strange that no-one else has mentioned this. NB offending BOM characters: "ï»¿". Also, int casts of the bytes at the start of the line came out at: "-17, -69, -65, 60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, ...". This might be of some help...
mike rodent about 7 years

Which byte array, by the way, comes out as "EFBBBF3C3F786D6C2076657..." using, for example, org.apache.commons.codec.binary.Hex.encodeHexString( bytes )
Joshua Taylor almost 7 years

There's a bug in this. The UTF-32LE case is unreachable. In order for (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00) to be true, then the UTF-16LE case ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) would have already matched.
Joshua Taylor almost 7 years

Since this code is from the Google Data API, I posted issue 471 about it.
Mohsen Abasi over 6 years

Not worked for me!! I don't know why but maybe my file in in locale "fa_IR"
WesternGun over 6 years

As for the comment of @KevinMeredith, I want to stress that the constructor with boolean is clearer, but the default constructor has already got rid of UTF-8 BOM, as the JavaDoc suggests: BOMInputStream(InputStream delegate) Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.
WesternGun over 6 years

I think Java is following a "lazy" pattern: it does things reactively, assuming all the data input is in good order and format. Same happens with Java way of reading keystore and build cert chain. Java: pillar in the "lazy" world, yes you are!
StackUMan about 6 years

That did not work for me, but I used .replaceFirst("\u00EF\u00BB\u00BF", "") which did.
Vivit almost 6 years

Great solution, Andrei! Thank you very much!
MxLDevs almost 6 years

Upvoted because answer provides history regarding why file input stream does not provide the option to discard BOM by default.
Admin over 5 years

The mark() method mark a position in the input to which the stream can be "reset" by calling the reset() method. It needed for future reads, after BOM skipping
shmosel over 5 years

If you're trying to mark the second index, you should call it after reading.
Bhaskar almost 4 years

Skipping solves most of my problems. If my file starts with a BOM UTF_16BE, can I create an InputReader by skipping the BOM and reading the file as UTF_8? So far it works, I want to understand if there is any edge case? Thanks in advance.
Heri over 2 years

IMO the best answer (and coding example), except that it falls back to UTF-8 if there is no BOM. See also my general answer below.