Byte order mark screws up file reading in Java

Solution 1

EDIT: I've made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream


Here is a class I wrote a while ago; I only changed the package name before pasting it. It is nothing special and is quite similar to the solutions posted in Sun's bug database. Add it to your code and you should be fine.

/* ____________________________________________________________________________
 * 
 * File:    UnicodeBOMInputStream.java
 * Author:  Gregory Pakosz.
 * Date:    02 - November - 2005    
 * ____________________________________________________________________________
 */
package com.stackoverflow.answer;

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/**
 * The <code>UnicodeBOMInputStream</code> class wraps any
 * <code>InputStream</code> and detects the presence of any Unicode BOM
 * (Byte Order Mark) at its beginning, as defined by
 * <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
 * 
 * <p>The
 * <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
 * defines 5 types of BOMs:<ul>
 * <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
 * <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
 * <li><pre>FE FF        = UTF-16, big-endian</pre></li>
 * <li><pre>FF FE        = UTF-16, little-endian</pre></li>
 * <li><pre>EF BB BF     = UTF-8</pre></li>
 * </ul></p>
 * 
 * <p>Use the {@link #getBOM()} method to know whether a BOM has been detected
 * or not.
 * </p>
 * <p>Use the {@link #skipBOM()} method to remove the detected BOM from the
 * wrapped <code>InputStream</code> object.</p>
 */
public class UnicodeBOMInputStream extends InputStream
{
  /**
   * Type safe enumeration class that describes the different types of Unicode
   * BOMs.
   */
  public static final class BOM
  {
    /**
     * NONE.
     */
    public static final BOM NONE = new BOM(new byte[]{},"NONE");

    /**
     * UTF-8 BOM (EF BB BF).
     */
    public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
                                                       (byte)0xBB,
                                                       (byte)0xBF},
                                            "UTF-8");

    /**
     * UTF-16, little-endian (FF FE).
     */
    public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE},
                                                "UTF-16 little-endian");

    /**
     * UTF-16, big-endian (FE FF).
     */
    public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-16 big-endian");

    /**
     * UTF-32, little-endian (FF FE 00 00).
     */
    public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE,
                                                            (byte)0x00,
                                                            (byte)0x00},
                                                "UTF-32 little-endian");

    /**
     * UTF-32, big-endian (00 00 FE FF).
     */
    public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
                                                            (byte)0x00,
                                                            (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-32 big-endian");

    /**
     * Returns a <code>String</code> representation of this <code>BOM</code>
     * value.
     */
    public final String toString()
    {
      return description;
    }

    /**
     * Returns the bytes corresponding to this <code>BOM</code> value.
     */
    public final byte[] getBytes()
    {
      final int     length = bytes.length;
      final byte[]  result = new byte[length];

      // Make a defensive copy
      System.arraycopy(bytes,0,result,0,length);

      return result;
    }

    private BOM(final byte bom[], final String description)
    {
      assert(bom != null)               : "invalid BOM: null is not allowed";
      assert(description != null)       : "invalid description: null is not allowed";
      assert(description.length() != 0) : "invalid description: empty string is not allowed";

      this.bytes          = bom;
      this.description  = description;
    }

            final byte    bytes[];
    private final String  description;

  } // BOM

  /**
   * Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
   * specified <code>InputStream</code>.
   * 
   * @param inputStream an <code>InputStream</code>.
   * 
   * @throws NullPointerException when <code>inputStream</code> is
   * <code>null</code>.
   * @throws IOException on reading from the specified <code>InputStream</code>
   * when trying to detect the Unicode BOM.
   */
  public UnicodeBOMInputStream(final InputStream inputStream) throws  NullPointerException,
                                                                      IOException

  {
    if (inputStream == null)
      throw new NullPointerException("invalid input stream: null is not allowed");

    in = new PushbackInputStream(inputStream,4);

    final byte  bom[] = new byte[4];
    final int   read  = in.read(bom);

    switch(read)
    {
      case 4:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE) &&
            (bom[2] == (byte)0x00) &&
            (bom[3] == (byte)0x00))
        {
          this.bom = BOM.UTF_32_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0x00) &&
            (bom[1] == (byte)0x00) &&
            (bom[2] == (byte)0xFE) &&
            (bom[3] == (byte)0xFF))
        {
          this.bom = BOM.UTF_32_BE;
          break;
        }

      case 3:
        if ((bom[0] == (byte)0xEF) &&
            (bom[1] == (byte)0xBB) &&
            (bom[2] == (byte)0xBF))
        {
          this.bom = BOM.UTF_8;
          break;
        }

      case 2:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE))
        {
          this.bom = BOM.UTF_16_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0xFE) &&
            (bom[1] == (byte)0xFF))
        {
          this.bom = BOM.UTF_16_BE;
          break;
        }

      default:
        this.bom = BOM.NONE;
        break;
    }

    if (read > 0)
      in.unread(bom,0,read);
  }

  /**
   * Returns the <code>BOM</code> that was detected in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return a <code>BOM</code> value.
   */
  public final BOM getBOM()
  {
    // BOM type is immutable.
    return bom;
  }

  /**
   * Skips the <code>BOM</code> that was found in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return this <code>UnicodeBOMInputStream</code>.
   * 
   * @throws IOException when trying to skip the BOM from the wrapped
   * <code>InputStream</code> object.
   */
  public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
  {
    if (!skipped)
    {
      in.skip(bom.bytes.length);
      skipped = true;
    }
    return this;
  }

  /**
   * {@inheritDoc}
   */
  public int read() throws IOException
  {
    return in.read();
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[]) throws  IOException,
                                          NullPointerException
  {
    return in.read(b,0,b.length);
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[],
                  final int off,
                  final int len) throws IOException,
                                        NullPointerException
  {
    return in.read(b,off,len);
  }

  /**
   * {@inheritDoc}
   */
  public long skip(final long n) throws IOException
  {
    return in.skip(n);
  }

  /**
   * {@inheritDoc}
   */
  public int available() throws IOException
  {
    return in.available();
  }

  /**
   * {@inheritDoc}
   */
  public void close() throws IOException
  {
    in.close();
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void mark(final int readlimit)
  {
    in.mark(readlimit);
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void reset() throws IOException
  {
    in.reset();
  }

  /**
   * {@inheritDoc}
   */
  public boolean markSupported() 
  {
    return in.markSupported();
  }

  private final PushbackInputStream in;
  private final BOM                 bom;
  private       boolean             skipped = false;

} // UnicodeBOMInputStream

You can use it like this:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public final class UnicodeBOMInputStreamUsage
{
  public static void main(final String[] args) throws Exception
  {
    FileInputStream fis = new FileInputStream("test/offending_bom.txt");
    UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);

    System.out.println("detected BOM: " + ubis.getBOM());

    System.out.print("Reading the content of the file without skipping the BOM: ");
    InputStreamReader isr = new InputStreamReader(ubis);
    BufferedReader br = new BufferedReader(isr);

    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();

    fis = new FileInputStream("test/offending_bom.txt");
    ubis = new UnicodeBOMInputStream(fis);
    isr = new InputStreamReader(ubis);
    br = new BufferedReader(isr);

    ubis.skipBOM();

    System.out.print("Reading the content of the file after skipping the BOM: ");
    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();
  }

} // UnicodeBOMInputStreamUsage
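
If only the UTF-8 BOM matters, the same pushback idea fits in a few lines of plain JDK code. The following is a minimal sketch under that assumption, not part of the class above; the names `Utf8BomSkipper` and `skipUtf8Bom` are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class Utf8BomSkipper {

    /**
     * Returns a stream positioned after a leading UTF-8 BOM (EF BB BF),
     * or at the very first byte when no BOM is present.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // A single read() may return fewer bytes than requested, so loop
        // until three bytes are read or the stream ends.
        int n = 0;
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        boolean hasBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;

        // Not a BOM: push everything back so the caller sees the original bytes.
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(data));
        System.out.println((char) in.read()); // prints h
    }
}
```

The returned stream can be handed straight to an InputStreamReader opened with UTF-8.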

Solution 2

The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (javadoc):

BOMInputStream bomIn = new BOMInputStream(in);
int firstNonBOMByte = bomIn.read(); // Skips BOM
if (bomIn.hasBOM()) {
    // has a UTF-8 BOM
}

If you also need to detect other encodings, it can distinguish among various byte-order marks, e.g. UTF-8 vs. UTF-16 big- and little-endian; see the javadoc linked above for details. You can then use the detected ByteOrderMark to choose a Charset to decode the stream. (There's probably a more streamlined way to do this if you need all of this functionality, maybe the UnicodeReader in BalusC's answer?) Note that, in general, there is no reliable way to detect which encoding some bytes are in, but a BOM at the start of the stream is a strong hint.
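
To illustrate the idea of choosing a charset from a detected BOM, here is a rough JDK-only sketch. It does not use or reproduce the Commons IO API; `BomCharsets`, `Detected` and `detect` are hypothetical names, and the fallback charset is whatever the caller assumes for BOM-less input.

```java
final class BomCharsets {

    /** Result of sniffing: the charset name to decode with and the BOM length to skip. */
    static final class Detected {
        final String charsetName;
        final int bomLength;

        Detected(String charsetName, int bomLength) {
            this.charsetName = charsetName;
            this.bomLength = bomLength;
        }
    }

    /** Inspects the first bytes of a file; returns the fallback when no BOM is found. */
    static Detected detect(byte[] head, String fallback) {
        // UTF-32 is checked before UTF-16 because FF FE 00 00 starts with FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return new Detected("UTF-32BE", 4);
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return new Detected("UTF-32LE", 4);
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return new Detected("UTF-8", 3);
        if (startsWith(head, 0xFE, 0xFF)) return new Detected("UTF-16BE", 2);
        if (startsWith(head, 0xFF, 0xFE)) return new Detected("UTF-16LE", 2);
        return new Detected(fallback, 0);
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // "hi" in UTF-16LE, preceded by the UTF-16LE BOM.
        byte[] file = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        Detected d = detect(file, "UTF-8");
        String text = new String(file, d.bomLength, file.length - d.bomLength, d.charsetName);
        System.out.println(d.charsetName + ": " + text); // prints UTF-16LE: hi
    }
}
```

One known ambiguity remains: a UTF-16LE file whose first character is U+0000 also starts with FF FE 00 00, and this sketch would report it as UTF-32LE.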

Edit: If you need to detect the BOM in UTF-16, UTF-32, etc, then the constructor should be:

new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
        ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)

Upvote @martin-charlesworth's comment :)

Solution 3

A simpler solution:

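import java.io.IOException;
import java.io.Reader;

public class BOMSkipper
{
    // The reader must support mark()/reset(), e.g. a BufferedReader.
    public static void skip(Reader reader) throws IOException
    {
        reader.mark(1);
        char[] possibleBOM = new char[1];
        int read = reader.read(possibleBOM);

        // Rewind unless exactly one character was read and it is the BOM (U+FEFF).
        if (read != 1 || possibleBOM[0] != '\ufeff')
        {
            reader.reset();
        }
    }
}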

Usage sample:

BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream(file), fileExpectedCharset));
BOMSkipper.skip(input);
//Now UTF prefix not present:
input.readLine();
...

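It works with all 5 UTF encodings, because the reader has already decoded the bytes: whatever the BOM's byte representation, it arrives as the single character U+FEFF, provided the reader was opened with the file's actual charset.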
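
The character-level behaviour can be checked with a small JDK-only sketch. The class name `BomAsCharDemo` and its helper methods are made up for illustration. It uses the explicit-endianness charsets; Java's plain "UTF-16" charset consumes the BOM itself while decoding, so in that case there is nothing left to skip.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomAsCharDemo {

    /** Consumes one leading U+FEFF if present; otherwise rewinds the reader. */
    static void skipBom(Reader reader) throws IOException {
        reader.mark(1);
        if (reader.read() != '\uFEFF') {
            reader.reset();
        }
    }

    /** Decodes the bytes with the given charset and returns the first line without a BOM. */
    static String firstLine(byte[] bytes, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            skipBom(reader);
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // The same text "hi" with three different BOM byte sequences.
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] utf16be = {(byte) 0xFE, (byte) 0xFF, 0, 'h', 0, 'i'};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};

        System.out.println(firstLine(utf8, StandardCharsets.UTF_8));       // prints hi
        System.out.println(firstLine(utf16be, StandardCharsets.UTF_16BE)); // prints hi
        System.out.println(firstLine(utf16le, StandardCharsets.UTF_16LE)); // prints hi
    }
}
```

If the reader is opened with the wrong charset, the BOM bytes decode to other characters and the check silently does nothing.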

Solution 4

The Google Data API has a UnicodeReader which automagically detects the encoding.

You can use it instead of InputStreamReader. Here's a slightly condensed extract of its source, which is pretty straightforward:

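import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {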
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte bom[] = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks. The UTF-32 checks come
        // first because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE
        // BOM (FF FE) and would otherwise never be reached.
        if ((n >= 4) && (bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((n >= 4) && (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            encoding = defaultEncoding;
            unread = n;
        }

        // Unread bytes if necessary and skip BOM marks.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        } else if (unread < -1) {
            pushbackStream.unread(bom, 0, 0);
        }

        // Use given encoding.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}

Solution 5

The Apache Commons IO library's BOMInputStream has already been mentioned by @rescdsk, but I did not see any mention of how to get an InputStream without the BOM.

Here's how I did it in Scala.

import java.io._
import org.apache.commons.io.input.BOMInputStream

val file = new File(path_to_xml_file_with_BOM)
val fileInpStream = new FileInputStream(file)
// false means don't include the BOM in the stream
val bomIn = new BOMInputStream(fileInpStream, false)

Author: Tom

Updated on July 08, 2022

Comments

  • Tom
    Tom almost 2 years

    I'm trying to read CSV files using Java. Some of the files may have a byte order mark in the beginning, but not all. When present, the byte order gets read along with the rest of the first line, thus causing problems with string compares.

    Is there an easy way to skip the byte order mark when it is present?

  • Gregory Pakosz
    Gregory Pakosz over 14 years
    Sorry for the long scrolling areas, too bad there is no attachment feature
  • Tom
    Tom over 14 years
    Thanks Gregory, that's just what I'm looking for.
  • Gregory Pakosz
    Gregory Pakosz over 14 years
    You're welcome. I remember I discovered this problem after editing XML configuration files with the most widespread XML editor in the world: Notepad.exe which inserts a BOM when saving back a file that contains Unicode characters :)
  • codeporn
    codeporn about 13 years
    Great, your answer helped me a lot! +1
  • xtofl
    xtofl almost 13 years
    Great decorator! It may be a good idea to delegate the BOM recognition to the BOM class, too, though. Chain of responsibility, someone?
  • Gregory Pakosz
    Gregory Pakosz over 12 years
    yeah well the great chain of design patterns... ;)
  • Alvin
    Alvin about 12 years
    Gregory, thanks for your solution. I'm going to use it in one of my projects :)
  • atamanroman
    atamanroman almost 12 years
    Just skips the BOM. Should be the perfect solution for 99% of the use cases.
  • Denys Kniazhev-Support Ukraine
    Denys Kniazhev-Support Ukraine over 11 years
    This should be in core Java API
  • Varun Bhatia
    Varun Bhatia about 11 years
    Why not add javaCharset key as a member in UnicodeBOMInputStream whose value can be used to read file accordingly in InputStreamReader isr = new InputStreamReader(ubis, ubis.getCharsetKey()) where getCharsetKey return the Java.charset values as per the BOM found.
  • Kevin Meredith
    Kevin Meredith over 10 years
    I used this answer successfully. However, I would respectfully add the boolean arg for specifying whether to include or exclude the BOM. Example: BOMInputStream bomIn = new BOMInputStream(in, false); // don't include the BOM
  • Admin
    Admin over 10 years
    Thanks! Saved my day!
  • Vahid Pazirandeh
    Vahid Pazirandeh almost 10 years
    Very nice Andrei. But could you explain why it works? How does the pattern 0xFEFF successfully match UTF-8 files which seem to have a different pattern and 3 bytes instead of 2? And how can that pattern match both endians of UTF16 and UTF32?
  • Admin
    Admin almost 10 years
    As you can see - I don't use byte stream but character stream opened with expected charset. So if the first character from this stream is BOM - I skip it. BOM can have different byte representation for each encoding, but this is one character. Please read this article, it helps me: joelonsoftware.com/articles/Unicode.html
  • Snow
    Snow almost 10 years
    Nice solution, just make sure to check if file is not empty to avoid IOException in skip method before reading. You may do that by calling if (reader.ready()){ reader.read(possibleBOM) ... }
  • Martin Charlesworth
    Martin Charlesworth almost 10 years
    I would also add that this only detects UTF-8 BOM. If you want to detect all the utf-X BOMs then you need to pass them in to the BOMInputStream constructor. BOMInputStream bomIn = new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);
  • Gregory Pakosz
    Gregory Pakosz almost 9 years
    10 years have passed and I'm still receiving karma for this :D I'm looking at you Java!
  • SOUser
    SOUser almost 8 years
    It seems that the link says Google Data API is deprecated ? Where should one look for the Google Data API now ?
  • bvdb
    bvdb almost 8 years
    I see you have covered 0xFE 0xFF, which is the byte order mark for UTF-16BE. But what if the first 3 bytes are 0xEF 0xBB 0xBF (the byte order mark for UTF-8)? You claim that this works for all UTF formats. That could be true (I haven't tested your code), but then how does it work?
  • BalusC
    BalusC almost 8 years
    @XichenLi: The GData API has been deprecated for its intended purpose. I didn't intend to suggest using the GData API directly (the OP isn't using any GData service); I intended its source code to serve as an example for your own implementation. That's also why I included it in my answer, ready for copy-paste.
  • Vladimir Vagaytsev
    Vladimir Vagaytsev almost 8 years
    Single arg constructor does it: public BOMInputStream(InputStream delegate) { this(delegate, false, ByteOrderMark.UTF_8); }. It excludes UTF-8 BOM by default.
  • Kevin Meredith
    Kevin Meredith almost 8 years
    Good point, Vladimir. I see that in its docs - commons.apache.org/proper/commons-io/javadocs/api-2.2/org/…: Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.
  • Admin
    Admin almost 8 years
    See my answer to Vahid: I open not the byte stream but character stream and read one character from it. Never mind what utf encoding used for file - bom prefix can represented by different count of bytes, but in terms of characters it's just one character
  • mike rodent
    mike rodent about 7 years
    Using your code I managed to solve this BOM problem... but strangely your code as written didn't work for me: int read = in.read( bom ) in fact returned 4 for me, not 3, so everything went wrong, despite the fact that this was a UTF-8 BOM. I followed up with in.skip( 3 )... and was then able to SAX parse my file. Strange that no-one else has mentioned this. NB offending BOM characters: "". Also, int casts of the bytes at the start of the line came out at: "-17, -69, -65, 60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, ...". This might be of some help...
  • mike rodent
    mike rodent about 7 years
    Which byte array, by the way, comes out as "EFBBBF3C3F786D6C2076657..." using, for example, org.apache.commons.codec.binary.Hex.encodeHexString( bytes )
  • Joshua Taylor
    Joshua Taylor almost 7 years
    There's a bug in this. The UTF-32LE case is unreachable. In order for (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00) to be true, then the UTF-16LE case ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) would have already matched.
  • Joshua Taylor
    Joshua Taylor almost 7 years
    Since this code is from the Google Data API, I posted issue 471 about it.
  • Mohsen Abasi
    Mohsen Abasi over 6 years
    Not worked for me!! I don't know why but maybe my file in in locale "fa_IR"
  • WesternGun
    WesternGun over 6 years
    As for the comment of @KevinMeredith, I want to stress that the constructor with boolean is clearer, but the default constructor has already got rid of UTF-8 BOM, as the JavaDoc suggests: BOMInputStream(InputStream delegate) Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.
  • WesternGun
    WesternGun over 6 years
    I think Java is following a "lazy" pattern: it does things reactively, assuming all the data input is in good order and format. Same happens with Java way of reading keystore and build cert chain. Java: pillar in the "lazy" world, yes you are!
  • StackUMan
    StackUMan about 6 years
    That did not work for me, but I used .replaceFirst("\u00EF\u00BB\u00BF", "") which did.
  • Vivit
    Vivit almost 6 years
    Great solution, Andrei! Thank you very much!
  • MxLDevs
    MxLDevs almost 6 years
    Upvoted because answer provides history regarding why file input stream does not provide the option to discard BOM by default.
  • Admin
    Admin over 5 years
    The mark() method marks a position in the input to which the stream can be "reset" by calling the reset() method. It is needed so that later reads start from the right place when there is no BOM to skip.
  • shmosel
    shmosel over 5 years
    If you're trying to mark the second index, you should call it after reading.
  • Bhaskar
    Bhaskar almost 4 years
    Skipping solves most of my problems. If my file starts with a BOM UTF_16BE, can I create an InputReader by skipping the BOM and reading the file as UTF_8? So far it works, I want to understand if there is any edge case? Thanks in advance.
  • Heri
    Heri over 2 years
    IMO the best answer (and coding example), except that it falls back to UTF-8 if there is no BOM. See also my general answer below.