How to remove BOM from byte array

11,872

Solution 1

All of the C# XML parsers will automatically handle the BOM for you. I'd recommend using XDocument - in my opinion it provides the cleanest abstraction of XML data.

Using XDocument as an example:

using (var stream = new MemoryStream(bytes))
{
  var document = XDocument.Load(stream);
  ...
}

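Once you have an XDocument you can then use it to emit the bytes without the BOM:

// The settings of an existing XmlWriter are read-only, so the encoding
// has to be set before the writer is created.
var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
using (var stream = new MemoryStream())
{
  using (var writer = XmlWriter.Create(stream, settings))
  {
    document.WriteTo(writer);
  } // disposing the writer flushes any buffered output into the stream
  var bytesWithoutBOM = stream.ToArray();
}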

Solution 2

You could do something like this to skip the BOM bytes while reading from a stream. You would need to extend Bom.cs to cover further encodings; as far as I know the UTF family (UTF-8, UTF-16, UTF-32) is the only one that uses a BOM, though I could well be wrong about that.

I got the info on the encoding types from here

using (var stream = File.OpenRead("path_to_file"))
{
    stream.Position = Bom.GetCursor(stream);
}


// Requires: using System.IO; using System.Linq;
public static class Bom
{
    // Returns the number of leading bytes taken up by a BOM (0 if there is none).
    public static int GetCursor(Stream stream)
    {
        // UTF-32 is tested before UTF-16 because FF FE 00 00 also starts with FF FE.
        // UTF-32, big-endian
        if (IsMatch(stream, new byte[] { 0x00, 0x00, 0xFE, 0xFF }))
            return 4;
        // UTF-32, little-endian
        if (IsMatch(stream, new byte[] { 0xFF, 0xFE, 0x00, 0x00 }))
            return 4;
        // UTF-16, big-endian
        if (IsMatch(stream, new byte[] { 0xFE, 0xFF }))
            return 2;
        // UTF-16, little-endian
        if (IsMatch(stream, new byte[] { 0xFF, 0xFE }))
            return 2;
        // UTF-8
        if (IsMatch(stream, new byte[] { 0xEF, 0xBB, 0xBF }))
            return 3;
        return 0;
    }

    private static bool IsMatch(Stream stream, byte[] match)
    {
        stream.Position = 0;
        var buffer = new byte[match.Length];
        // A short read means the stream is shorter than this BOM, so it cannot match.
        if (stream.Read(buffer, 0, buffer.Length) != match.Length)
            return false;
        return buffer.SequenceEqual(match);
    }
}
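
Since the question starts from a byte[] rather than a stream, the same signature table can be applied to the array directly. The following is only a sketch under that assumption: the class name BomStripper and the method Strip are made up for illustration and are not part of the answer above.

```csharp
using System.Linq;

// Hypothetical helper (name invented for this sketch).
public static class BomStripper
{
    // Known BOM signatures. The 4-byte marks come first so that the
    // UTF-32 LE mark (FF FE 00 00) is not mistaken for UTF-16 LE (FF FE).
    private static readonly byte[][] Signatures =
    {
        new byte[] { 0x00, 0x00, 0xFE, 0xFF }, // UTF-32, big-endian
        new byte[] { 0xFF, 0xFE, 0x00, 0x00 }, // UTF-32, little-endian
        new byte[] { 0xEF, 0xBB, 0xBF },       // UTF-8
        new byte[] { 0xFE, 0xFF },             // UTF-16, big-endian
        new byte[] { 0xFF, 0xFE },             // UTF-16, little-endian
    };

    // Returns a copy of the input without its leading BOM, or the
    // input array itself when no known BOM is present.
    public static byte[] Strip(byte[] bytes)
    {
        foreach (var signature in Signatures)
        {
            if (bytes.Length >= signature.Length &&
                bytes.Take(signature.Length).SequenceEqual(signature))
            {
                return bytes.Skip(signature.Length).ToArray();
            }
        }
        return bytes;
    }
}
```

It would be called as var clean = BomStripper.Strip(byteArray);. One caveat: UTF-16 little-endian data whose first character is U+0000 looks the same as a UTF-32 little-endian BOM, which should not come up for well-formed XML.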

Solution 3

You don't have to worry about BOM.

If for some reason you need to use an XmlDocument object, maybe this code can help you:

byte[] file_content = {wherever you get it};
XmlDocument xml = new XmlDocument();
xml.Load(new MemoryStream(file_content));

It worked for me when I tried to download an XML attachment from a Gmail account using the Google API: the file had a BOM, and using Encoding.UTF8.GetString(file_content) didn't work "properly".
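
A likely reason GetString misbehaves here is that it decodes the BOM bytes like any other data, so the resulting string starts with the character U+FEFF, which string-based XML parsing may then reject. The sketch below contrasts that with a StreamReader, which consumes the BOM. The class and method names are invented for illustration.

```csharp
using System.IO;
using System.Text;

// Hypothetical demo class (names invented for this sketch).
public static class BomDecodingDemo
{
    // Encoding.UTF8.GetString keeps a leading BOM as the character U+FEFF.
    public static string DecodeRaw(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // A StreamReader recognises the BOM and does not return it as text.
    public static string DecodeWithReader(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For the bytes EF BB BF 3C 61 2F 3E, DecodeRaw returns a five-character string beginning with U+FEFF, while DecodeWithReader returns the four characters <a/>.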

Author by

Ravi Gupta

Updated on June 20, 2022

Comments

  • Ravi Gupta almost 2 years

    I have XML data in byte[] byteArray which may or may not contain a BOM. Is there any standard way in C# to remove the BOM from it? If not, what is the best way to do it that handles all cases, including all types of encoding?

    Actually, I am fixing a bug in the code and I don't want to change much of it, so it would be better if someone could give me the code to remove the BOM.

    I know I could look for byte 60, which is the ASCII value of '<', and ignore the bytes before it, but I don't want to do that.

  • Ravi Gupta about 11 years
    Actually I only want to remove the BOM and don't care about parsing and all that. I have updated the question as well.
  • Rich O'Kelly about 11 years
    @RaviGupta I see, do you know the encoding?
  • Ravi Gupta about 11 years
    It would be better if the logic were encoding-independent.
  • Rich O'Kelly about 11 years
    @RaviGupta Answer updated. There may be a more efficient way, perhaps by looking at the internals of XmlReader to see how it detects the BOM, but what I have written above should work fine.
  • Ravi Gupta about 11 years
    Can we do it for all encodings? Instead of writer.Settings.Encoding = new UTF8Encoding(false); can we do writer.Settings.Encoding = new Encoding .... or something like that?
  • Rich O'Kelly about 11 years
    @RaviGupta The above code will 'normalise' the encoding to UTF-8. An encoding must be specified when writing out the bytes; you can choose a different one, UTF-8 was chosen arbitrarily.