How to remove BOM from byte array
Solution 1
All of the C# XML parsers will automatically handle the BOM for you. I'd recommend using XDocument - in my opinion it provides the cleanest abstraction of XML data.
Using XDocument as an example:
using (var stream = new MemoryStream(bytes))
{
    var document = XDocument.Load(stream);
    ...
}
Once you have an XDocument you can then use it to emit the bytes without the BOM:
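// The settings returned by writer.Settings are read-only, so pass them to Create instead.
var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
byte[] bytesWithoutBOM;
using (var stream = new MemoryStream())
{
    using (var writer = XmlWriter.Create(stream, settings))
    {
        document.WriteTo(writer);
    } // disposing the writer flushes everything into the stream
    bytesWithoutBOM = stream.ToArray();
}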
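Putting the load and write steps together, here is a minimal sketch of a single helper that takes the byte array and returns BOM-free bytes. The names XmlBomRemover and ToUtf8WithoutBom are made up for illustration; they are not part of any library.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class XmlBomRemover
{
    // Hypothetical helper: parses the XML (with or without a BOM) and
    // re-serialises it as UTF-8 without a BOM.
    public static byte[] ToUtf8WithoutBom(byte[] xmlBytes)
    {
        XDocument document;
        using (var input = new MemoryStream(xmlBytes))
        {
            // The parser detects and consumes any BOM while loading.
            document = XDocument.Load(input);
        }

        // UTF8Encoding(false) means "do not emit the UTF-8 identifier (BOM)".
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
        using (var output = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(output, settings))
            {
                document.WriteTo(writer);
            } // disposing the writer flushes it into the stream
            return output.ToArray();
        }
    }
}
```

Bear in mind that this re-serialises the document, so the result is always UTF-8 and details such as insignificant whitespace or the XML declaration may differ from the input bytes.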
Solution 2
You could do something like this to skip the BOM bytes while reading from a stream. You would need to extend Bom.cs to cover further encodings; as far as I know the UTF encodings are the only ones that use a BOM, but I could well be wrong about that.
I got the info on the encoding types from here
using (var stream = File.OpenRead("path_to_file"))
{
    stream.Position = Bom.GetCursor(stream);
}
// Bom.cs
using System.IO;
using System.Linq;

public static class Bom
{
    // Returns the length of the BOM at the start of the stream, or 0 if there is none.
    public static int GetCursor(Stream stream)
    {
        // UTF-32, big-endian
        if (IsMatch(stream, new byte[] { 0x00, 0x00, 0xFE, 0xFF }))
            return 4;
        // UTF-32, little-endian (must be checked before UTF-16 LE, which shares the FF FE prefix)
        if (IsMatch(stream, new byte[] { 0xFF, 0xFE, 0x00, 0x00 }))
            return 4;
        // UTF-16, big-endian
        if (IsMatch(stream, new byte[] { 0xFE, 0xFF }))
            return 2;
        // UTF-16, little-endian
        if (IsMatch(stream, new byte[] { 0xFF, 0xFE }))
            return 2;
        // UTF-8
        if (IsMatch(stream, new byte[] { 0xEF, 0xBB, 0xBF }))
            return 3;
        return 0;
    }

    private static bool IsMatch(Stream stream, byte[] match)
    {
        stream.Position = 0;
        var buffer = new byte[match.Length];
        var read = stream.Read(buffer, 0, buffer.Length);
        // Require a full read so that a short stream cannot match against zero padding.
        return read == match.Length && buffer.SequenceEqual(match);
    }
}
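Since the question is about a byte[] rather than a stream, the same idea can be sketched directly on the array. This version takes the BOM byte sequences from Encoding.GetPreamble() instead of hard-coding them; BomStripper and Strip are names I made up, and it only knows about the UTF-8, UTF-16 and UTF-32 BOMs.

```csharp
using System.Linq;
using System.Text;

public static class BomStripper
{
    // Longest preambles first, so UTF-32 LE (FF FE 00 00) is tried
    // before UTF-16 LE (FF FE), which is a prefix of it.
    private static readonly byte[][] Preambles =
    {
        new UTF32Encoding(true, true).GetPreamble(),   // UTF-32, big-endian
        new UTF32Encoding(false, true).GetPreamble(),  // UTF-32, little-endian
        new UTF8Encoding(true).GetPreamble(),          // UTF-8
        new UnicodeEncoding(true, true).GetPreamble(), // UTF-16, big-endian
        new UnicodeEncoding(false, true).GetPreamble() // UTF-16, little-endian
    };

    // Hypothetical helper: returns a copy of the bytes without a leading BOM,
    // or the original array if no known BOM is present.
    public static byte[] Strip(byte[] bytes)
    {
        foreach (var preamble in Preambles)
        {
            if (bytes.Length >= preamble.Length &&
                bytes.Take(preamble.Length).SequenceEqual(preamble))
            {
                return bytes.Skip(preamble.Length).ToArray();
            }
        }
        return bytes;
    }
}
```

Usage would be something like var clean = BomStripper.Strip(byteArray);. Unlike the XDocument approach, this leaves the rest of the bytes and their encoding untouched.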
Solution 3
You don't have to worry about BOM.
If for some reason you need to use an XmlDocument object, maybe this code can help you:
byte[] file_content = {wherever you get it};
XmlDocument xml = new XmlDocument();
xml.Load(new MemoryStream(file_content));
It worked for me when I tried to download an XML attachment from a Gmail account using the Google API. The file had a BOM, and using Encoding.UTF8.GetString(file_content) didn't work "properly".
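A likely reason GetString did not work "properly" (my assumption, the answer does not say) is that it decodes the BOM bytes into a U+FEFF character at the start of the string instead of dropping them. If you need a string rather than a parsed document, a StreamReader detects and skips the BOM for you. A small sketch, with BomAwareDecoder and Decode as made-up names:

```csharp
using System.IO;
using System.Text;

public static class BomAwareDecoder
{
    // Hypothetical helper: decodes the bytes to a string, letting StreamReader
    // detect the encoding from the BOM (falling back to UTF-8) and skip it.
    public static string Decode(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For comparison, Encoding.UTF8.GetString on the same bytes returns a string whose first character is '\uFEFF', which XmlDocument.LoadXml typically rejects.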
Ravi Gupta
Updated on June 20, 2022
Comments
-
Ravi Gupta almost 2 years
I have XML data in byte[] byteArray which may or may not contain a BOM. Is there any standard way in C# to remove the BOM from it? If not, what is the best way to do so that handles all cases, including all types of encoding?
Actually, I am fixing a bug in the code and I don't want to change much of the code, so it would be better if someone could give me the code to remove the BOM.
I know that I could find 60, which is the ASCII value of '<', and ignore the bytes before it, but I don't want to do that.
-
Ravi Gupta about 11 years
Actually, I want to remove the BOM only and don't care about parsing and all that. I have updated the question as well.
-
Rich O'Kelly about 11 years
@RaviGupta I see, do you know the encoding?
-
Ravi Gupta about 11 years
It would be better if the logic were encoding-independent.
-
Rich O'Kelly about 11 years
@RaviGupta Answer updated. There may be a more efficient way, perhaps by looking at the internals of XmlReader to see how it detects the BOM, but what I have written above should work fine.
-
Ravi Gupta about 11 years
Can we do it for all encodings? For example, instead of doing
writer.Settings.Encoding = new UTF8Encoding(false);
can we do
writer.Settings.Encoding = new Encoding ....
or something like that?
-
Rich O'Kelly about 11 years
@RaviGupta The above code will 'normalise' the encoding to UTF8. An encoding must be specified when writing out the bytes. You can choose a different one; UTF8 was chosen arbitrarily.