Adding UTF-8 BOM to string/Blob

javascript utf-8 blob fileapi byte-order-mark

63,573

Solution 1

Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx

See discussion between @jeff-fischer and @casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of UTF-8 or UTF-16 being used.

See p.36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page

The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
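A minimal sketch of the idea, assuming an environment where TextEncoder is available (modern browsers and recent Node.js). The helper name withBom and the sample text are hypothetical and only serve the illustration; the sketch encodes the prefixed string as UTF-8 and prints the first three bytes.

```javascript
// U+FEFF is a single character; its byte form depends on the encoding used.
const BOM = '\ufeff';

// Hypothetical helper: prepend the BOM character unless it is already there.
function withBom(text) {
  return text.startsWith(BOM) ? text : BOM + text;
}

// TextEncoder always produces UTF-8.
const bytes = new TextEncoder().encode(withBom('my data'));
const firstThree = Array.from(bytes.slice(0, 3), (b) =>
  b.toString(16).padStart(2, '0')
);

console.log(firstThree.join(' ')); // ef bb bf
```

The check in withBom only avoids adding the character twice when the text already starts with it.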

Solution 2

I had the same issue and this is the solution I came up with:

var blob = new Blob([
    new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM
    "Text",
    ... // Remaining data
], { type: "text/plain;charset=utf-8" });

Using Uint8Array prevents the browser from converting those bytes into string (tested on Chrome and Firefox).

You should replace text/plain with your desired MIME type.
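As a rough, self-contained variant of the snippet above, the following sketch wraps the byte-based approach in a function. It assumes a runtime with a global Blob (browsers, or Node.js 18 and later); the names UTF8_BOM and makeBomBlob are made up for this example.

```javascript
// The three bytes that encode U+FEFF in UTF-8.
const UTF8_BOM = new Uint8Array([0xEF, 0xBB, 0xBF]);

// Hypothetical helper: build a Blob whose first bytes are the UTF-8 BOM.
function makeBomBlob(parts, mimeType) {
  return new Blob([UTF8_BOM, ...parts], { type: mimeType });
}

const blob = makeBomBlob(['Text'], 'text/plain;charset=utf-8');

console.log(blob.size); // 7: three BOM bytes plus four bytes for "Text"
```

Because the BOM is passed as a typed array, it is copied into the Blob byte for byte, while the string parts are encoded as UTF-8.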

Solution 3

I'm editing my original answer. The above answer really demands elaboration as this is a convoluted solution by Node.js.

The short answer is, yes, this code works.

The long answer is, no, FEFF is not the byte order mark for UTF-8. Apparently Node.js took some sort of shortcut for writing encodings to files. The byte sequence FE FF is a UTF-16 byte order mark (big-endian; the little-endian form is FF FE), as can be seen in the Byte Order Mark Wikipedia article and in a binary editor after the file has been written. I've verified this is the case.

http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding

Apparently, Node.js uses \ufeff to stand for the BOM of any encoding. It takes the \ufeff marker and converts it into the correct byte order mark based on the third (options) parameter of writeFile, which is where you pass the encoding string. Node.js takes this encoding string and converts the fixed \ufeff marker into that encoding's actual byte order mark.

UTF-8 Example:

fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
   /* The actual byte order mark written to the file is EF BB BF */
});

UTF-16 Little Endian Example:

fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
   /* The actual byte order mark written to the file is FF FE */
});

So, as you can see, \ufeff is simply a marker that can result in any number of encodings. The actual bytes that make it into the file depend directly on the encoding option specified. The marker used within the string is really irrelevant to what gets written to the file.
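The byte sequences quoted in the two examples can be checked without writing a file. This is only a sketch, assuming Node.js, where Buffer is available as a global; the helper name bomBytes is hypothetical. It encodes the single character \ufeff under each encoding and prints the result as hex.

```javascript
// Hypothetical helper: hex dump of U+FEFF encoded with the given encoding.
function bomBytes(encoding) {
  return Buffer.from('\ufeff', encoding).toString('hex');
}

console.log(bomBytes('utf8'));    // efbbbf
console.log(bomBytes('utf16le')); // fffe
```

Both outputs come from encoding the same character, so the bytes differ only because the encoding differs.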

I suspect the reason for this is that they chose not to write byte order marks by default, and the 3-byte mark for UTF-8 isn't easily encoded into the JavaScript string to be written to disk. So, they used the UTF-16 BOM as a placeholder mark within the string, which gets substituted at write time.

Solution 4

This is my solution:

var blob = new Blob(["\uFEFF" + csv], {
    type: 'text/csv; charset=utf-8'
});
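To check what ends up in such a Blob, the bytes can be read back. The sketch below assumes a runtime with global Blob and TextEncoder (browsers, or Node.js 18 and later), and the CSV content is invented sample data. Blob encodes string parts as UTF-8, so the leading \uFEFF becomes the three bytes EF BB BF.

```javascript
// Invented sample data with non-ASCII characters.
const csv = 'name,city\nZoë,Malmö\n';

const blob = new Blob(['\uFEFF' + csv], { type: 'text/csv;charset=utf-8' });

// The Blob should be exactly three bytes longer than the UTF-8 encoded CSV.
const expectedSize = new TextEncoder().encode(csv).length + 3;
console.log(blob.size === expectedSize); // true

// Read the first three bytes back to confirm they are the UTF-8 BOM.
blob.arrayBuffer().then((buf) => {
  const head = new Uint8Array(buf, 0, 3);
  console.log(Array.from(head, (b) => b.toString(16)).join(' ')); // ef bb bf
});
```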


kay

Working at my alma mater, where I studied CS in order to aid the forces of light, and thwart the forces of darkness. I'm fluent in Python, C++, C, Cython, JavaScript, CSS, HTML, (and Java if I have to). At one point I knew Haskell, Pascal, Erlang, Prolog, Matlab, but forgot most about it.

Updated on July 05, 2022

Comments

kay 4 months

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?

Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿"my data"', of course.

Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).

Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?

(Yes, I really do need the UTF-8 BOM in this case.)
Jeff Fischer almost 8 years

Well, if you look at the byte order mark and what I originally said, it's right. The FEFF byte order mark is not the byte order mark for UTF-8 as you stated in your question. The original answer seems to have stumbled onto the right answer or at least didn't elaborate at all. The only reason they got it right is because the options encoding defaults to utf-8. Not because the byte order mark they supplied is actually a UTF-8 byte order mark.
Jeff Fischer almost 8 years

lol, well, someone else will want to actually know how it works. Since the original answer doesn't describe why a UTF16LE BOM magically works. Someone in the future will want to actually understand what the heck is happening.
Jeff Fischer almost 8 years

Feel free to remove your mark down of my answer. It's not wrong.
Casey over 7 years

I'm a bit confused by this since the question doesn't mention node at all.
Jeff Fischer over 7 years

Yes, I assumed that the original question was not a browser question. I assumed that they were experiencing the exact same issue that I was experiencing, within node.
Casey over 7 years

It's not really specific to Node at all; I think you're a bit confused about the byte order mark.
Casey over 7 years

Specifically, you can see here that the BOM is always the same character (U+FEFF), and not a different character depending on what type of Unicode or endianness the text is in. It's true that the bytes written are different but that's because the same character is being written with different encodings.
KyleFarris about 6 years

Dude... yes. This works perfectly. Thanks! There are so many wrong/non-working answers on other questions.
carlosrafaelgn almost 6 years

A warning for anyone else reading this: watch out, as \ufeff is actually the UTF-16 BOM and not the UTF-8 BOM en.wikipedia.org/wiki/Byte_order_mark
Erik Töyrä Silfverswärd almost 6 years

Added some more details to the accepted answer to elaborate on why this works. Feel free to edit as you see fit.
Nehal Soni over 5 years

Great solution. Thanks @erik-töyrä
menepet over 5 years

Great piece of code for the BOM encoding, and it works great! @carlosrafaelgn You are right ... I want to make a TSV file with tab separators, and the tab character for UTF-8 is \t. The same character is not working with UTF-16 BE (BOM), and I cannot find the corresponding character ... Do you know where to find it, or what the character for \t is? Thank you ... !
carlosrafaelgn over 5 years

@mEnE since \t (codepoint U+0009) is < 127, \t is 0x09 in UTF-8, just as it is in UTF-16 (0x0009). The only difference is the order the bytes are stored physically. In UTF-8 0x09. In UTF-16 LE 0x09, 0x00. In UTF-16 BE 0x00, 0x09.
Vbakke almost 5 years

Just a small clarification: The character \uFEFF is the BOM character for all UTFs (8, 16 LE and 16 BE). However, it is encoded as the bytes 0xEF 0xBB 0xBF (UTF-8), 0xFF 0xFE (UTF-16 LE) and 0xFE 0xFF (UTF-16 BE) respectively. It's important to distinguish the internal Unicode character (\ufeff) from the various ways of representing that one character in bytes. :)
duma over 4 years

Holy crap, this worked!! I used it in an HTML doc I was sending to my Kindle. THANK YOU Erik!
Timothy Zorn over 3 years

This is the correct way to do it when using Blob or working with actual bytes instead of JS strings. Erik and Jeff's answers are correct when you're using JS strings and not actual bytes.
Cardin over 3 years

Might not be specific to Node. stackoverflow.com/questions/6002256/… A few people tried it in .NET and Java, and it worked too.
no-stale-reads over 1 year

thanks a lot. i've been searching for this a while!!
Bryan Lee over 1 year

Can you explain why this works please, and is utf-18 even a valid encoding