Adding UTF-8 BOM to string/Blob
Solution 1
Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx
See the discussion between @jeff-fischer and @casey for details on UTF-8, UTF-16 and the BOM. What actually makes the above work is that the character U+FEFF (written \ufeff in a JavaScript string) always represents the BOM, regardless of whether UTF-8 or UTF-16 is used; only the bytes it is encoded to differ.
See p.36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page:
The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
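One way to see why prepending \ufeff yields the UTF-8 signature is to encode that single character and look at the resulting bytes. The following is only a sketch: it assumes TextEncoder is available as a global (modern browsers and recent Node.js), and toHex is a hypothetical helper introduced for illustration.

```javascript
// Hypothetical helper: format bytes as space-separated uppercase hex.
function toHex(bytes) {
  return Array.from(bytes, function (b) {
    return b.toString(16).toUpperCase().padStart(2, '0');
  }).join(' ');
}

// TextEncoder always produces UTF-8, so this is the UTF-8 form of U+FEFF.
var bomHex = toHex(new TextEncoder().encode('\ufeff'));
console.log(bomHex); // EF BB BF

// Prepending the character therefore puts those three bytes first.
var encoded = new TextEncoder().encode('\ufeff' + 'Text');
console.log(toHex(encoded)); // EF BB BF 54 65 78 74
```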
Solution 2
I had the same issue and this is the solution I came up with:
var blob = new Blob([
    new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM
    "Text",
    ... // Remaining data
  ],
  { type: "text/plain;charset=utf-8" });
Using Uint8Array prevents the browser from converting those bytes into a string (tested on Chrome and Firefox).
You should replace text/plain with your desired MIME type.
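As a rough check, the byte-array approach can be compared with prepending the \uFEFF character from Solution 1 by looking at the resulting Blob sizes. This sketch assumes a global Blob (browsers, or Node.js 18 and later); the variable names are illustrative only.

```javascript
// Assumes a global Blob (browsers, or Node.js 18 and later).
var fromBytes = new Blob(
  [new Uint8Array([0xEF, 0xBB, 0xBF]), 'Text'],
  { type: 'text/plain;charset=utf-8' }
);
var fromChar = new Blob(['\uFEFF' + 'Text'], { type: 'text/plain;charset=utf-8' });

// Both are 3 BOM bytes plus 4 bytes for "Text".
console.log(fromBytes.size); // 7
console.log(fromChar.size);  // 7
```

To inspect the actual bytes rather than just the size, blob.arrayBuffer() returns a promise for the contents.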
Solution 3
I'm editing my original answer. The above answer really demands elaboration as this is a convoluted solution by Node.js.
The short answer is, yes, this code works.
The long answer is, no, the byte sequence FE FF is not the byte order mark for UTF-8. Apparently Node took some sort of shortcut for writing encodings within files. FE FF is the UTF-16 big-endian byte order mark (UTF-16 little-endian is FF FE), as can be seen in the Byte Order Mark Wikipedia article, and the bytes actually written can be viewed in a binary editor after writing the file. I've verified this is the case.
http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
Apparently, Node.js uses \ufeff to signify the BOM for any number of encodings. It takes the \ufeff marker and converts it into the correct byte order mark based on the third (options) parameter of writeFile, which is where you pass the encoding string. Node.js takes this encoding string and converts the fixed \ufeff marker into that encoding's actual byte order mark.
UTF-8 Example:
var fs = require('fs'); // Node.js file system module
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
  /* The actual byte order mark written to the file is EF BB BF */
});
UTF-16 Little Endian Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
  /* The actual byte order mark written to the file is FF FE */
});
So, as you can see, \ufeff is simply a marker that can result in any number of encodings. The actual bytes that make it into the file depend directly on the encoding option specified. The marker used within the string is really irrelevant to what gets written to the file.
I suspect the reasoning behind this is that they chose not to write byte order marks automatically, and the 3-byte mark for UTF-8 isn't easily encoded into a JavaScript string to be written to disk. So they used the UTF-16 BOM code unit (FEFF) as a placeholder mark within the string, which gets substituted at write time.
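The bytes Node.js produces for \ufeff under each encoding can be checked without writing a file. This is a minimal sketch that assumes a Node.js environment, where Buffer is a global; it only shows the output bytes and says nothing about how writeFile is implemented internally.

```javascript
// Node.js only: Buffer is a Node global, not available in browsers.
var utf8Bytes = Buffer.from('\ufeff', 'utf8').toString('hex');
var utf16leBytes = Buffer.from('\ufeff', 'utf16le').toString('hex');

console.log(utf8Bytes);    // efbbbf
console.log(utf16leBytes); // fffe
```

The same one-character string gives EF BB BF or FF FE depending only on the encoding argument, which matches the two writeFile examples above and the point made by Casey and Vbakke in the comments.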
Solution 4
This is my solution:
var blob = new Blob(["\uFEFF" + csv], {
  type: 'text/csv; charset=utf-8'
});
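If the CSV text might already start with a BOM, a small wrapper can guard against adding a second one. This is only a sketch: makeCsvBlob is a hypothetical helper name, the sample CSV is made up, and a global Blob (browsers, or Node.js 18 and later) is assumed.

```javascript
// Hypothetical helper: wrap CSV text in a Blob that starts with exactly one BOM.
function makeCsvBlob(csv) {
  var BOM = '\uFEFF';
  // Avoid a doubled BOM if the text already starts with one.
  var text = csv.charAt(0) === BOM ? csv : BOM + csv;
  return new Blob([text], { type: 'text/csv;charset=utf-8' });
}

var csvBlob = makeCsvBlob('a,b\n1,2\n');
console.log(csvBlob.size); // 11: 3 BOM bytes + 8 bytes of CSV text

// In a browser, the Blob could then be offered as a download, for example:
// var url = URL.createObjectURL(csvBlob);
```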
kay
Updated on July 05, 2022
Comments
-
kay 4 months
I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?
Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿"my data"', of course.
Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).
Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?
Yes, I really do need the UTF-8 BOM in this case.
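The first attempt in the question fails because '\xEF\xBB\xBF' is a string of three characters, and the Blob constructor encodes string parts as UTF-8, so each character becomes two bytes. The sketch below illustrates this with TextEncoder, which is assumed to be available as a global (modern browsers and recent Node.js); the variable names are illustrative.

```javascript
// '\xEF\xBB\xBF' is three characters (U+00EF, U+00BB, U+00BF), not three bytes.
var attempt = new TextEncoder().encode('\xEF\xBB\xBF');
var attemptHex = Array.from(attempt, function (b) {
  return b.toString(16).toUpperCase().padStart(2, '0');
}).join(' ');

console.log(attempt.length); // 6
console.log(attemptHex);     // C3 AF C2 BB C2 BF
```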
-
Jeff Fischer almost 8 years
Well, if you look at the byte order mark and what I originally said, it's right. The FEFF byte order mark is not the byte order mark for UTF-8 as you stated in your question. The original answer seems to have stumbled onto the right answer or at least didn't elaborate at all. The only reason they got it right is because the options encoding defaults to utf-8. Not because the byte order mark they supplied is actually a UTF-8 byte order mark.
-
Jeff Fischer almost 8 years
lol, well, someone else will want to actually know how it works. Since the original answer doesn't describe why a UTF16LE BOM magically works. Someone in the future will want to actually understand what the heck is happening.
-
Jeff Fischer almost 8 years
Feel free to remove your mark down of my answer. It's not wrong.
-
Casey over 7 years
I'm a bit confused by this since the question doesn't mention node at all.
-
Jeff Fischer over 7 years
Yes, I assumed that the original question was not a browser question. I assumed that they were experiencing the exact same issue that I was experiencing, within node.
-
Casey over 7 years
It's not really specific to Node at all; I think you're a bit confused about the byte order mark.
-
Casey over 7 years
Specifically, you can see here that the BOM is always the same character (U+FEFF), and not a different character depending on what type of Unicode or endianness the text is in. It's true that the bytes written are different but that's because the same character is being written with different encodings.
-
KyleFarris about 6 years
Dude... yes. This works perfectly. Thanks! There are so many wrong/non-working answers on other questions.
-
carlosrafaelgn almost 6 years
A warning for anyone else reading this: watch out, as \ufeff is actually the UTF-16 BOM and not the UTF-8 BOM en.wikipedia.org/wiki/Byte_order_mark
-
Erik Töyrä Silfverswärd almost 6 years
Added some more details to the accepted answer to elaborate on why this works. Feel free to edit as you see fit.
-
Nehal Soni over 5 years
Great solution. Thanks @erik-töyrä
-
menepet over 5 years
Great piece of code for the BOM encoding, and it works great! @carlosrafaelgn You are right ... I want to make a TSV file with tab separators, and the tab character for UTF-8 is \t. The same char with UTF-16 BE (BOM) is not working and I cannot find the corresponding char ... Do you know where to find it, or what the \t char is? Thank you!
-
carlosrafaelgn over 5 years
@mEnE since \t (codepoint U+0009) is < 127, \t is 0x09 in UTF-8, just as it is in UTF-16 (0x0009). The only difference is the order the bytes are stored physically. In UTF-8 0x09. In UTF-16 LE 0x09, 0x00. In UTF-16 BE 0x00, 0x09.
-
Vbakke almost 5 years
Just a small clarification: The character \uFEFF is the BOM character for all UTFs (8, 16 LE and 16 BE). However, it is encoded as the bytes 0xEF 0xBB 0xBF, 0xFF 0xFE, and 0xFE 0xFF respectively. It's important to distinguish the internal Unicode character (\ufeff) from the various ways of representing that one character in bytes. :)
-
duma over 4 years
Holy crap, this worked!! I used it in an HTML doc I was sending to my Kindle. THANK YOU Erik!
-
Timothy Zorn over 3 years
This is the correct way to do it when using Blob or working with actual bytes instead of JS strings. Erik and Jeff's answers are correct when you're using JS strings and not actual bytes.
-
Cardin over 3 years
Might not be specific to Node. stackoverflow.com/questions/6002256/… A few people tried it in .NET and Java, and it worked too.
-
no-stale-reads over 1 year
thanks a lot. i've been searching for this a while!!
-
Bryan Lee over 1 year
Can you explain why this works please, and is utf-18 even a valid encoding?