Adding UTF-8 BOM to string/Blob

Solution 1

Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx

See discussion between @jeff-fischer and @casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of UTF-8 or UTF-16 being used.

See p. 36 in The Unicode Standard 5.0, Chapter 2, for a detailed explanation. A quote from that page:

The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
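
To make the point concrete, here is a minimal sketch that prints the UTF-8 bytes for \ufeff, and for the '\xEF\xBB\xBF' string tried in the question. It assumes an environment with a global TextEncoder (current browsers and recent Node.js versions); utf8Hex and content are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: encode a string as UTF-8 and return the bytes as hex.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

const content = '"my data"'; // placeholder payload, as in the question

// The single character U+FEFF becomes the three-byte UTF-8 signature.
console.log(utf8Hex("\ufeff")); // ef bb bf

// Three separate characters U+00EF, U+00BB, U+00BF each become two bytes,
// which is why the '\xEF\xBB\xBF' attempt produces garbage instead of a BOM.
console.log(utf8Hex("\xEF\xBB\xBF")); // c3 af c2 bb c2 bf

// Prepending \ufeff to real content puts the signature first.
console.log(utf8Hex("\ufeff" + content).slice(0, 8)); // ef bb bf
```

Since string parts passed to the Blob constructor are also encoded as UTF-8, the same reasoning should apply to new Blob(['\ufeff' + content]).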

Solution 2

I had the same issue and this is the solution I came up with:

var blob = new Blob([
    new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM, passed as raw bytes
    "Text"
    // ... remaining data
], { type: "text/plain;charset=utf-8" });

Using Uint8Array prevents the browser from converting those bytes into a string (tested on Chrome and Firefox).

You should replace text/plain with your desired MIME type.
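
If you want the BOM and the text as one byte array (for example to inspect it, or to hand it to an API other than Blob), the same idea can be sketched as below. This is only an illustration, not part of the original solution: withUtf8Bom and UTF8_BOM are hypothetical names, and a global TextEncoder is assumed.

```javascript
const UTF8_BOM = new Uint8Array([0xef, 0xbb, 0xbf]);

// Hypothetical helper: returns the UTF-8 BOM followed by the UTF-8 bytes of `text`.
function withUtf8Bom(text) {
  const body = new TextEncoder().encode(text); // UTF-8 bytes of the payload
  const out = new Uint8Array(UTF8_BOM.length + body.length);
  out.set(UTF8_BOM, 0);           // BOM first
  out.set(body, UTF8_BOM.length); // payload right after it
  return out;
}

const bytes = withUtf8Bom("Text");
console.log(Array.from(bytes)); // [239, 187, 191, 84, 101, 120, 116]
```

The resulting array can be passed to the constructor as a single part, e.g. new Blob([bytes], { type: "text/plain;charset=utf-8" }).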

Solution 3

I'm editing my original answer. The above answer really demands elaboration, because what Node.js does here is convoluted.

The short answer is, yes, this code works.

The long answer is, no, FE FF is not the byte order mark for UTF-8. Apparently Node took some sort of shortcut for writing encodings within files. As a byte sequence, FE FF is the UTF-16 big-endian BOM (the little-endian one is FF FE), as can be seen in the Byte Order Mark Wikipedia article. The bytes that actually get written can be viewed in a binary editor after writing the file. I've verified this is the case.

http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding

Apparently, Node.js uses \ufeff to signify the BOM for any number of encodings. It takes the \ufeff marker and converts it into the correct byte order mark based on the third (options) parameter of writeFile, which is where you pass the encoding string. Node.js takes this encoding string and converts the \ufeff marker into that encoding's actual byte order mark.

UTF-8 Example:

fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
   /* The actual byte order mark written to the file is EF BB BF */
});

UTF-16 Little Endian Example:

fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
   /* The actual byte order mark written to the file is FF FE */
});

So, as you can see, \ufeff is simply a marker that can stand for any number of resulting encodings. The bytes that actually make it into the file depend directly on the encoding option specified. The marker used within the string is really irrelevant to what gets written to the file.
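
The two byte sequences can be checked without touching the disk. Below is a minimal Node.js-only sketch (Buffer is a Node global); bomBytes is a hypothetical helper name. It simply encodes the one-character string '\ufeff' with each encoding, which appears to be all that happens at write time.

```javascript
// Hypothetical helper: hex bytes produced when '\ufeff' is encoded with `encoding`.
function bomBytes(encoding) {
  return Buffer.from("\ufeff", encoding).toString("hex");
}

console.log(bomBytes("utf8"));    // efbbbf
console.log(bomBytes("utf16le")); // fffe
```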

I suspect the reasoning is that they chose not to write byte order marks automatically, and the 3-byte mark for UTF-8 isn't easily encoded into a JavaScript string to be written to disk. So they used the UTF-16 BOM code unit as a placeholder mark within the string, which gets substituted at write time.

Solution 4

This is my solution:

var blob = new Blob(["\uFEFF" + csv], {
    type: 'text/csv; charset=utf-8'
});

Author: kay

Working at my alma mater, where I studied CS in order to aid the forces of light, and thwart the forces of darkness. I'm fluent in Python, C++, C, Cython, JavaScript, CSS, HTML, (and Java if I have to). At one point I knew Haskell, Pascal, Erlang, Prolog, Matlab, but forgot most about it.

Updated on July 05, 2022

Comments

  • kay 4 months

    I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?

    Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿"my data"', of course.

    Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).

    Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?

    Yes, I really do need the UTF-8 BOM in this case.

  • Jeff Fischer almost 8 years
    Well, if you look at the byte order mark and what I originally said, it's right. The FEFF byte order mark is not the byte order mark for UTF-8 as you stated in your question. The original answer seems to have stumbled onto the right answer or at least didn't elaborate at all. The only reason they got it right is because the options encoding defaults to utf-8. Not because the byte order mark they supplied is actually a UTF-8 byte order mark.
  • Jeff Fischer almost 8 years
    lol, well, someone else will want to actually know how it works. Since the original answer doesn't describe why a UTF16LE BOM magically works. Someone in the future will want to actually understand what the heck is happening.
  • Jeff Fischer almost 8 years
    Feel free to remove your mark down of my answer. It's not wrong.
  • Casey over 7 years
    I'm a bit confused by this since the question doesn't mention node at all.
  • Jeff Fischer over 7 years
    Yes, I assumed that the original question was not a browser question. I assumed that they were experiencing the exact same issue that I was experiencing, within node.
  • Casey over 7 years
    It's not really specific to Node at all; I think you're a bit confused about the byte order mark.
  • Casey over 7 years
    Specifically, you can see here that the BOM is always the same character (U+FEFF), and not a different character depending on what type of Unicode or endianness the text is in. It's true that the bytes written are different but that's because the same character is being written with different encodings.
  • KyleFarris about 6 years
    Dude... yes. This works perfectly. Thanks! There are so many wrong/non-working answers on other questions.
  • carlosrafaelgn almost 6 years
    A warning for anyone else reading this: watch out, as \ufeff is actually the UTF-16 BOM and not the UTF-8 BOM en.wikipedia.org/wiki/Byte_order_mark
  • Erik Töyrä Silfverswärd almost 6 years
    Added some more details to the accepted answer to elaborate on why this works. Feel free to edit as you see fit.
  • Nehal Soni over 5 years
    Great solution. Thanks @erik-töyrä
  • menepet over 5 years
    Great piece of code for the BOM encoding, and it works great! @carlosrafaelgn You are right ... I want to make a TSV file with tab separators, and the tab character for UTF-8 is \t. The same char with UTF-16 BE (BOM) is not working and I cannot find the corresponding char ... Do you know where to find it, or what char the \t is? Thank you!
  • carlosrafaelgn over 5 years
    @mEnE since \t (codepoint U+0009) is < 127, \t is 0x09 in UTF-8, just as it is in UTF-16 (0x0009). The only difference is the order the bytes are stored physically. In UTF-8 0x09. In UTF-16 LE 0x09, 0x00. In UTF-16 BE 0x00, 0x09.
  • Vbakke almost 5 years
    Just a small clarification: The character \uFEFF is the BOM character for all UTFs (8, 16 LE and 16 BE). However, it is encoded as the bytes 0xEF 0xBB 0xBF, 0xFF 0xFE, and 0xFE 0xFF, respectively. It's important to distinguish the internal Unicode character (\ufeff) from the various ways of representing that one character in bytes. :)
  • duma over 4 years
    Holy crap, this worked!! I used it in an HTML doc I was sending to my Kindle. THANK YOU Erik!
  • Timothy Zorn over 3 years
    This is the correct way to do it when using Blob or working with actual bytes instead of JS strings. Erik and Jeff's answers are correct when you're using JS strings and not actual bytes.
  • Cardin over 3 years
    Might not be specific to Node. stackoverflow.com/questions/6002256/… A few people tried it in .NET and Java, and it worked too.
  • no-stale-reads over 1 year
    thanks a lot. i've been searching for this a while!!
  • Bryan Lee over 1 year
    Can you explain why this works, please? And is utf-18 even a valid encoding?