Java String.getBytes("UTF8") JavaScript analog
Solution 1
You can use this function (gist):
function toUTF8Array(str) {
    var utf8 = [];
    for (var i = 0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6),
                      0x80 | (charcode & 0x3f));
        }
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12),
                      0x80 | ((charcode >> 6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
        else {
            // let's keep things simple and only handle chars up to U+FFFF...
            utf8.push(0xef, 0xbf, 0xbd); // U+FFFD "replacement character"
        }
    }
    return utf8;
}
Example of use:
>>> toUTF8Array("中€")
[228, 184, 173, 226, 130, 172]
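The function above substitutes U+FFFD for each half of a surrogate pair, so characters beyond U+FFFF are lost. As a rough sketch of how it could be extended (the name toUTF8ArrayFull is hypothetical, not from the gist), a high surrogate followed by a low surrogate can be combined into one code point and written as a four-byte sequence:

```javascript
// Hypothetical variant of toUTF8Array that also encodes surrogate pairs.
function toUTF8ArrayFull(str) {
    var utf8 = [];
    for (var i = 0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) {
            utf8.push(charcode);
        } else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6),
                      0x80 | (charcode & 0x3f));
        } else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12),
                      0x80 | ((charcode >> 6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        } else {
            // Surrogate range: look ahead for a matching low surrogate.
            var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
            if (charcode < 0xdc00 && next >= 0xdc00 && next < 0xe000) {
                i++; // consume the low surrogate as well
                var cp = 0x10000 + (((charcode & 0x3ff) << 10) | (next & 0x3ff));
                utf8.push(0xf0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3f),
                          0x80 | ((cp >> 6) & 0x3f),
                          0x80 | (cp & 0x3f));
            } else {
                // Unpaired surrogate: emit U+FFFD, as the original does.
                utf8.push(0xef, 0xbf, 0xbd);
            }
        }
    }
    return utf8;
}
```

For strings within the Basic Multilingual Plane this should behave the same as the original function.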
If you want negative numbers for values over 127, like Java's byte-to-int conversion does, you have to tweak the constants and use
    utf8.push(0xffffffc0 | (charcode >> 6),
              0xffffff80 | (charcode & 0x3f));
and
    utf8.push(0xffffffe0 | (charcode >> 12),
              0xffffff80 | ((charcode >> 6) & 0x3f),
              0xffffff80 | (charcode & 0x3f));
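An alternative to changing the constants is to convert an already-built array afterward. This is only a sketch, and toSignedBytes is a hypothetical helper name: shifting each value left and then right by 24 bits sign-extends the low 8 bits, which mirrors how Java's signed byte type reads values over 127.

```javascript
// Hypothetical helper: map unsigned byte values (0..255) to the
// signed range (-128..127) that Java's byte type uses.
function toSignedBytes(bytes) {
    var signed = new Array(bytes.length);
    for (var i = 0; i < bytes.length; i++) {
        // << 24 moves the byte into the top of a 32-bit int;
        // >> 24 shifts it back while preserving the sign bit.
        signed[i] = (bytes[i] << 24) >> 24;
    }
    return signed;
}
```

With this, toSignedBytes([228, 184, 173]) gives [-28, -72, -83], while values below 128 are left unchanged.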
Solution 2
You don't need to write a full-on UTF-8 encoder; there is a much easier JS idiom to convert a Unicode string into a string of bytes representing UTF-8 code units:
unescape(encodeURIComponent(str))
(This works because the odd encoding used by escape/unescape uses %xx hex sequences to represent ISO-8859-1 characters with that code, instead of UTF-8 as used by URI-component escaping. Similarly, decodeURIComponent(escape(bytes)) goes in the other direction.)
So if you want an Array out of it, it would be:
function toUTF8Array(str) {
    var utf8 = unescape(encodeURIComponent(str));
    var arr = new Array(utf8.length);
    for (var i = 0; i < utf8.length; i++)
        arr[i] = utf8.charCodeAt(i);
    return arr;
}
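The reverse direction mentioned above, decodeURIComponent(escape(bytes)), can be wrapped the same way. This is a minimal sketch with a hypothetical name, fromUTF8Array. It assumes the input is valid UTF-8, since decodeURIComponent throws a URIError on malformed sequences. Note also that escape and unescape are deprecated, although they remain widely available.

```javascript
// Hypothetical inverse of toUTF8Array: turn an array of UTF-8 byte
// values back into a JavaScript string.
function fromUTF8Array(arr) {
    var bytes = "";
    for (var i = 0; i < arr.length; i++) {
        // & 0xff also accepts Java-style negative byte values.
        bytes += String.fromCharCode(arr[i] & 0xff);
    }
    // escape() writes each byte as %xx; decodeURIComponent() then
    // interprets those %xx sequences as UTF-8.
    return decodeURIComponent(escape(bytes));
}
```

For example, fromUTF8Array([228, 184, 173, 226, 130, 172]) returns "中€", the string used in the earlier example.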
Solution 3
TextEncoder is part of the Encoding Living Standard and, according to the Encoding API entry from the Chromium Dashboard, it shipped in Firefox and will ship in Chrome 38. There is also a text-encoding polyfill available for other browsers.
The JavaScript code sample below returns a Uint8Array filled with the values you expect.
(new TextEncoder()).encode("string")
// [115, 116, 114, 105, 110, 103]
A more interesting example, which better shows UTF-8 at work, replaces the "in" in "string" with "îñ":
(new TextEncoder()).encode("strîñg")
// [115, 116, 114, 195, 174, 195, 177, 103]
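To get closer to what Java's getBytes("UTF8") returns, the unsigned values from TextEncoder can be reinterpreted as signed bytes. The sketch below assumes an environment where TextEncoder is available as a global; javaLikeGetBytes is a hypothetical name used only for illustration.

```javascript
// Hypothetical helper: UTF-8 encode a string and return plain
// numbers in Java's signed byte range (-128..127).
function javaLikeGetBytes(str) {
    var unsigned = new TextEncoder().encode(str); // Uint8Array
    // Constructing an Int8Array from another typed array copies the
    // elements and wraps each value into the signed 8-bit range.
    return Array.from(new Int8Array(unsigned));
}
```

Here javaLikeGetBytes("中") gives [-28, -72, -83], and ASCII input such as "string" comes back unchanged as [115, 116, 114, 105, 110, 103].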
ivkremer
Updated on September 16, 2022
Comments
-
ivkremer over 1 year
The functions written there work properly, that is, pack(unpack("string")) yields "string". But I would like to get the same result as "string".getBytes("UTF8") gives in Java.
The question is: how do I write a JavaScript function that provides the same functionality as Java's getBytes("UTF8")?
For Latin strings, unpack(str) from the article mentioned above gives the same result as getBytes("UTF8"), except that it adds 0 at odd positions. But with non-Latin strings it seems to work completely differently. Is there a way to work with string data in JavaScript the way Java does?
-
obataku
@Kremchik JavaScript uses UTF-16, hence the 0s -- they're the upper half of each 16-bit code unit. That Hanzi character requires 3 bytes when encoded according to the UTF-8 scheme, but only 2 bytes via UTF-16.
-
masterxilo almost 8 years
This seems much better than the bit-fiddling custom implementation that is the accepted answer.