Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings

Solution 1

The Unicode Problem

Though JavaScript (ECMAScript) has matured, the fragility of Base64, ASCII, and Unicode encoding has caused a lot of headache (much of it is in this question's history).

Consider the following example:

const ok = "a";
console.log(ok.codePointAt(0).toString(16)); //   61: fits in 1 byte

const notOK = "✓"
console.log(notOK.codePointAt(0).toString(16)); // 2713: occupies > 1 byte

console.log(btoa(ok));    // YQ==
console.log(btoa(notOK)); // error

Why do we encounter this?

Base64, by design, expects binary data as its input. In terms of JavaScript strings, this means strings in which each character occupies only one byte. So if you pass a string into btoa() containing characters that occupy more than one byte, you will get an error, because this is not considered binary data.

Source: MDN (2021)

The original MDN article also covered the broken nature of window.btoa and .atob, which have since been mended in modern ECMAScript. The original, now-dead MDN article explained:

The "Unicode Problem" Since DOMStrings are 16-bit-encoded strings, in most browsers calling window.btoa on a Unicode string will cause a Character Out Of Range exception if a character exceeds the range of a 8-bit byte (0x00~0xFF).
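
The one-byte rule quoted above can be checked before calling btoa. The following is a minimal sketch, not part of the original answer; isBinaryString is a hypothetical helper name introduced here for illustration only:

```javascript
// Hypothetical helper (not from the original answer): returns true when
// every UTF-16 code unit of the string fits in a single byte (0x00-0xFF),
// which is the condition btoa() places on its input.
function isBinaryString(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

console.log(isBinaryString("a")); // true
console.log(isBinaryString("à")); // true: U+00E0 is still within 0x00-0xFF
console.log(isBinaryString("✓")); // false: U+2713 does not fit in one byte
```

Note that a string passing this check is only safe to hand to btoa; the result is the base64 of its Latin-1 bytes, which is not necessarily the base64 of its UTF-8 bytes.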


Solution with binary interoperability

(Keep scrolling for the ASCII base64 solution)

Source: MDN (2021)

The solution recommended by MDN is to actually encode to and from a binary string representation:

Encoding UTF8 ⇢ binary

// convert a Unicode string to a string in which
// each 16-bit unit occupies only one byte
function toBinary(string) {
  const codeUnits = new Uint16Array(string.length);
  for (let i = 0; i < codeUnits.length; i++) {
    codeUnits[i] = string.charCodeAt(i);
  }
  return btoa(String.fromCharCode(...new Uint8Array(codeUnits.buffer)));
}

// a string that contains characters occupying > 1 byte
let encoded = toBinary("✓ à la mode") // "EycgAOAAIABsAGEAIABtAG8AZABlAA=="

Decoding binary ⇢ UTF-8

function fromBinary(encoded) {
  const binary = atob(encoded);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return String.fromCharCode(...new Uint16Array(bytes.buffer));
}

// our previous Base64-encoded string
let decoded = fromBinary(encoded) // "✓ à la mode"

Where this falls a little short is that the encoded string EycgAOAAIABsAGEAIABtAG8AZABlAA== does not match 4pyTIMOgIGxhIG1vZGU=, the string produced by the UTF-8 solution described below. This is because it encodes the string's raw 16-bit code units rather than its UTF-8 bytes. If this doesn't matter to you (i.e., you aren't exchanging UTF-8 strings with another system), then you're good to go. If, however, you want to preserve UTF-8 interoperability, you're better off using the solution described below.
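
To see why the two base64 strings differ, it can help to look at the bytes each approach feeds into btoa. This is an illustrative sketch only; hex and utf16leBytes are hypothetical helper names, and the 16-bit units are written little-endian explicitly (the Uint16Array approach above uses the platform's byte order, assumed here to be little-endian):

```javascript
// Hypothetical helper: space-separated hex dump of a byte array.
const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");

// Hypothetical helper: the string's 16-bit code units as little-endian bytes.
function utf16leBytes(str) {
  const view = new DataView(new ArrayBuffer(str.length * 2));
  for (let i = 0; i < str.length; i++) {
    view.setUint16(i * 2, str.charCodeAt(i), true); // true = little-endian
  }
  return new Uint8Array(view.buffer);
}

const utf8Bytes = new TextEncoder().encode("✓"); // TextEncoder always emits UTF-8
const utf16Bytes = utf16leBytes("✓");

console.log(hex(utf8Bytes));  // "e2 9c 93"
console.log(hex(utf16Bytes)); // "13 27"

// Different bytes in, different base64 out.
console.log(btoa(String.fromCharCode(...utf8Bytes)));  // "4pyT"
console.log(btoa(String.fromCharCode(...utf16Bytes))); // "Eyc="
```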


Solution with ASCII base64 interoperability

The entire history of this question shows just how many different ways we've had to work around broken encoding systems over the years. Though the original MDN article no longer exists, this solution is still arguably a better one, and does a great job of solving "The Unicode Problem" while maintaining plain text base64 strings that you can decode on, say, base64decode.org.

There are two possible methods to solve this problem:

  • the first one is to escape the whole string (with UTF-8, see encodeURIComponent) and then encode it;
  • the second one is to convert the UTF-16 DOMString to an UTF-8 array of characters and then encode it.

A note on previous solutions: the MDN article originally suggested using unescape and escape to solve the Character Out Of Range exception problem, but they have since been deprecated. Some other answers here have suggested working around this with decodeURIComponent and encodeURIComponent, but this has proven to be unreliable and unpredictable. The most recent update to this answer uses modern JavaScript functions to improve speed and modernize code.

If you're trying to save yourself some time, you could also consider using a library.

Encoding UTF8 ⇢ base64

    function b64EncodeUnicode(str) {
        // first we use encodeURIComponent to get percent-encoded UTF-8,
        // then we convert the percent encodings into raw bytes which
        // can be fed into btoa.
        return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g,
            function toSolidBytes(match, p1) {
                return String.fromCharCode('0x' + p1);
        }));
    }
    
    b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
    b64EncodeUnicode('\n'); // "Cg=="

Decoding base64 ⇢ UTF8

    function b64DecodeUnicode(str) {
        // Going backwards: from bytestream, to percent-encoding, to original string.
        return decodeURIComponent(atob(str).split('').map(function(c) {
            return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
        }).join(''));
    }
    
    b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
    b64DecodeUnicode('Cg=='); // "\n"

(Why do we need to do this? ('00' + c.charCodeAt(0).toString(16)).slice(-2) left-pads single-digit hex values with a 0. For example, when c == \n, c.charCodeAt(0).toString(16) returns a, so the padding forces it to be represented as 0a.)
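
The padding step can be sketched in isolation. The helper names viaSlice and viaPadStart below are hypothetical and not from the original answer; the sketch assumes padStart is available, and that the input comes from atob, so every character code is at most 0xFF:

```javascript
// The padding used in the answer: prepend "00", keep the last two characters.
const viaSlice = (c) => "%" + ("00" + c.charCodeAt(0).toString(16)).slice(-2);

// An alternative spelling with padStart, equivalent for codes 0x00-0xFF.
const viaPadStart = (c) => "%" + c.charCodeAt(0).toString(16).padStart(2, "0");

console.log(viaSlice("\n"));    // "%0a"
console.log(viaPadStart("\n")); // "%0a"

// Without padding, "\n" would become "%a", which decodeURIComponent rejects.
let unpaddedFails = false;
try {
  decodeURIComponent("%" + "\n".charCodeAt(0).toString(16));
} catch (e) {
  unpaddedFails = e instanceof URIError;
}
console.log(unpaddedFails); // true
```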


TypeScript support

Here's the same solution with some additional TypeScript compatibility (via @MA-Maddin):

// Encoding UTF8 ⇢ base64

function b64EncodeUnicode(str) {
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode(parseInt(p1, 16))
    }))
}

// Decoding base64 ⇢ UTF8

function b64DecodeUnicode(str) {
    return decodeURIComponent(Array.prototype.map.call(atob(str), function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2)
    }).join(''))
}

The first solution (deprecated)

This used escape and unescape (which are now deprecated, though this still works in all modern browsers):

function utf8_to_b64( str ) {
    return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
    return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

And one last thing: I first encountered this problem when calling the GitHub API. To get this to work on (Mobile) Safari properly, I actually had to strip all white space from the base64 source before I could even decode the source. Whether or not this is still relevant in 2021, I don't know:

function b64_to_utf8( str ) {
    str = str.replace(/\s/g, '');    
    return decodeURIComponent(escape(window.atob( str )));
}

Solution 2

Things change. The escape/unescape methods have been deprecated.

You can URI encode the string before you Base64-encode it. Note that this doesn't produce Base64-encoded UTF8, but rather Base64-encoded URL-encoded data. Both sides must agree on the same encoding.

See working example here: http://codepen.io/anon/pen/PZgbPW

// encode string
var base64 = window.btoa(encodeURIComponent('€ 你好 æøåÆØÅ'));
// decode string
var str = decodeURIComponent(window.atob(base64));
// str is now === '€ 你好 æøåÆØÅ'
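
The difference between Base64-encoded URL-encoded data and Base64-encoded UTF-8 can be made concrete with a single character. This is an illustrative sketch rather than part of the original answer; the variable names are made up for the example, and TextEncoder is assumed to be available:

```javascript
const text = "€";

// This solution's approach: percent-encode first, then base64 the ASCII result.
const uriThenBase64 = btoa(encodeURIComponent(text));

// For comparison: base64 of the raw UTF-8 bytes, as other systems usually produce.
const utf8Bytes = new TextEncoder().encode(text);
const utf8Base64 = btoa(String.fromCharCode(...utf8Bytes));

console.log(uriThenBase64); // "JUUyJTgyJUFD"
console.log(utf8Base64);    // "4oKs"

// A plain base64 decoder on the other side sees the percent-encoded text,
// not the original character, so it must also apply decodeURIComponent.
console.log(atob(uriThenBase64));                     // "%E2%82%AC"
console.log(decodeURIComponent(atob(uriThenBase64))); // "€"
```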

For OP's problem a third party library such as js-base64 should solve the problem.

Solution 3

The complete article that works for me: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Base64_encoding_and_decoding

The part where we encode from Unicode/UTF-8 is

function utf8_to_b64( str ) {
   return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
   return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

This is one of the most used methods nowadays.

Solution 4

If treating strings as bytes is more your thing, you can use the following functions:

function u_atob(ascii) {
    return Uint8Array.from(atob(ascii), c => c.charCodeAt(0));
}

function u_btoa(buffer) {
    var binary = [];
    var bytes = new Uint8Array(buffer);
    for (var i = 0, il = bytes.byteLength; i < il; i++) {
        binary.push(String.fromCharCode(bytes[i]));
    }
    return btoa(binary.join(''));
}


// example, it works also with astral plane characters such as '𝒞'
var encodedString = new TextEncoder().encode('✓');
var base64String = u_btoa(encodedString);
console.log('✓' === new TextDecoder().decode(u_atob(base64String)))

Solution 5

Decoding base64 to UTF8 String

Below is the current most-voted answer, by @brandonscript:

function b64DecodeUnicode(str) {
    // Going backwards: from bytestream, to percent-encoding, to original string.
    return decodeURIComponent(atob(str).split('').map(function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

The above code works, but it's very slow. If your input is a very large base64 string, for example a 30,000-character base64 HTML document, it will need a lot of computation.

Here is my answer: use the built-in TextDecoder, which is nearly 10x faster than the above code for large input.

function decodeBase64(base64) {
    const text = atob(base64);
    const length = text.length;
    const bytes = new Uint8Array(length);
    for (let i = 0; i < length; i++) {
        bytes[i] = text.charCodeAt(i);
    }
    const decoder = new TextDecoder(); // default is utf-8
    return decoder.decode(bytes);
}
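
This answer only covers decoding. A matching encoder could be sketched along the same lines with TextEncoder. The name encodeBase64 and the chunk size are assumptions made for this illustration, not part of the original answer; decodeBase64 is repeated in compact form so the example runs on its own:

```javascript
// Hypothetical counterpart to decodeBase64: UTF-8 string -> base64.
function encodeBase64(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  const chunkSize = 0x8000; // assumed chunk size; keeps argument lists small
  let binary = "";
  for (let i = 0; i < bytes.length; i += chunkSize) {
    // Turn each chunk of bytes into a one-byte-per-character string for btoa.
    binary += String.fromCharCode(...bytes.subarray(i, i + chunkSize));
  }
  return btoa(binary);
}

// Compact form of the decoder above, included to show the round trip.
function decodeBase64(base64) {
  const bytes = Uint8Array.from(atob(base64), (c) => c.charCodeAt(0));
  return new TextDecoder().decode(bytes); // default is utf-8
}

const encoded = encodeBase64("✓ à la mode");
console.log(encoded);               // "4pyTIMOgIGxhIG1vZGU="
console.log(decodeBase64(encoded)); // "✓ à la mode"
```

Building the binary string in chunks avoids passing one very large argument list to String.fromCharCode, which can fail on large inputs in some engines.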

Author: brandonscript

Quantum foam traveler. Purveyor of opinions. Product design https://delving.com. UX & Code. Advisor https://pacificaviator.co. Previously Google, Apigee. ADHD. Be kind. (He/him)

Updated on April 08, 2022

Comments

  • brandonscript
    brandonscript about 2 years

    I'm using the Javascript window.atob() function to decode a base64-encoded string (specifically the base64-encoded content from the GitHub API). Problem is I'm getting ASCII-encoded characters back (like ⢠instead of ). How can I properly handle the incoming base64-encoded stream so that it's decoded as utf-8?

    • Pointy
      Pointy almost 9 years
      The MDN page you linked has a paragraph starting with the phrase "For use with Unicode or UTF-8 strings,".
    • Bergi
      Bergi almost 9 years
      Are you on node? There are better solutions than atob
  • brandonscript
    brandonscript over 8 years
    Looks like the doc link is even different from this now, suggesting a regex solution to manage it.
  • bodo
    bodo about 8 years
    This will not work, because encodeURIComponent is the inverse of decodeURIComponent, i.e. it will just undo the conversion. See stackoverflow.com/a/31412163/1534459 for a great explanation of what is happening with escape and unescape.
  • Darkves
    Darkves about 8 years
    @canaaerus I don't understand your comment? escape and unescape are deprecated, I just swap them with [decode|encode]URIComponent function :-) Everything is work just fine. Read the question first
  • bodo
    bodo about 8 years
    @Darkves: The reason why encodeURIComponent is used, is to correctly handle (the whole range of) unicode strings. So e.g. window.btoa(decodeURIComponent(encodeURIComponent('€'))) gives Error: String contains an invalid character because it’s the same as window.btoa('€') and btoa can not encode €.
  • Tedd Hansen
    Tedd Hansen about 8 years
    No point in arguing this: codepen.io/anon/pen/NxmRmj gives "Uncaught InvalidCharacterError: Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range."
  • Tedd Hansen
    Tedd Hansen about 8 years
    w3schools.com/jsref/jsref_unescape.asp "The unescape() function was deprecated in JavaScript version 1.5. Use decodeURI() or decodeURIComponent() instead."
  • Neo
    Neo about 8 years
    You saved my days, bro
  • weeix
    weeix almost 8 years
    Update: Solution #1 in MDN's The "Unicode Problem" was fixed, b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); now correctly output "✓ à la mode"
  • Stefan Steiger
    Stefan Steiger almost 8 years
    @Darkves: Yes, that's correct. But you can't swap escape with EncodeURIComponent and unescape with DecodeURIComponent, because the Encode and the escape methods don't do the same thing. Same with decode&unescape. I originally made the same mistake, btw. You should notice that if you take a string, UriEncode it, then UriDecode it, you get the same string back that you inputted. So doing that would be nonsense. When you unescape a string encoded with encodeURIComponent, you don't get the same string back that you inputted, so that's why with escape/unescape it works, but not with yours.
  • Darkves
    Darkves over 7 years
    @Stefan Steiger look at Tedd Hansen comment. I made a mistake, and I'm sry. HF commenting around :)
  • daniel.gindi
    daniel.gindi over 7 years
    Another way to decode would be decodeURIComponent(atob('4pyTIMOgIGxhIG1vZGU=').split('').map(x => '%' + x.charCodeAt(0).toString(16)).join('')) Not the most performant code, but it is what it is.
  • Crashalot
    Crashalot over 7 years
    The base64-js link is dead?
  • brandonscript
    brandonscript over 7 years
    Fixed. Thanks @Crashalot.
  • Riccardo Galli
    Riccardo Galli about 7 years
    I'd like to point out that you're not producing the base64 of the input string, but of his encoded component. So if you send it away the other party cannot decode it as "base64" and get the original string
  • Martin Schneider
    Martin Schneider almost 7 years
    return String.fromCharCode(parseInt(p1, 16)); to have TypeScript compatibility.
  • Tedd Hansen
    Tedd Hansen over 6 years
    You are correct, I have updated the text to point that out. Thanks. The alternative seems to be implementing base64 yourself, using a third party library (such as js-base64) or receiving "Error: Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range."
  • Parth Jasani
    Parth Jasani over 6 years
    I have same issue can you please check it jsfiddle.net/parthjasani/hz5713b0/2
  • Ryan
    Ryan over 5 years
    Thanks. Your answer was crucial in helping me get this working, which took me many hours over multiple days. +1. stackoverflow.com/a/51814273/470749
  • brandonscript
    brandonscript over 5 years
    Can you elaborate on what you mean by "user-created way" vs. "interpretable by the browser"? What is the value-add of using this solution over, say, what Mozilla recommends?
  • Jack G
    Jack G over 5 years
    @brandonscript Mozilla is different from MDN. MDN is user-created content. The page on MDN that recommends your solution was user-created content, not browser vendor created content.
  • brandonscript
    brandonscript over 5 years
    Is your solution vendor created? I’d so, I’d suggest giving credit to the origin. If not, then it is also user-created, and no different than MDN’s answer?
  • Jack G
    Jack G over 5 years
    @brandonscript Good point. You are correct. I removed that piece of text. Also, check out the demo I added.
  • Jack G
    Jack G about 4 years
    For a much faster and more cross-browser solution (but essentially the same output), please see stackoverflow.com/a/53433503/5601591
  • Riccardo Galli
    Riccardo Galli about 4 years
    u_atob and u_btoa use functions available in every browser since IE10 (2012), looks solid to me (if you refer to TextEncoder, that's just an example)
  • Khanh Hua
    Khanh Hua over 3 years
    Works for me as I am trying to decode Github API response which contains German umlaut. Thank you!!
  • Oliver Joseph Ash
    Oliver Joseph Ash over 3 years
    It seems the MDN article has been updated and the explanation has now been moved here: developer.mozilla.org/en-US/docs/Web/API/…
  • ZalemCitizen
    ZalemCitizen over 3 years
    unescape seems about to become deprecated developer.mozilla.org/fr/docs/Web/JavaScript/Reference/…
  • Milad
    Milad about 3 years
    Why does this line first prepend '00' and then picks the last two chars? '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2) Why doesn't it simply do this? '%' + c.charCodeAt(0).toString(16)
  • yaya
    yaya about 3 years
    @brandonscript appending '00' and slicing was a bit confusing (for me and @Milad), so I added a comment about that on answer. can you take a look at it?
  • brandonscript
    brandonscript about 3 years
    Yeah I’m not sure either. And this keeps getting changed by ECMAScript and Mozilla, so the answer keeps changing. I don’t even see this solution in the link anymore. I’ll look at modernizing this answer since it’s such a high-traffic result.
  • Milad
    Milad about 3 years
    Thank you @yaya, I didn’t realize it was left padding! I wish IE supported str.padStart(2, '0'). Question: would the solution be functionally equivalent with that?
  • yaya
    yaya about 3 years
    @Milad I think they are same. the only difference is that .slice(-2) makes sure that max length is 2 (for example it converts 35f to 5f), but .padStart(2, '0') doesn't. but I think all utf-8 encoded characters are less than ff, otherwise, this solution wasn't totally correct. but to make sure it doesn't throw an error for unsupported outranged characters (if there are any), it's safer to not change it.
  • Milad
    Milad about 3 years
    @yaya in theory, even if the original characters occupied more than 1 byte (e.g., 'à'), when translated to base64 they become 1-byte character sequences (since base64 produces ascii chars), so it should never be longer than 2 characters - for example: 'à' -[base64]-> 'w6A=' -[atob]-> 'Ã ' -[%encoding]-> '%c3%a0' -[decodeURIComponent]-> 'à'
  • yaya
    yaya about 3 years
    @brandonscript thanks, but honestly i don't like the edit at all. previously there was a straightforward encode and decode function (2018-2021), but the 2021 solution doesn't have it, and it seems detailed and so long for busy developers. (I didn't understand it also.).
  • brandonscript
    brandonscript about 3 years
    @yaya I hate it too, tbh. The trouble is that the information provided was conflicting with MDN, and there are some reasonably good reasons why. I will continue to monitor the sentiment of this answer and update it to make sure it's as helpful as possible. I did just make some minor changes to improve the heading, and hopefully explain better why there are two answers now.
  • yaya
    yaya about 3 years
    @brandonscript thanks. i read it again and i get it know. the confusing part for me is that the first block of code doesn't contain the solution, and the first block of description also doesn't contain the solution. so maybe you can format it like : 1. (code description) first convert it to binary, then decode it. 2. code solution, the btoa(fromBinary(...)) code. 3. describing the problem and the code that describes the problem. or something like this. (it's just a suggestion, please don't apply it if you don't like it.)
  • brandonscript
    brandonscript about 3 years
    See what you think now. Good ideas.
  • brandonscript
    brandonscript about 3 years
    This is actually a pretty cool solution. I think it wouldn't have worked in the past, because atob and btoa were broken, but now they're not.
  • yaya
    yaya about 3 years
    @brandonscript Sorry for the late reply. (it's your post, so when you don't mention me with @, I don't get any notifications.). I think that's well-formatted now. the only problem is that Encoding UTF8 ⇢ binary part doesn't contain the usage code (let encoded = btoa(toBinary("✓ à la mode"))).
  • yaya
    yaya about 3 years
    @brandonscript and also maybe changing btoa(toBinary("✓")) to a single function Is more cool, like : binaryEncode("✓"). (just like base64 version)
  • yaya
    yaya about 3 years
    @Brandonscript thanks, now the only concern is the function name. I'm not sure but shouldn't it be like : b64BinaryEncode? (since you combined the toBinary and btoa). I'm not sure about it however.
  • Benoit Gauthier
    Benoit Gauthier almost 3 years
    Great answer, works great between js and php
  • Duc Manh Nguyen
    Duc Manh Nguyen over 2 years
    if i use b64EncodeUnicode(str) function in Javascript. How to Decode it in PHP? Can you convert function b64DecodeUnicode(str) to PHP function ?
  • Elpy
    Elpy over 2 years
    Exactly what I needed. My base64 encoded UTF-8 strings come from a Python script (base64.b64encode) and this makes it work with UTF-8 characters without changing anything on the Python side. Works like a charm!