Node JS Detect string encoding

12,338

There's no such thing as a CP437-encoded String in [Node]JS. Strings are always Unicode (well, UTF-16 code units).
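A small sketch of that distinction, using only Node's built-in Buffer. The sample text and variable names are just for illustration: the string itself is a sequence of UTF-16 code units, and an encoding only comes into play when it is converted to or from bytes.

```javascript
// '¿Quién' written with escapes so the example does not depend on the file's own encoding.
const sampleText = '\u00bfQui\u00e9n';

// The string is a sequence of UTF-16 code units, whatever bytes it originally came from.
const codeUnits = Array.from(sampleText, (ch) => ch.charCodeAt(0).toString(16));

// An encoding only matters at the boundary between strings and bytes.
const utf8Bytes = Buffer.from(sampleText, 'utf8');     // 8 bytes: ¿ and é take two bytes each
const latin1Bytes = Buffer.from(sampleText, 'latin1'); // 6 bytes: one byte per character

console.log(codeUnits, utf8Bytes, latin1Bytes);
```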

What you have in ¨Quin ha enga¤ado is a String that has been decoded from bytes using the wrong encoding at some point in the past (aka mojibake). You need to find where that String came from, and change the encoding that was used to convert it from bytes.

It is sometimes possible to rescue a badly-decoded string by encoding back to a Buffer using the same encoding as was wrongly used to decode it, and then decoding it again with the right encoding this time. But this only works when all the bytes used happen to have mappings in the wrongly-used code page, and there is no further damage to the string.

It looks like you have a string that has been decoded using ISO-8859-1, so in principle you could encode it back as ISO-8859-1 (e.g. Buffer.from(s, 'latin1'), or new Buffer(s, 'binary') on old Node versions) and then decode the buffer as cp437. Unfortunately this encoding is not available in Node, so you need a third-party module such as iconv-lite.
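As a rough sketch of that round trip, assuming the string still contains every misdecoded character: to stay self-contained, the example uses a hypothetical three-entry CP437 table (CP437_SAMPLE) and a hypothetical helper (decodeCp437Sample) rather than a real codec. In practice a module such as iconv-lite would do the decoding step.

```javascript
// Hypothetical, deliberately partial CP437 table: only the three high bytes this example needs.
// A real program would use a full codec, e.g. iconv-lite's iconv.decode(bytes, 'cp437').
const CP437_SAMPLE = { 0xa8: '\u00bf', 0x82: '\u00e9', 0xa4: '\u00f1' }; // ¿ é ñ

function decodeCp437Sample(bytes) {
  let out = '';
  for (const byte of bytes) {
    if (byte < 0x80) {
      out += String.fromCharCode(byte); // ASCII range is identical in CP437
    } else if (byte in CP437_SAMPLE) {
      out += CP437_SAMPLE[byte];
    } else {
      out += '\ufffd'; // not covered by this partial table
    }
  }
  return out;
}

function repairLatin1AsCp437(mojibake) {
  // Step 1: undo the wrong decode by encoding back with the wrongly-used encoding.
  const originalBytes = Buffer.from(mojibake, 'latin1');
  // Step 2: decode the recovered bytes with the encoding they were really written in.
  return decodeCp437Sample(originalBytes);
}

// The misdecoded string, including the invisible U+0082 that stands in for the "é" byte.
console.log(repairLatin1AsCp437('\u00a8Qui\u0082n ha enga\u00a4ado'));
```

If the U+0082 character is already missing from the input, the same call returns the text without the é, because there is no byte left to decode.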

However, your string has suffered further damage in that the é has completely disappeared. That could be because the misdecoded character for that byte is an invisible control character that StackOverflow doesn't allow to be posted, or it could be because that control character has been lost somewhere up the chain. If so, you cannot recover the original string at all.
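One way to check which of the two cases applies is to look for C1 control characters (U+0080 to U+009F) in the string as your program actually receives it. In CP437 the é is byte 0x82, which ISO-8859-1 maps to the invisible control U+0082. The helper name findC1Controls below is made up for this sketch.

```javascript
// Report every C1 control character (U+0080..U+009F) found in a string.
function findC1Controls(s) {
  const hits = [];
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    if (code >= 0x80 && code <= 0x9f) {
      hits.push({ index: i, code: 'U+' + code.toString(16).toUpperCase().padStart(4, '0') });
    }
  }
  return hits;
}

// Control character still present: the round trip through bytes can still work.
console.log(findC1Controls('\u00a8Qui\u0082n'));
// Control character already stripped: the "é" byte is gone for good.
console.log(findC1Controls('\u00a8Quin'));
```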

I wish to dynamically detect the encoding type

There is no general way to automatically detect the encoding of a buffer, only vague heuristics (see the chardet module for an implementation of this). This is doubly difficult when you have mojibake, because you have to guess both the real encoding, and the wrongly-applied encoding.
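To give a feel for what such a heuristic looks like, here is a minimal sketch of one of the more dependable checks: whether a buffer is well-formed UTF-8. It uses the standard TextDecoder with the fatal option. The function name looksLikeUtf8 is an assumption for this example, and this is not how chardet itself is implemented.

```javascript
// Heuristic: does this buffer decode as strict UTF-8 without errors?
function looksLikeUtf8(buf) {
  try {
    // fatal: true makes decode() throw on malformed input instead of inserting U+FFFD.
    new TextDecoder('utf-8', { fatal: true }).decode(buf);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(looksLikeUtf8(Buffer.from('\u00bfQui\u00e9n', 'utf8')));
console.log(looksLikeUtf8(Buffer.from([0xa8, 0x51, 0x75, 0x69, 0x82, 0x6e]))); // CP437 bytes
```

Even this check only says what the bytes are not. A buffer that fails could be CP437, ISO-8859-1 or any other single-byte encoding, and a pure-ASCII buffer passes while being equally valid in most of them.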

You can burn a lot of time trying to detect common patterns, but ultimately you will never have a reliable solution. After all, ¨Quin ha enga¤ado is a perfectly valid sequence of characters already; how would your code know that wasn't what was meant?

Much better to fix the bug further up, where the bad decode actually happened.
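For scraped pages, "further up" usually means the point where the response bytes are turned into a string. A minimal sketch, assuming you have the raw body as a Buffer and the Content-Type header as a string: both function names are hypothetical, and legacy labels such as iso-8859-1 need a Node build with full ICU support for TextDecoder.

```javascript
// Extract the charset parameter from a Content-Type header value, if there is one.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : null;
}

// Decode the raw bytes once, with the declared charset, instead of repairing strings later.
function decodeBody(bodyBytes, contentType) {
  const charset = charsetFromContentType(contentType) || 'utf-8'; // fallback is an assumption
  return new TextDecoder(charset).decode(bodyBytes); // throws RangeError for unknown labels
}

const body = Buffer.from('\u00bfQui\u00e9n', 'utf8');
console.log(decodeBody(body, 'text/html; charset="UTF-8"'));
```

The declared charset can itself be wrong or missing, and encodings such as CP437 are not covered by TextDecoder at all, so a third-party decoder would still be needed for those.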

Author by

alpha_cod

Updated on June 07, 2022

Comments

  • alpha_cod
    alpha_cod almost 2 years

    How can I detect the encoding of a string in Node JS and convert the string into a valid Unicode string?

    For example, how do I detect a CP437-encoded string and convert it into a valid Unicode string?

    Input: ¨Quin ha enga¤ado

    Output: ¿Quién ha engañado

    I wish to dynamically detect the encoding type and convert the string into a valid Unicode string. Thanks in advance.

  • alpha_cod
    alpha_cod over 8 years
    Thanks for your suggestions. This information is actually crawled from the web, and there's no control over the source info since it's all from open websites.
  • Amit Patil
    Amit Patil over 8 years
    When you're scraping, you want to determine/guess the encoding of the page at the point you download it. There are some examples in this question (stackoverflow.com/questions/12326688/node-js-scrape-encoding) if you are using request.
  • Stas Arshanski
    Stas Arshanski over 6 years
    If you know the language of the document, you can run an encoding conversion from each encoding in a list to each encoding in the same list (A->A, A->B, A->C, etc.) and then check that the resulting text does not contain any characters other than those allowed in the document's language.
  • jcubic
    jcubic almost 3 years
    Do you know any way to detect if a Buffer instance is encoded in CP437? I can use iconv to decode it, but first I need to detect whether it's CP437 or not. I've checked two third-party libraries: one detected that the file is ASCII and the other that it's UTF-16, and the latter was fuzzy because it detected about 5 different encodings and none was CP437. In PHP there are functions to detect the encoding.
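The brute-force idea from Stas Arshanski's comment can be sketched roughly as follows. The candidate list is limited to the two encodings Buffer handles natively, and the Spanish character whitelist is an illustrative assumption. Covering CP437 would need a third-party codec such as iconv-lite.

```javascript
// Encodings to try on both sides of the conversion (only ones Buffer supports natively).
const CANDIDATE_ENCODINGS = ['latin1', 'utf8'];

// Hypothetical whitelist of characters expected in Spanish text.
const SPANISH_ONLY = /^[\sA-Za-z0-9áéíóúüñÁÉÍÓÚÜÑ¿¡.,;:!?'"()-]*$/;

function plausibleRepairs(s, allowed = SPANISH_ONLY) {
  const results = [];
  for (const wrong of CANDIDATE_ENCODINGS) {
    for (const right of CANDIDATE_ENCODINGS) {
      // Re-encode with the suspected wrong encoding, then decode with the suspected right one.
      const text = Buffer.from(s, wrong).toString(right);
      if (allowed.test(text)) {
        results.push({ wrong, right, text });
      }
    }
  }
  return results;
}

// UTF-8 bytes that were misread as ISO-8859-1.
console.log(plausibleRepairs('\u00c2\u00bfQui\u00c3\u00a9n'));
```

As the answer points out, this remains guesswork: several pairs may pass the filter, or none, and a string that is already correct passes through the identity pairs unchanged.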