Is UTF-8 the encoding of choice for QR codes with non-ASCII characters by now?


Solution 1

The specification says that ISO-8859-1 is the default for byte-mode encoding. In practice, however, yes: you'll see a lot of UTF-8, and a lot of Shift_JIS in Japan.

UTF-8 is the right choice. To do it properly, you need to put some indication in the stream that it's UTF-8. The spec does allow for this. You need to precede the byte segment with an ECI segment that indicates UTF-8.
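As a rough illustration of "an ECI segment followed by a byte segment", here is a minimal Python sketch of the start of such a bit stream. It assumes the standard mode indicators (0111 for ECI, 0100 for byte mode), ECI assignment number 000026 for UTF-8, and an 8-bit character count (the size used by QR versions 1 to 9). The function names are made up for this example, and the terminator, padding and error correction are deliberately left out.

```python
def eci_designator_bits(assignment: int) -> str:
    """Encode an ECI assignment number as 8, 16 or 24 bits."""
    if assignment < 0 or assignment > 999999:
        raise ValueError("ECI assignment number out of range")
    if assignment <= 127:
        return format(assignment, "08b")            # 0bbbbbbb
    if assignment <= 16383:
        return "10" + format(assignment, "014b")    # 10bbbbbb bbbbbbbb
    return "110" + format(assignment, "021b")       # 110bbbbb bbbbbbbb bbbbbbbb


def utf8_byte_segment_with_eci(text: str, count_bits: int = 8) -> str:
    """Return the header and data bits for `text` as ECI(UTF-8) + byte mode.

    count_bits=8 is an assumption that fits QR versions 1-9; larger
    versions use a 16-bit character count in byte mode.
    """
    data = text.encode("utf-8")
    if len(data) >= (1 << count_bits):
        raise ValueError("text too long for the chosen character count size")
    bits = "0111"                                   # ECI mode indicator
    bits += eci_designator_bits(26)                 # ECI 000026 = UTF-8
    bits += "0100"                                  # byte mode indicator
    bits += format(len(data), f"0{count_bits}b")    # number of data bytes
    bits += "".join(format(b, "08b") for b in data)
    return bits
```

Note that the character count is the number of UTF-8 bytes, not the number of characters.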

The zxing encoder will do that for you if you send it a hint that the encoding is UTF-8.

Solution 2

BOM does not help

In my experience, a BOM does not help. If a QR scanner cannot display a properly encoded UTF-8 string (8-bit byte mode in the data stream), even with an ECI, adding a BOM makes no difference.
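For reference, this is all a BOM is at the byte level: three extra bytes in front of the payload, stored inside the byte segment itself rather than in the QR header. The short Python sketch below only shows the bytes involved; it says nothing about how any particular scanner treats them.

```python
import codecs

text = "Привет"
plain = text.encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain      # BOM is the 3 bytes EF BB BF

# A plain UTF-8 decoder keeps the BOM as a leading U+FEFF character;
# Python's "utf-8-sig" codec strips it.
decoded_plain = with_bom.decode("utf-8")
decoded_sig = with_bom.decode("utf-8-sig")
```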

Scanners fail even on properly encoded UTF-8

As an example of a scanner that cannot display a proper UTF-8 string, take Xiaomi phones with MIUI Global v11.0.3 (with their native scanner application). These phones cannot correctly show a string of Cyrillic characters encoded in UTF-8 even if this charset is specified in the ECI. The Cyrillic characters are shown as question marks. But if you add a Chinese/Japanese character (e.g. 日) to the Cyrillic text, the whole text will be displayed correctly by Xiaomi. This is regardless of BOM.

It is the actual characters that matter, not the encoding

You assumed that UTF-8 is a better choice than ISO-8859-1 in QR codes because ISO-8859-1 was not the default encoding in the earlier QR code standard published in 2000 (ISO/IEC 18004:2000). That standard specified the 8-bit Latin/Kana character set in accordance with JIS X 0201 (JIS8) as the default encoding for 8-bit mode, while the updated standard published in 2005 changed the default to ISO-8859-1. Hence your conclusion that “it mostly does not work to use iso-8859 for encoding”. Whether that matters depends on whether US-ASCII characters are enough for you (specifically, the printable ANSI X3.4-1986 characters in the range 20-7E), or whether you also need the ISO-8859-1 characters with umlaut/diaeresis used in languages such as Catalan, French, Galician, German, Occitan and Spanish.

If you only need US-ASCII, it is safe to use ISO-8859-1 without any ECI rather than UTF-8 with an ECI. The octet string for US-ASCII characters in the range 20-7E is the same whether it is encoded as ISO-8859-1 or UTF-8, so the heuristics used by scanners should be able to work out the character set automatically. If you need characters with umlaut/diaeresis, go with UTF-8. This is not because the default encoding changed from JIS X 0201 to ISO-8859-1 between the 2000 and 2005 revisions of the QR code standard, but because QR scanners use heuristics to detect the encoding automatically, and these heuristics sometimes fail.
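The byte-level claim is easy to check. The following Python sketch (the helper name is made up for this example) compares the ISO-8859-1 and UTF-8 encodings of a string: they are identical for printable US-ASCII and diverge as soon as a character such as ü appears.

```python
def same_bytes_in_latin1_and_utf8(text: str) -> bool:
    """True if `text` produces identical bytes in ISO-8859-1 and UTF-8."""
    try:
        return text.encode("iso-8859-1") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains characters that ISO-8859-1 cannot represent.
        return False


ascii_text = "Hello, QR!"
umlaut_text = "Grüße"

ascii_is_same = same_bytes_in_latin1_and_utf8(ascii_text)     # identical bytes
umlaut_is_same = same_bytes_in_latin1_and_utf8(umlaut_text)   # bytes differ

# ü is one byte (FC) in ISO-8859-1 but two bytes (C3 BC) in UTF-8.
u_latin1 = "ü".encode("iso-8859-1")
u_utf8 = "ü".encode("utf-8")
```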

Why QR scanners use heuristics to detect encoding

As you know, there are 4 modes of storing text in a QR code: (1) numeric, (2) alphanumeric, (3) 8-bit, and (4) Kanji.
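To make the four modes concrete, here is a simplified Python sketch that picks a single mode for a whole string. The function names are invented for this example. It assumes the usual 45-character alphanumeric set and the Kanji-mode Shift_JIS ranges 8140-9FFC and E040-EBBF; real encoders typically split the input into several segments with different modes, which this sketch does not attempt.

```python
ALPHANUMERIC = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")


def is_kanji_mode_text(text: str) -> bool:
    """True if every character is a 2-byte Shift_JIS code in the Kanji-mode ranges."""
    try:
        raw = text.encode("shift_jis")
    except UnicodeEncodeError:
        return False
    if len(raw) != 2 * len(text):       # some character took a single byte
        return False
    for i in range(0, len(raw), 2):
        value = (raw[i] << 8) | raw[i + 1]
        if not (0x8140 <= value <= 0x9FFC or 0xE040 <= value <= 0xEBBF):
            return False
    return True


def pick_mode(text: str) -> str:
    """Pick the most compact single mode that can hold `text`."""
    if not text:
        return "byte"
    if all(ch in "0123456789" for ch in text):
        return "numeric"
    if all(ch in ALPHANUMERIC for ch in text):
        return "alphanumeric"
    if is_kanji_mode_text(text):
        return "kanji"
    return "byte"
```

Anything that reaches the "byte" branch is where the character set question (ISO-8859-1, JIS8, UTF-8 with ECI) comes up.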

So the QR code standard does not inherently support UTF-8. To use UTF-8 (instead of the default “ISO-8859-1” or “JIS8”) in the 8-bit string, the implementation has to insert an ECI (Extended Channel Interpretation) before that string. ECI is an optional, additional feature of QR codes, but it was already defined in the earliest QR code standard, at least as of 2000. ECI enables data to be encoded using character sets other than the default. It also allows other data interpretations (e.g. data compacted using defined compression schemes) or other industry-specific requirements to be encoded.

The ECI protocol is defined in a specification developed by AIM, Inc. It is not available for free but can be purchased for $50 at https://www.aimglobal.org/technical-symbology.html

Scanners may ignore the ECI protocol

Unfortunately, not all QR scanners can handle the ECI protocol, even for something as basic as changing the default encoding to UTF-8. Most implementations use heuristics, i.e. some character encoding detection algorithm that guesses the encoding, even when the encoding is specified explicitly in the ECI of the decoded QR code. They use heuristics not only because the default encoding changed from JIS8 to ISO-8859-1 between 2000 and 2005. The main reason is the lack of proper ECI protocol support, probably caused by the fact that the QR code specification and the AIM ECI protocol specification are separate documents. Some QR encoders do not specify the character encoding via ECI and use different encodings for an 8-bit string (JIS8, Shift_JIS, ISO-8859-1, UTF-8), so scanners have to cope with that.
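To illustrate why guessing is fragile, here is a deliberately naive Python guesser. It is a hypothetical sketch, not the algorithm of any real scanner: it tries a list of candidate encodings in order and returns the first one that decodes without error. The same two bytes come out as three different strings depending on the order of the candidates.

```python
def guess_decode(payload: bytes,
                 candidates=("utf-8", "shift_jis", "iso-8859-1")):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return name, payload.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the payload")


payload = "é".encode("utf-8")           # the two bytes C3 A9

# UTF-8 tried first: the text comes back as intended.
utf8_first = guess_decode(payload)

# Shift_JIS tried first: both bytes are valid half-width katakana,
# so the guesser "succeeds" with the wrong text.
sjis_first = guess_decode(payload, ("shift_jis", "utf-8", "iso-8859-1"))

# ISO-8859-1 accepts any byte sequence, so it never fails and never warns.
latin1_only = guess_decode(payload, ("iso-8859-1",))
```

Real detectors are more sophisticated than "first codec that does not raise", but they face the same underlying ambiguity: many byte sequences are valid in more than one encoding.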

You wrote that “it seems like utf-8 is the only choice”, but scanners use heuristics that may fail even with UTF-8, as in the Xiaomi example above. You also wrote that UTF-8 “is against the specification”, but that is true only if the UTF-8 encoding is not explicitly specified via ECI.

An alternative to ECI and UTF-8, but not a complete cure

P.S. There is an alternative to using ECI. You can encode Latin characters with umlaut/diaeresis or Cyrillic characters using the “Kanji” mode. In this mode, Shift_JIS is used to encode JIS X 0208 characters in the ranges 8140-9FFC and E040-EBBF. You cannot encode characters outside these ranges, so a space cannot be encoded as byte 20; instead, you can encode it as the JIS X 0208 space at row 1, column 1 (JIS code 2121, Shift_JIS 8140). Since JIS X 0208 has rows for Roman (row 3), Greek (row 6) and Cyrillic (row 7), as well as special characters like punctuation (rows 1 & 2), you can encode Latin characters with umlaut/diaeresis or Cyrillic text (including spaces and punctuation) entirely in the JIS character ranges 8140-9FFC and E040-EBBF. No ECI extension is needed in this case. But there is no guarantee that the heuristics in the scanner software will not break your properly encoded text.
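As a sketch of what Kanji mode does with those Shift_JIS codes, the Python below packs each two-byte code into the 13-bit value used in the data stream. It assumes the usual Kanji-mode compaction: subtract 8140 or C140 depending on the range, multiply the high byte of the result by C0, and add the low byte. The function names are made up for this example, and the mode indicator and character count are omitted.

```python
def kanji_mode_value(sjis: int) -> int:
    """Compact one 2-byte Shift_JIS code into a 13-bit Kanji-mode value."""
    if 0x8140 <= sjis <= 0x9FFC:
        offset = sjis - 0x8140
    elif 0xE040 <= sjis <= 0xEBBF:
        offset = sjis - 0xC140
    else:
        raise ValueError(f"{sjis:04X} is outside the Kanji-mode ranges")
    return (offset >> 8) * 0xC0 + (offset & 0xFF)


def kanji_mode_bits(text: str) -> list:
    """Return one 13-bit string per character of `text`."""
    raw = text.encode("shift_jis")
    if len(raw) != 2 * len(text):
        raise ValueError("text contains single-byte Shift_JIS characters")
    return [
        format(kanji_mode_value((raw[i] << 8) | raw[i + 1]), "013b")
        for i in range(0, len(raw), 2)
    ]


# Cyrillic text with an ideographic space (U+3000) instead of ASCII space.
bits = kanji_mode_bits("Привет\u3000мир")
```

An ASCII space (byte 20) makes `kanji_mode_bits` raise, which is why the example uses U+3000 instead.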

Conclusion

Using UTF-8 and specifying it via ECI is not a complete cure (because some scanners will use error-prone heuristics anyway), but at least it helps with compliant scanners, unlike a BOM, which does not help at all.

Author by

Gonzo

Updated on June 15, 2022

Comments

  • Gonzo
    Gonzo almost 2 years

    Google uses UTF-8 as the default for their very popular encoder. From what I can see, they don't even add the byte order mark.

    The problem is that most scanners still seem to use JIS8 (QR 2000) instead of iso-8859 (QR 2005) as default, so it mostly does not work to use iso-8859 for encoding.

    It seems like utf-8 is the only choice even if it is against the specification.

    edit: I will go with utf-8 without ECI and without BOM. Against all spec and spirit but works best at the moment.