What is the difference between UTF-8 and ISO-8859-1?

Solution 1

UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.

Solution 2

Wikipedia explains both reasonably well: UTF-8 vs Latin-1 (ISO-8859-1). The former is a variable-length encoding; the latter is a fixed-length, single-byte encoding. Latin-1 encodes just the first 256 code points of the Unicode character set, whereas UTF-8 can encode all code points. At the byte level, only code points 0-127 are encoded identically; code points 128-255 become 2-byte sequences in UTF-8, whereas they are single bytes in Latin-1.
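The claim about the first 256 code points can be checked directly. This is only a small sketch using Python's built-in codecs; the variable names are illustrative:

```python
# Compare how Latin-1 and UTF-8 encode each of the first 256 code points.
identical = [
    cp for cp in range(256)
    if chr(cp).encode("latin-1") == chr(cp).encode("utf-8")
]

# Byte lengths in UTF-8 for the upper half (128-255).
upper_lengths = {len(chr(cp).encode("utf-8")) for cp in range(128, 256)}

print(identical == list(range(128)))  # only 0-127 encode identically
print(upper_lengths)                  # every code point in 128-255 needs 2 bytes
```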

Solution 3

UTF

UTF is a family of multi-byte encoding schemes that represent Unicode code points. The original UTF-8 design allowed sequences of up to 6 bytes, enough for 2^31 [roughly 2 billion] values. UTF-8 as standardized today is a flexible encoding that uses between 1 and 4 bytes per code point; 4 bytes carry 21 payload bits [2^21, roughly 2 million values], although Unicode itself stops at U+10FFFF, which gives 1,114,112 code points.

Long story short: any character with a code point/ordinal representation of 127 or below, aka 7-bit-safe ASCII, is represented by the same 1-byte sequence as in most other single-byte encodings. Any character with a code point above 127 is represented by a sequence of two or more bytes, with the particulars of the encoding best explained here.
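As a rough illustration of the variable length, the sketch below encodes one sample character from each length class. The sample characters and the helper name `utf8_len` are chosen here for illustration only:

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# One sample per length class: ASCII, Latin-1 range, BMP, supplementary plane.
samples = ["A", "é", "€", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({utf8_len(ch)} byte(s))")
```

Only the first sample (code point at or below 127) comes out as a single byte; the others take 2, 3, and 4 bytes respectively.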

ISO-8859

ISO-8859 is a family of single-byte encoding schemes used to represent alphabets that fit within the byte range 128 to 255. These various alphabets are defined as "parts" in the format ISO-8859-n, the most familiar of these likely being ISO-8859-1 aka 'Latin-1'. As with UTF-8, 7-bit-safe ASCII remains unaffected regardless of the encoding family used.

The drawback to this encoding scheme is its inability to accommodate languages comprised of more than 128 symbols, or to safely display more than one family of symbols at one time. ISO-8859 encodings have also fallen out of favor with the rise of UTF: the ISO "Working Group" in charge of them disbanded in 2004, leaving maintenance up to its parent subcommittee.

Windows Code Pages

It's worth mentioning that Microsoft also maintains a set of character encodings with limited compatibility with ISO-8859, usually denoted as "cp####". Microsoft seems to be moving its recent product releases toward Unicode in one form or another, but for legacy and/or interoperability reasons you're still likely to run into these code pages.

For example, cp1252 is a superset of ISO-8859-1, containing additional printable characters in the 0x80-0x9F range, notably the Euro symbol and the much maligned "smart quotes" “”. This frequently leads to a mismatch where 8859-1 text can be displayed as 1252 perfectly fine, and 1252 text may seem to display fine as 8859-1, but will misbehave when one of those extra symbols shows up.
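The mismatch can be reproduced with Python's built-in `cp1252` and `latin-1` codecs. This is a minimal sketch; the sample strings are made up for illustration:

```python
# Text that stays within Latin-1 encodes to the same bytes in both encodings.
plain = "café"
same_bytes = plain.encode("cp1252") == plain.encode("latin-1")

# Text using the extra cp1252 characters (smart quotes, Euro sign).
fancy = "“quoted” €5"
data = fancy.encode("cp1252")          # the extras land in 0x80-0x9F

# Reading those bytes as Latin-1 does not fail, but the extras turn into
# invisible C1 control characters instead of the intended symbols.
misread = data.decode("latin-1")

print(same_bytes)
print(data)
print(repr(misread))
```

Decoding as Latin-1 never raises an error, which is part of why the mix-up tends to go unnoticed until one of the extra symbols appears.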

Aside from cp1252, the Turkish cp1254 is a similar superset of ISO-8859-9, but all other Windows Code Pages have at least some fundamental conflicts, if not differing entirely from their 8859 equivalent.

Solution 4

  • ASCII: 7 bits. 128 code points.

  • ISO-8859-1: 8 bits. 256 code points.

  • UTF-8: 8-32 bits (1-4 bytes). 1,112,064 code points.

Both ISO-8859-1 and UTF-8 are backwards compatible with ASCII, but UTF-8 is not backwards compatible with ISO-8859-1:

#!/usr/bin/env python3

c = chr(0xa9)
print(c)
print(c.encode('utf-8'))
print(c.encode('iso-8859-1'))

Output:

©
b'\xc2\xa9'
b'\xa9'
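A short follow-up sketch of what that incompatibility looks like when bytes are decoded with the wrong encoding (variable names are illustrative):

```python
c = chr(0xA9)  # the copyright sign

utf8_bytes = c.encode("utf-8")         # b'\xc2\xa9'
latin1_bytes = c.encode("iso-8859-1")  # b'\xa9'

# UTF-8 bytes read as ISO-8859-1: no error, but each byte becomes its own character.
mojibake = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 bytes read as UTF-8: 0xA9 alone is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# The misreading is reversible as long as the bytes were left untouched.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")

print(mojibake)       # Â©
print(decode_failed)  # True
print(repaired)       # ©
```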

Solution 5

ISO-8859-1 is a legacy standard from the 1980s. It can only represent 256 characters, so it is only suitable for some languages of the Western world. Even for many supported languages, some characters are missing. If you create a text file in this encoding and try to copy/paste some Chinese characters into it, you will see weird results. So in other words, don't use it. Unicode has taken over the world and UTF-8 is pretty much the standard these days, unless you have some legacy reasons (like HTTP headers, which need to be compatible with everything).
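The "weird results" with Chinese text can be sketched as follows. What an editor actually does depends on the program; here the assumption is that it either rejects the text or substitutes a placeholder, which is what Python's codec error handlers do:

```python
text = "中文"

# Strict encoding fails: these characters have no ISO-8859-1 representation.
try:
    text.encode("iso-8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With errors="replace", each unencodable character silently becomes "?".
lossy = text.encode("iso-8859-1", errors="replace")

# UTF-8 handles the same text without loss.
utf8 = text.encode("utf-8")

print(strict_ok)                     # False
print(lossy)                         # b'??'
print(utf8.decode("utf-8") == text)  # True
```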

Author by

Jagadesh

Updated on August 03, 2022

Comments

  • Jagadesh
    Jagadesh over 1 year

    What is the difference between UTF-8 and ISO-8859-1?

  • StaxMan
    StaxMan about 12 years
    @mu maybe my statement was ambiguous, but it is not incorrect -- I was not talking about encoded byte sequences, but rather character sets being encoded; meaning that ISO-8859-1 is used to encode first 256 code points of the Unicode character set.
  • mu is too short
    mu is too short about 12 years
    Your clarification works for me and "ambiguous" would have been a better word choice than "incorrect".
  • Klaider
    Klaider over 6 years
    Helpful, but I think you meant 127 instead of 255 in extended-ascii 255?
  • Marlin Pierce
    Marlin Pierce almost 6 years
    Latin-1, or ISO-8859-1, is not 100% compatible with UTF-8 at the byte level. Any Latin-n or ISO-8859-n character above 127 will not be translated to a single-byte UTF-8 character. However, values 0-127 translate exactly.
  • Hritik
    Hritik almost 6 years
    One thing to note is that ASCII extends from 0 to 127 only. The MSB is always 0.
  • rdb
    rdb over 5 years
    This answer is a bit confusing in its use of the term "extended ascii", which just is a term to refer to any character encoding that is not ASCII. UTF-8 and latin-1 are examples of extended-ASCII encodings. But, non-ascii latin-1 characters (ie. code points above 127) cannot be encoded as a single byte in UTF-8.
  • Aggie Jon of 87
    Aggie Jon of 87 over 5 years
    I had seen claims that umlauts are not converted correctly with UTF-8. We saw examples of this, and in searching we found ISO-8859-1, which seems to work. We work with a lot of German scientists.
  • Erik Aronesty
    Erik Aronesty about 5 years
    Umlauts are represented as two bytes in UTF-8. They convert fine and work well. The problem comes from programs that expect 1 byte per character. For these legacy programs, ISO-8859-1 has 1-byte umlauts.
  • Tom Loredo
    Tom Loredo almost 5 years
    +1 for answering the question but going beyond and offering info about related encodings. Re: code points for UTF-8, according to stackoverflow.com/a/38488358/3353984, UTF-8 supports 2^21 code points. Is that an error, or might a fix be needed here?
  • Rohan Bhale
    Rohan Bhale over 4 years
    When code points above 127 are defined, the encoding system is a version of Extended ASCII.
  • georgeawg
    georgeawg almost 4 years
    Unicode is actually 17 planes of 2^16 code points, 0x00_0000 to 0x10_FFFF. The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment, about 1 million. See How many characters can UTF-8 encode?.
  • Mr Lister
    Mr Lister over 3 years
    @RohanBhale Don't use the phrase Extended ASCII; it'll only cause confusion.
  • Rohan Bhale
    Rohan Bhale over 3 years
    But extended ascii might be the correct term. I read it on multiple resources
  • Chris Morgan
    Chris Morgan over 3 years
    Oops! Thought I’d written that, but I lost it in a rewrite. I’ve put it in now.
  • silicontrip
    silicontrip about 3 years
    In UTF-8 2 byte encodings begin at 128. However there are matching characters in both, so it is possible to go: ISO 8859-1 -> UTF-8 -> ISO 8859-1 losslessly but if there are any characters in a UTF-8 document greater than 255 then it cannot be converted losslessly.
  • AndreasRu
    AndreasRu over 2 years
    "So in other words, don't use it." I wouldn't say so, because there are use cases where ISO-8859-1 suits much better than UTF-8: a single byte per character and 256 characters can be sufficient, resulting in faster processing and less payload.
  • Caleb McNevin
    Caleb McNevin over 2 years
    Just as an example of where single byte encoding is preferred, SMS messages have a limit of 140 bytes and primarily use single-byte encoding. If you were a business that sends automated SMS messages, you don't want to double your cost just to not use a legacy standard.
  • CYPS84
    CYPS84 over 1 year
    I always heard it as High ASCII.