Is ISO-8859-1 a Unicode charset?

14,378

Solution 1

No, ISO 8859-1 is not a Unicode charset, simply because ISO 8859-1 does not provide encoding for all Unicode characters, only a small subset thereof. The word “charset” is sometimes used loosely (and therefore often best avoided), but as a technical term, it means a character encoding.

Loosening the definition so that “Unicode charset” would mean an encoding that covers part of Unicode would be pointless. Then every encoding would be a “Unicode charset”.

Solution 2

ISO 8859-1 is not Unicode

ISO 8859-1 is also known as Latin-1. It is not directly a Unicode format.

However, it does have the unique privilege that its code points 0x00 .. 0xFF map one-to-one to the Unicode code points U+0000 .. U+00FF. So, the first 256 code points of Unicode, treated as 1 byte unsigned integers, map to ISO 8859-1.

Control characters

Peregring-lk observes that ISO 8859-1 does not define the control codes. The Unicode charts for U+0000..U+007F and U+0080..U+00FF suggest that the C0 controls found in positions U+0000..U+001F and U+007F come from ISO/IEC 6429:1992 and the C1 controls found in positions U+0080..U+9F likewise. Wikipedia on the C0 and C1 controls suggests that the standard is ISO/IEC 2022 instead. Note that three of the C1 controls do not have a formal name.

In general parlance, the control code points of the ISO 8859-1 code set are assumed to be the C0 and C1 controls from ISO 6429 (or 2022).

Solution 3

ISO-8859-1 contains a subset of UTF-8 Unicode, which substantially overlaps with ASCII.

All ASCII is UTF-8 Unicode.

All the ISO 8859-1 (ISO Latin 1) characters below codes 7f hex are ASCII compatible and UTF-8 compatible in one byte. The ligatures and characters with diacritics use multi-byte Unicode UTF-8 representations, and use Unicode compatibility codepoints.

All UTF-8 single-byte character are contained in ASCII.

UTF-8 also contains multi-byte sequences, some of which are collatable (i.e. sortable) equivalents - composed equivalents - of the characters represented by compatibility codepoints, and some of which are the characters represented by all other characters sets other than ASCII and ISO Latin 1.

Solution 4

No. ISO/IEC 8859-1 is older than Unicode. For example, you won't find € in it. Unicode is compatible to ISO 8859-1 up to some point. For the coding of characters in Unicode look at UCS / UTF8 / UTF16.

If you look at code formats you have something like

Abstract letters - The letters you are using
Code table - Bring the letters in some form (like alphabetic ordering)
Code format - Say which position in the code table is which letter, (that is the UTF8 or UTF16 encoding)
Code schema - If you use more words for accessing a code position, in which order are they? (Big Endian, Little Endian in UTF16) [character encoding of steering instruction (e.g. < in XML)]

Solution 5

It depends on how you define "Unicode format."

I think most people would take it to mean an encoding capable of representing any codepoint in Unicode's range (U+0000 - U+10FFFF).

In that case, no, ISO 8859-1 is not a Unicode format.

However some other definitions might be 'a character set that is a subset of the Unicode character set,' or 'an encoding that can be considered to contain Unicode data (not necessarily arbitrary Unicode data).' ISO 8859-1 meets both of these definitions.

Unicode is a number of things. It contains a character set, in which 'characters' are assigned codepoint values. It defines properties for characters and provides a database of characters and their properties. It defines many algorithms for doing various things with Unicode text data, such as ways of comparing strings, of dividing strings into grapheme clusters, words, etc. It defines a few special encodings that can encode any Unicode codepoint and have some other useful properties. It defines mappings between Unicode codepoints and codepoints of legacy character sets.

Here you can find a more complete answer: Unicode.org

View more solutions

14,378

Author by

mdup

I'm a Machine Learning PhD. I currently work on ML for cybersecurity at Cisco. More...

Updated on June 15, 2022

Comments

mdup almost 2 years

I have been attending a lecture on XML where it was written "ISO-8859-1 is a Unicode format". It sounds wrong to me, but as I research on it, I struggle understanding precisely what Unicode is.

Can you call ISO-8859-1 a Unicode format ? What can you actually call Unicode ?