Why do English characters require fewer bytes to represent than other alphabets?

Solution 1

One of the first encoding schemes developed for use in mainstream computers was the ASCII (American Standard Code for Information Interchange) standard. It was developed in the 1960s in the United States.

The English alphabet uses part of the Latin alphabet (for instance, there are few accented words in English). There are 26 individual letters in that alphabet, not considering case. Any scheme that aims to encode the English alphabet also has to include the individual digits and punctuation marks.

The 1960s were also a time when computers didn't have the amount of memory or disk space that we have now. ASCII was developed to be a standard representation of a functional alphabet across all American computers. At the time, the decision to make every ASCII character 8 bits (1 byte) long was made due to technical details of the era (the Wikipedia article mentions the fact that perforated tape held 8 bits per position). In fact, the original ASCII scheme can be transmitted using 7 bits; the eighth could be used for parity checks. Later developments expanded the original ASCII scheme to include several accented, mathematical and terminal characters.

With the increase of computer usage across the world, more and more people who use different languages got access to computers. That meant that, for each language, new encoding schemes had to be developed independently of the others, and these schemes would conflict when text was read on terminals set up for a different language.

Unicode came as a solution to the existence of different terminals, by merging all possible meaningful characters into a single abstract character set.

UTF-8 is one way to encode the Unicode character set. It is a variable-width encoding (i.e. different characters can have different sizes) and it was designed for backwards compatibility with the former ASCII scheme. As such, the ASCII characters remain one byte each, while any other character takes two or more bytes. UTF-16 is another way to encode the Unicode character set. In comparison to UTF-8, characters are encoded as either one or two 16-bit code units.
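
For a concrete illustration, here is a minimal sketch in Python (my own example, not part of the original answer) comparing the two encodings; UTF-16 is encoded little-endian so the byte-order mark is not counted:

    # Compare how many bytes each character takes in UTF-8 vs. UTF-16.
    for ch in ("a", "ա"):
        utf8_len = len(ch.encode("utf-8"))
        utf16_len = len(ch.encode("utf-16-le"))
        print(ch, utf8_len, utf16_len)
    # 'a'  -> 1 byte in UTF-8, 2 bytes in UTF-16
    # 'ա' -> 2 bytes in UTF-8, 2 bytes in UTF-16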

As stated in the comments, the 'a' character occupies a single byte while 'ա' occupies two bytes, which indicates a UTF-8 encoding. The extra byte in the question was due to the existence of a newline character at the end (which the OP found out about).
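
The observation from the question can be reproduced with a small sketch (Python; the file path is hypothetical, and the trailing newline mimics what many editors and `echo` append by default):

    import os, tempfile

    path = os.path.join(tempfile.gettempdir(), "char.txt")  # hypothetical path
    for ch in ("a", "ա"):
        # newline="\n" keeps the line ending as a single LF byte on any OS
        with open(path, "w", encoding="utf-8", newline="\n") as f:
            f.write(ch + "\n")
        print(ch, os.path.getsize(path), "bytes")  # 2 for 'a', 3 for 'ա'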

Solution 2

1 byte is 8 bits, and can thus represent up to 256 (2^8) different values.

For languages that require more possibilities than this, a simple 1 to 1 mapping can't be maintained, so more data is needed to store a character.

Note that generally, most encodings use the first 7 bits (128 values) for ASCII characters. That leaves the 8th bit, or 128 more values, for further characters . . . add in accented characters, Asian languages, Cyrillic, etc., and you can easily see why 1 byte is not sufficient for keeping all characters.
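
A small sketch (Python, my own illustration) of the one-byte limit and of how UTF-8 uses that eighth bit:

    # One byte can hold 2**8 = 256 distinct values; ASCII only uses 0-127.
    print(2 ** 8)                                    # 256
    print(all(ord(c) < 128 for c in "Hello, 123!"))  # True: pure ASCII
    # In UTF-8, any byte with the high (8th) bit set belongs to a
    # multi-byte character:
    print(["{:08b}".format(b) for b in "ñ".encode("utf-8")])
    # ['11000011', '10110001']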

Solution 3

In UTF-8, ASCII characters use one byte, other characters use two, three, or four bytes.
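
A one-example-per-length sketch (Python; the sample characters are my own choices, not from the answer):

    # One example character for each UTF-8 length class.
    for ch in ("a", "ա", "অ", "𝄞"):   # 1, 2, 3 and 4 bytes respectively
        print(ch, len(ch.encode("utf-8")), "byte(s)")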

Solution 4

The number of bytes required for a character (which the question is apparently about) depends on the character encoding. If you use the ArmSCII encoding, each Armenian letter occupies just one byte. It’s not a good choice these days, though.

In the UTF-8 transfer encoding for Unicode, different characters need different numbers of bytes. In it, “a” takes just one byte (the idea about two bytes is some kind of confusion), “á” takes two bytes, and the Armenian letter ayb “ա” takes two bytes too; the reported three bytes must likewise be some kind of confusion. In contrast, e.g. the Bengali letter a “অ” takes three bytes in UTF-8.

The background is simply that UTF-8 was designed to be very efficient for Ascii characters, fairly efficient for writing systems in Europe and its surroundings, and less efficient for all the rest. This means that for basic Latin letters (which is what English text mostly consists of), only one byte is needed per character; for Greek, Cyrillic, Armenian, and a few others, two bytes are needed; everything else needs more.

UTF-8 also has (as pointed out in a comment) the useful property that Ascii data (when represented as 8-bit units, which has been almost the only way for a long time) is trivially UTF-8 encoded, too.
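
A quick check of that compatibility property (a Python sketch; the sample string is arbitrary):

    # Plain ASCII text produces identical bytes under ASCII and UTF-8,
    # so existing ASCII data is already valid UTF-8.
    text = "Hello, world!"
    assert text.encode("ascii") == text.encode("utf-8")
    print(text.encode("utf-8"))   # b'Hello, world!'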

Solution 5

Character codes in the 1960s (and long beyond) were machine-specific. In the 1980s I briefly used a DEC 2020 machine, which had 36-bit words, and 5-, 6- and 8-bit (IIRC) per-character encodings. Before that, I used an IBM 370 series with EBCDIC. ASCII with 7 bits brought order, but it got messy again with IBM PC "codepages" using all 8 bits to represent extra characters, like all sorts of box-drawing ones to paint primitive menus, and later with ASCII extensions (8-bit encodings, with the first 128 values like ASCII and the other half for "national characters" like ñ, Ç, or others). Probably the most popular of these was Latin-1, tailored to English and most European languages that use Latin characters (and their accents and variants).

Writing text that mixed, say, English and Spanish went fine (just use Latin-1, a superset of both), but mixing anything that used a different encoding (say a snippet of Greek or Russian, not to mention an Asian language like Japanese) was a veritable nightmare. Worse, Russian and particularly Japanese and Chinese each had several popular, completely incompatible encodings.

Today we use Unicode, which is coupled with efficient encodings like UTF-8 that favor English characters (surprisingly, the encoding for English letters just happens to correspond to ASCII), thus making many non-English characters use longer encodings.

Comments

  • khajvah
    khajvah over 1 year

    When I put 'a' in a text file, it makes it 2 bytes, but when I put, let's say, 'ա', which is a letter from the Armenian alphabet, it makes it 3 bytes.

    What is the difference between alphabets for a computer?
    Why does English take less space?

    • Admin
      Admin about 10 years
      You should read this article by the founder of StackExchange: joelonsoftware.com/articles/Unicode.html
    • Admin
      Admin about 10 years
      @Raphael everybody knows what he is referring to though. But nice add.
    • Admin
      Admin about 10 years
      Your problem is that you are using UTF-16 or something and not the better, more space-saving UTF-8.
    • Admin
      Admin about 10 years
      @Raphael I don't think there is such a thing as “Roman characters”. They are Latin.
    • Admin
      Admin about 10 years
      @Raphael What about 'merican characters?
    • Admin
      Admin about 2 years
      Yes, English and Latin characters were originally designed to fit the ANSI character sets, which take less space, whereas glyphs in other languages can carry more information per character; ideally, a long paragraph in English could be transformed into just a few lines in another script, for instance Devanagari.
  • MaQleod
    MaQleod about 10 years
    Can you elaborate on why this is? Noting two encoding methods doesn't quite answer the question.
  • Jason
    Jason about 10 years
    @MaQleod Unicode was created to replace ASCII. For backwards compatibility, the first 128 characters are the same. These 128 characters can be expressed with one byte. Additional bytes are added for additional characters.
  • MaQleod
    MaQleod about 10 years
    I'm aware, but that is part of the answer to the question as to what makes the ASCII characters different. It should be explained to the OP.
  • Joce NoToPutinsWarInUkraine
    Joce NoToPutinsWarInUkraine about 10 years
    The last byte codes the end of file.
  • Jason
    Jason about 10 years
    @MaQleod It could also be said that the Unicode Consortium was mostly comprised of American corporations and was biased towards English language characters. I thought a simple answer was better than a subjective one.
  • Doktoro Reichard
    Doktoro Reichard about 10 years
    That would make sense actually... although I don't see its effects in Notepad.
  • Joce NoToPutinsWarInUkraine
    Joce NoToPutinsWarInUkraine about 10 years
    Without the last byte, Notepad (or any other tool) wouldn't know when to stop reading from the storage medium. But end-of-file is not shown in such tools, of course.
  • Jukka K. Korpela
    Jukka K. Korpela about 10 years
    There is no last byte that codes the end of file, in any normal encoding or file format. When a program reads a file, end of file might be signalled by the OS in a special way, but that’s a different issue.
  • khajvah
    khajvah about 10 years
    I use linux if it helps.
  • Doktoro Reichard
    Doktoro Reichard about 10 years
    @Joce I understand that the EOF char isn't represented in Notepad; what I was referring to was the file size that Windows Explorer reported, which after writing one char was 1 byte. That would mean Explorer specifically forgets about the null char.
  • Dan Is Fiddling By Firelight
    Dan Is Fiddling By Firelight about 10 years
    The ա character is 2 bytes (0xD5A1) in UTF-8; the extra character (whatever it is) is present in both files. marathon-studios.com/unicode/U0561/Armenian_Small_Letter_Ayb
  • HikeMike
    HikeMike about 10 years
    @khajvah If you echo 'ա' > file.txt, or edit the file using some editors, they automatically add a newline after it. If you run xxd file.txt, the last byte will probably be a 0a, or line feed.
  • khajvah
    khajvah about 10 years
    @DanielBeck Yes, that's the case. echo added a newline at the end.
  • khajvah
    khajvah about 10 years
    Thank you for the answer. The additional bytes are there because the program I used automatically added a newline character at the end.
  • user1686
    user1686 about 10 years
    @DoktoroReichard: Please clarify in the answer that Unicode is not an encoding; rather, it's an abstract character set, and UTF-16 and UTF-8 are encodings of Unicode codepoints. The last paragraphs of your answer mostly talk about UTF-8. But if a file uses UTF-16, then any codepoint, even the one for a, will use two bytes (or a multiple of two).
  • ntoskrnl
    ntoskrnl about 10 years
    It's also probably worth emphasizing that the "extended ASCII" character sets are in fact not ASCII at all, and the number of different ways to utilize the eighth bit makes it all a big mess. Just use UTF-8 instead.
  • ntoskrnl
    ntoskrnl about 10 years
    Notepad (and Windows in general) uses confusing terminology here. "ANSI" is a locale-dependent single byte encoding (Windows-1252 on English versions), and "Unicode" is UTF-16.
  • Doktoro Reichard
    Doktoro Reichard about 10 years
    @ntoskrnl The extensions that IBM (and many, many others) made to the ASCII standard came about as a need to represent more things that the characters already present weren't able to, on the terminals at the time. Also, several European countries still use the provided character sets, despite the existence of Unicode.
  • MSalters
    MSalters about 10 years
    No American bias. Unicode is an extension of ISO-8859-1; the first 256 characters are the same. In turn, ISO-8859-1 is an extension of ASCII because most of Europe needed ASCII as a subset.
  • mpez0
    mpez0 about 10 years
    ASCII proper is 7 bits, not 8
  • Félix Gagnon-Grenier
    Félix Gagnon-Grenier about 10 years
    so here is the only answer actually explaining why more space is used
  • Sebastian Negraszus
    Sebastian Negraszus about 10 years
    Not "in Unicode", in UTF8 - which is just one of several encodings of the Unicode character set.
  • KutuluMike
    KutuluMike about 10 years
    This answer isn't even accurate. In UTF-16 encoded Unicode (like C# and Java use) most characters, including the original ASCII set, take up 2 bytes, while very obscure characters take up 4.
  • Darryl Braaten
    Darryl Braaten about 10 years
    @ntoskrnl That is correct, but if you are looking in the drop box for encoding it says ANSI, which is why I mentioned if you have a different OEM codepage you may get different results.
  • user
    user about 10 years
    I don't think UTF-8 was so much designed for efficiency with ASCII data as for compatibility. UTF-8 has the very nice property that 7-bit ASCII content (with the high bit set to zero) is identical to the same content encoded as UTF-8, so for tools that normally deal with ASCII, it's a drop-in replacement. No other Unicode encoding scheme has that property, to my knowledge. UTF-8 is also reasonably compact for most data, particularly if you stay within the realm of the Unicode BMP.
  • Jukka K. Korpela
    Jukka K. Korpela about 10 years
    @MichaelKjörling, I’ve added a reference to that feature. However, a major objection to Unicode in the early days was inefficiency, and UTF-16 doubles the size of data that is dominantly Ascii. UTF-8 means, e.g. for English text, that you only “pay” for the non-Ascii characters you use.
  • Damon
    Damon about 10 years
    The paragraph about Unicode is wrong, though. Unicode is not a solution to the existence of different terminals. On the contrary, Unicode is entirely unsuitable for what it is being used for. It is not a character set, but a grapheme encoding (for "anything man has ever written"), which includes graphemes of languages that no living person speaks and multiple ambiguous encodings for the same graphemes. This makes it an extremely poor choice for computer text processing, introducing many twists and pitfalls, and significant overhead (for such things as e.g. "normalization").
  • Doktoro Reichard
    Doktoro Reichard about 10 years
    @Damon it is a solution. I never said it was the best.
  • Milind R
    Milind R over 9 years
    @Damon Can you elaborate on why it's a very poor choice for computer text processing.
  • Damon
    Damon over 9 years
    @MilindR: hard to fit into 600 chars... Unicode contains a lot of crap that nobody will ever need (do you speak Babylonian?), and it encodes a lot of crap that nobody will seriously need (Klingon, really? Numbers in circles? ANSI control codes?), some of which are in low numbers, making UTF-8 considerably less efficient for roman languages than it could be (at no extra cost). Also, it allows a considerable number of symbols being encoded in two or more ways (e.g. accented/umlauted characters). This requires considerable work ("normalization") that would actually not be necessary.
  • Damon
    Damon over 9 years
    Unicode makes the assumption that you may need every character that any human since the stone age has ever drawn at any time, as special characters. 2/3 of that could be solved easier, better, and more efficiently by using a different font or formatting hints (like, numbers in circles, or superscript numbers). It certainly "works", somehow, but it's wrong-headed on so many ends.
  • Milind R
    Milind R over 9 years
    @Damon normalization is the natural consequence of an evolving standard. Numbers in circles, superscript numbers... well, I have to agree with you. Still, it seems UTF-8 is more at fault than anything else.. The code points aren't unworthy of existing, just unworthy of precious 7-bit space. On that note : programmers.stackexchange.com/questions/266292/…
  • n611x007
    n611x007 over 7 years
    -1 from me for sounding sloppy with "first ... in mainstream" (sorry, my bad mood... hope we can do better)