Unicode, Unicode Big Endian or UTF-8? What is the difference? Which format is better?

Solution 1

Dunno. Which is better: a saw or a hammer? :-)

Unicode isn't UTF

There's a bit in the article that's a bit more relevant to the subject at hand though:

  • UTF-8 focuses on minimizing the byte size for representation of characters from the ASCII set (variable-length representation: each character is represented in 1 to 4 bytes, and ASCII characters all fit in 1 byte). As Joel puts it:

“Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings

  • UTF-32 focuses on exhaustiveness and fixed-length representation, using 4 bytes for all characters. It’s the most straightforward translation, mapping the Unicode code point directly to 4 bytes. Obviously, it’s not very size-efficient.

  • UTF-16 is a compromise, using 2 bytes most of the time, but expanding to 2 * 2 bytes per character to represent certain characters, those not included in the Basic Multilingual Plane (BMP).
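To make the three bullets above concrete, here is a small Python sketch that counts the bytes each encoding needs for a few sample characters. The sample characters and the helper name `encoded_sizes` are my own choices for illustration, not something from the article. The `-le` codec variants are used only so that Python does not prepend a byte order mark.

```python
# Byte counts for sample characters in UTF-8, UTF-16 and UTF-32.
# The "-le" codecs are used so that no byte order mark is added to the output.

def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte lengths for a single character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

# Illustrative samples: ASCII, Latin-1 range, CJK (inside the BMP), emoji (outside the BMP).
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    u8, u16, u32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8={u8}  UTF-16={u16}  UTF-32={u32}")
```

Only the last sample lies outside the BMP, so it is the only one where UTF-16 needs 4 bytes.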

Also see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Solution 2

For European languages, UTF-8 is smaller. For Oriental languages, the difference is not so clear-cut.

Both will handle all possible Unicode characters, so it should make no difference in compatibility.
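A quick way to check this for your own text is to encode it both ways and compare the lengths. The sketch below uses three sample strings that I picked for illustration (English, German, Japanese); `size_report` is a hypothetical helper name.

```python
# Compare encoded sizes of the same text in UTF-8 and UTF-16.
# "utf-16-le" is used so that no byte order mark is counted.

def size_report(text):
    """Return a dict mapping encoding name to encoded length in bytes."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}

english = "Hello, world"
german = "Gr\u00fc\u00dfe aus K\u00f6ln"                         # "Gruesse aus Koeln" with umlauts and sharp s
japanese = "\u3053\u3093\u306b\u3061\u306f\u4e16\u754c"          # "konnichiwa sekai" in kana and kanji

for label, text in (("English", english), ("German", german), ("Japanese", japanese)):
    print(label, size_report(text))

# Both encodings round-trip every character, so nothing is lost either way.
for text in (english, german, japanese):
    assert text.encode("utf-8").decode("utf-8") == text
    assert text.encode("utf-16-le").decode("utf-16-le") == text
```

For these samples UTF-8 is smaller for the English and German strings and larger for the Japanese one. Real documents mix in ASCII punctuation, digits and markup, which is why the difference is less clear-cut in practice.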

Solution 3

There are more Unicode character encodings than you may think.

  • UTF-8

    The UTF-8 encoding is variable-width, ranging from 1 to 4 bytes per character, with the upper bits of each byte reserved as control bits. The leading bits of the first byte indicate the total number of bytes used for that character. The scalar value of a character's code point is the concatenation of the non-control bits. In this table, x represents the lowest 8 bits of the Unicode value, y represents the next higher 8 bits, and z represents the bits higher than that.

    Unicode              Byte1     Byte2     Byte3     Byte4
    U+0000-U+007F       0xxxxxxx            
    U+0080-U+07FF       110yyyxx  10xxxxxx          
    U+0800-U+FFFF       1110yyyy  10yyyyxx  10xxxxxx    
    U+10000-U+10FFFF    11110zzz  10zzyyyy  10yyyyxx  10xxxxxx
    
  • UCS-2
  • UCS-2BE
  • UCS-2LE

  • UTF-16
  • UTF-16BE
  • UTF-16LE

  • UTF-32
  • UTF-32BE
  • UTF-32LE
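The UTF-8 bit layout in the table above can be sketched by hand in Python and checked against the built-in codec. This is only an illustration of the table, not how any library implements it. The function name `utf8_encode` is made up, and rejecting the surrogate range U+D800 to U+DFFF is something I added because those values are not valid scalar values.

```python
def utf8_encode(code_point):
    """Encode one Unicode scalar value to UTF-8 bytes, following the bit-layout table."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Cross-check one code point from each row of the table against Python's own codec.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> " + " ".join(f"{b:02X}" for b in encoded))
```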

Solution 4

"Unicode" is another term for "UTF-16", which is an encoding of the Unicode character set into sixteen bits per character. UTF-8 encodes it into eight bits per character.

In both cases, any overflow is allocated to another 16 or eight bits.
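For UTF-16, the "overflow" takes the form of a surrogate pair: a code point above U+FFFF is split across two 16-bit units. Here is a rough Python sketch of that, which also shows how the same units come out as bytes in little-endian and big-endian order. The helper names `utf16_code_units` and `to_bytes` are hypothetical, and the code assumes it is given a valid code point.

```python
import struct

def utf16_code_units(code_point):
    """Return the 16-bit UTF-16 code units for one code point (assumed valid)."""
    if code_point <= 0xFFFF:
        return [code_point]                 # fits in a single 16-bit unit
    offset = code_point - 0x10000           # 20 bits remain after subtracting
    high = 0xD800 | (offset >> 10)          # top 10 bits go into the high surrogate
    low = 0xDC00 | (offset & 0x3FF)         # bottom 10 bits go into the low surrogate
    return [high, low]

def to_bytes(units, big_endian):
    """Serialize 16-bit units in the requested byte order."""
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)

for cp in (0x41, 0x4E2D, 0x1F600):
    units = utf16_code_units(cp)
    le = to_bytes(units, big_endian=False)
    be = to_bytes(units, big_endian=True)
    # Cross-check against Python's own codecs.
    assert le == chr(cp).encode("utf-16-le")
    assert be == chr(cp).encode("utf-16-be")
    print(f"U+{cp:04X}: units={[f'{u:04X}' for u in units]}  LE={le.hex()}  BE={be.hex()}")
```

The two byte orders carry exactly the same code units; only the order of the two bytes inside each unit differs.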

Solution 5

The only real advantage with small files like text files is the resulting file size. UTF-8 generally produces smaller files. But this difference may be less pronounced with Chinese/Japanese text.

Author: Ben Turner

I work with GPUs on deep learning and computer vision.

Updated on September 17, 2022

Comments

  • Ben Turner
    Ben Turner over 1 year

    When I try to save a text file with non-English text in Notepad, I get an option to choose between Unicode, Unicode Big Endian and UTF-8. What is the difference between these formats?

    Assuming I do not want any backward compatibility (with older OS versions or apps) and I do not care about the file size, which of these formats is better?

    (Assume that the text can be in languages like Chinese or Japanese, in addition to other languages.)

    Note: From the answers and comments below, it seems that in Notepad lingo, Unicode is UTF-16 (Little Endian), Unicode Big Endian is UTF-16 (Big Endian), and UTF-8 is, well, UTF-8.

  • steve
    steve almost 15 years
    Which one is better then?
  • juanefren
    juanefren almost 15 years
    The problem comes from the fact that Unicode is an 'encoding', but not in the numbers-into-bytes sense. UTF-8/16/32 are all Unicode encodings, but Unicode itself is a mapping from symbols to numbers. They could have used more unique terminology to avoid this confusion I think.
  • juanefren
    juanefren almost 15 years
    Regardless though, to the OP of the question, odds are that the application means 'UTF-16' where it says 'Unicode'.
  • Jason Baker
    Jason Baker almost 15 years
    Bear in mind that there's also a difference in network bandwidth and memory usage.
  • John Saunders
    John Saunders almost 15 years
    "it depends" on the situation.
  • Mr. Shiny and New 安宇
    Mr. Shiny and New 安宇 over 14 years
    I'm not sure that UTF-8's goal is "conservation" as opposed to backwards-compatibility with ASCII.
  • Arjan
    Arjan over 14 years
    Though for this specific question it seems that "Unicode" is indeed ABUSED as another term for "UTF-16", it's not so in general -- see Jason's answer.
  • sleske
    sleske over 14 years
    "UTF-8 generally produces smaller files": Not generally. UTF-8 produces smaller files for ASCII files. If a file only consists of Unicode codepoints above U+0800, it will be larger in UTF-8 than in UTF-16.
  • peyman khalili
    peyman khalili over 13 years
    @Johannes: The Unicode Consortium has decided never to assign code points above U+10FFFF because they cannot be represented in UTF-16. This had the effect of restricting UTF-8 to 4 bytes.
  • peyman khalili
    peyman khalili over 13 years
    You mean "per code unit", not "per character"; both UTF-8 and UTF-16 can use multiple code units to represent a character. And "Unicode" and "UTF-16" are NOT the same thing, except in Microsoft terminology.
  • David Richerby
    David Richerby over 8 years
    And the difference is...?
  • phuclv
    phuclv over 8 years
    There are more Unicode character encodings than you listed. For example, UTF-1, UTF-7, UTF-EBCDIC, GB-18030, MIME, UTF-9 and UTF-18... You can also use any binary encoding scheme to encode Unicode data. Read more: Comparison of Unicode encodings
  • Pacerier
    Pacerier almost 7 years
    @Jason, is Joel actually really racist?