Difference between UTF-8 and UTF-16?


Solution 1

I believe there are a lot of good articles about this around the Web, but here is a short summary.

Both UTF-8 and UTF-16 are variable-length encodings. In UTF-8, a character occupies at least 8 bits (one byte), while in UTF-16 a character occupies at least 16 bits (two bytes).

Main UTF-8 pros:

  • Basic ASCII characters like digits, Latin letters with no accents, etc. occupy one byte that is identical to their US-ASCII representation. This way, every US-ASCII string is also valid UTF-8, which provides decent backwards compatibility in many cases (see the sketch after this list).
  • No null bytes, which allows the use of null-terminated strings and introduces a great deal of backwards compatibility too.
  • UTF-8 is independent of byte order, so you don't have to worry about the Big Endian / Little Endian issue.
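
To illustrate the ASCII point, here is a minimal Java sketch (class and string names are just examples) showing that a pure US-ASCII string encodes to exactly the same bytes in US-ASCII and UTF-8:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class Utf8AsciiCompat {
        public static void main(String[] args) {
            String ascii = "Hello, world!";
            // Pure ASCII text produces byte-for-byte identical output in both encodings.
            byte[] usAscii = ascii.getBytes(StandardCharsets.US_ASCII);
            byte[] utf8 = ascii.getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.equals(usAscii, utf8)); // prints: true
        }
    }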

Main UTF-8 cons:

  • Many common characters have different lengths, which makes indexing by code point and counting code points terribly slow.
  • Even though byte order doesn't matter, UTF-8 text sometimes still carries a BOM (byte order mark), which serves to signal that the text is encoded in UTF-8, but it also breaks compatibility with ASCII software even if the text contains only ASCII characters. Microsoft software (like Notepad) is especially fond of adding a BOM to UTF-8; a sketch for detecting and stripping it follows this list.
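
As a rough illustration of the BOM issue, here is a hedged Java sketch; stripUtf8Bom is a hypothetical helper, not a standard API:

    import java.nio.charset.StandardCharsets;

    // Hypothetical helper: drops a leading UTF-8 BOM (bytes EF BB BF) if present,
    // then decodes the remaining data as UTF-8.
    static String stripUtf8Bom(byte[] data) {
        int offset = 0;
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            offset = 3;
        }
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }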

Main UTF-16 pros:

  • BMP (basic multilingual plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some code points outside the BMP mandatory) and most Japanese characters, can be represented with 2 bytes. This speeds up indexing and counting code points, provided the text contains no supplementary characters.
  • Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, so the total length remains divisible by two and a 16-bit char can still be used as the primitive component of the string (see the sketch after this list).
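
A small Java sketch of how UTF-16 byte counts behave (the string contents are arbitrary; StandardCharsets.UTF_16BE is used so that no BOM is included in the count):

    import java.nio.charset.StandardCharsets;

    public class Utf16Lengths {
        public static void main(String[] args) {
            String bmp = "Hello";          // 5 BMP characters
            String supp = "\uD83D\uDE00";  // U+1F600, one supplementary character (a surrogate pair)
            System.out.println(bmp.getBytes(StandardCharsets.UTF_16BE).length);  // 10 (2 bytes per char)
            System.out.println(supp.getBytes(StandardCharsets.UTF_16BE).length); // 4 (two 16-bit units)
        }
    }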

Main UTF-16 cons:

  • Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
  • Using it as a fixed-length encoding “mostly works” in many common scenarios (especially in the US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn't. This means programmers have to be aware of surrogate pairs and handle them properly in cases where it matters (see the code-point iteration sketch after this list)!
  • It's variable-length, so counting or indexing code points is costly, though less so than with UTF-8.
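
For the surrogate-pair handling mentioned above, a minimal Java sketch of counting and iterating code points rather than chars (the sample string is arbitrary):

    public class CodePointCount {
        public static void main(String[] args) {
            String s = "A\uD83D\uDE00B";   // 'A', U+1F600 (a surrogate pair), 'B'
            System.out.println(s.length());                       // 4 UTF-16 code units
            System.out.println(s.codePointCount(0, s.length()));  // 3 actual code points
            // Iterating by code point keeps surrogate pairs together.
            s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        }
    }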

In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use the native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue to begin with and null termination often comes in handy, as does ASCII compatibility.
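
To make the size trade-off concrete, here is a small comparison sketch in Java (the sample strings are arbitrary; UTF_16BE is used so no BOM is counted):

    import java.nio.charset.StandardCharsets;

    public class SizeComparison {
        static void report(String s) {
            System.out.println(s + ": UTF-8 = " + s.getBytes(StandardCharsets.UTF_8).length
                    + " bytes, UTF-16 = " + s.getBytes(StandardCharsets.UTF_16BE).length + " bytes");
        }
        public static void main(String[] args) {
            report("Hello");   // ASCII: 5 vs 10 bytes, UTF-8 is smaller
            report("Привет");  // Cyrillic: 12 vs 12 bytes, a tie
            report("日本語");    // CJK: 9 vs 6 bytes, UTF-16 is smaller
        }
    }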

Solution 2

They're simply different schemes for representing Unicode characters.

Both are variable-length encodings: UTF-16 uses 2 bytes for all characters in the Basic Multilingual Plane (BMP), which contains most characters in common use.

UTF-8 uses between 1 and 3 bytes for characters in the BMP and 4 bytes for the rest of the current Unicode range (up to U+10FFFF); the original design is extensible up to U+7FFFFFFF with longer sequences, should that ever become necessary... but notably, all ASCII characters are represented in a single byte each.

For the purposes of a message digest it won't matter which of these you pick, so long as everyone who tries to recreate the digest uses the same option.
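
A short Java sketch of that point, reusing the SHA-256 setup from the question (the sample text is arbitrary): the digest is computed over bytes, so the two encodings produce different digests, and both sides simply have to agree on one.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Arrays;

    public class DigestEncoding {
        public static void main(String[] args) throws Exception {
            String text = "This is some text";
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] utf8Digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            byte[] utf16Digest = md.digest(text.getBytes(StandardCharsets.UTF_16));
            // Different byte streams, therefore different digests.
            System.out.println(Arrays.equals(utf8Digest, utf16Digest)); // prints: false
        }
    }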

See this page for more about UTF-8 and Unicode.

(Note that a Java char is a UTF-16 code unit, which covers code points within the BMP; to represent characters above U+FFFF, Java uses surrogate pairs of two chars, as the sketch below shows.)
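
A brief sketch of what that looks like in Java (the code point U+1F600 is just an example):

    public class SupplementaryInJava {
        public static void main(String[] args) {
            // U+1F600 lies above U+FFFF, so it cannot fit in a single char.
            String emoji = new String(Character.toChars(0x1F600));
            System.out.println(emoji.length());       // 2: stored as a surrogate pair of chars
            System.out.println(emoji.codePointAt(0)); // 128512, i.e. 0x1F600
        }
    }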

Solution 3

Security: Use only UTF-8

Difference between UTF-8 and UTF-16? Why do we need these?

There have been at least a couple of security vulnerabilities in implementations of UTF-16. See Wikipedia for details.

WHATWG and W3C have now declared that only UTF-8 is to be used on the Web.

The [security] problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.

Other groups are saying the same.

So while UTF-16 may continue being used internally by some systems such as Java and Windows, what little use of UTF-16 you may have seen in the past for data files, data exchange, and such, will likely fade away entirely.

Solution 4

This is somewhat unrelated to UTF-8 vs. UTF-16 (although the code below does produce UTF-16, and the BE/LE part can be set with a single line), yet it is the fastest way to convert a String to a byte[], which is useful exactly for the case in the question (hashing). String.getBytes(enc) is relatively slow by comparison.

    import java.nio.ByteBuffer;

    // Encodes the string as UTF-16 without a BOM; ByteBuffer defaults to big-endian.
    static byte[] toBytes(String s) {
        byte[] b = new byte[s.length() * 2];      // every char is one 16-bit code unit
        ByteBuffer.wrap(b).asCharBuffer().put(s); // bulk-copy the chars into the byte array
        return b;
    }
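
Two notes on this sketch: ByteBuffer defaults to big-endian, so the result is effectively UTF-16BE without a BOM, and the byte order can be flipped with a single call to order. A small variant, assuming java.nio.ByteOrder is also imported:

    // Same idea, but producing little-endian (UTF-16LE) output.
    static byte[] toBytesLE(String s) {
        byte[] b = new byte[s.length() * 2];
        ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).asCharBuffer().put(s);
        return b;
    }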
Comments

  • theJava
    theJava about 3 years

    Difference between UTF-8 and UTF-16? Why do we need these?

    MessageDigest md = MessageDigest.getInstance("SHA-256");
    String text = "This is some text";
    
    md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed
    byte[] digest = md.digest();
    
  • bestsss
    bestsss almost 13 years
    Missing only the BE/LE part on UTF-16 :) UTF-8 has another downside: it may generate longer output than UTF-16.
  • Sergei Tachenov
    Sergei Tachenov almost 13 years
    Yes, I forgot about BE/LE. It's not a big deal, though, especially for in-memory use. UTF-8 will generate longer output only if three-byte characters are involved, but that means mostly Chinese and Japanese. On the other hand, if the text contains a lot of US-ASCII characters, it may generate shorter output, so whether it is a downside or not depends on a particular situation.
  • bestsss
    bestsss almost 13 years
    I didn't even think of mentioning the immediate pro of utf-8, shorter length. About the longer output of utf-8 it was 'may' for a reason, yet if the target is far east, the default encoding should be utf-16. As for the example md.update(text.getBytes("UTF-8")); the encoding doesn't matter since the hash is stable both ways.
  • bestsss
    bestsss almost 13 years
    The fastest way to convert a String to a byte array is something like that; I posted it below as a sample.
  • nicky_zs
    nicky_zs over 9 years
    You say characters have different lengths in UTF-8, which slows down indexing and calculating length, but characters in UTF-16 have different lengths too, so why should indexing and calculating length be faster with UTF-16?
  • Sergei Tachenov
    Sergei Tachenov over 9 years
    @nicky, read carefully - I was talking about 16-bit subset of UTF-16 which is usually more than enough for many applications. If it's not enough, then extra characters can be represented with surrogate pairs, which screws up string length and makes indexes wrong, but at least you never end up in the middle of a regular (16-bit) character this way, which is impossible to guarantee with UTF-8.
  • Gaurang Patel
    Gaurang Patel over 6 years
    @SergeyTachenov "In general, UTF-16 is usually better for in-memory representation while UTF-8 is extremely good for text files and network protocols." Great summary line! Thanks.
  • Gaurang Patel
    Gaurang Patel over 6 years
    @SergeyTachenov Can you shed some light on differences between in-memory representation and on storage (like hard drive) representation (files) in terms of encoding?
  • Sergei Tachenov
    Sergei Tachenov over 6 years
    @GaurangPatel, well, it kind of follows from the pros/cons, but I have edited the answer anyway, trying to explain the reasons.