Unicode, UTF, ASCII, ANSI format differences

322,334

Solution 1

Going down your list:

  • "Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
  • UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be relatively rarely used, but now many consumer applications will need to be aware of non-BMP characters in order to support emojis.
  • UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
  • UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
  • UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
  • ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
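  • ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default code page for my system's locale", which in .NET Framework is obtained via Encoding.Default and is often Windows-1252 but can be another code page. (On .NET Core and later, Encoding.Default always returns UTF-8, so it no longer tells you the system code page.)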
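
To make the size differences in the list above concrete, here is a minimal sketch in Python (not the .NET API the answer talks about). It encodes a few sample characters and reports how many bytes each encoding needs. The helper names encoded_length and utf16_code_units are made up for this illustration, and "cp1252" stands in for "ANSI" on the assumption of a Western European Windows code page.

```python
# Sample characters: U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign),
# U+1F600 (an emoji outside the BMP). Escapes are used so the source is pure ASCII.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# The "-le" variants are used so that no byte order mark is prepended.
encodings = ["ascii", "cp1252", "utf-8", "utf-16-le", "utf-32-le"]


def encoded_length(text, encoding):
    """Return the byte length of text in encoding, or None if it cannot be represented."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


def utf16_code_units(text):
    """Count 16-bit code units; a surrogate pair counts as two."""
    return len(text.encode("utf-16-le")) // 2


for ch in samples:
    sizes = {enc: encoded_length(ch, enc) for enc in encodings}
    print(f"U+{ord(ch):04X}", sizes, "UTF-16 code units:", utf16_code_units(ch))
```

The emoji is the interesting row: it needs 4 bytes in UTF-8, 4 bytes (two code units, i.e. a surrogate pair) in UTF-16, and cannot be represented at all in ASCII or Windows-1252.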

There's more on my Unicode page and tips for debugging Unicode problems.

The other big resource of course is unicode.org, which contains more information than you'll ever be able to work your way through - possibly the most useful bit is the code charts.

Solution 2

Some reading to get you started on character encodings: Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

By the way - ASP.NET has nothing to do with it. Encodings are universal.

web dunia
Author by

web dunia

Good person.

Updated on July 18, 2022

Comments

  • web dunia
    web dunia almost 2 years

    What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings?

    In what way are these helpful for programmers?

  • Doug Moore
    Doug Moore almost 12 years
    I actually think of ANSI as Code Page 437, as that was what ANSI Art used. However, I don't think that is available in ASP.Net
  • Keith Thompson
    Keith Thompson almost 9 years
    The term "ANSI" when applied to Microsoft's 8-bit code pages is a misnomer. They were based on drafts submitted for ANSI standardization, but ANSI itself never standardized them. Windows-1252 (the code page most commonly referred to as "ANSI") is similar to ISO 8859-1 (Latin-1), except that Windows-1252 has printable characters in the range 0x80..0x9F, where ISO 8859-1 has control characters in that range. Unicode also has control characters in that range. en.wikipedia.org/wiki/Windows_code_page
  • jp2code
    jp2code over 8 years
    @JonSkeet, I have some web pages that send email messages. Currently they use UTF8. Should I be thinking about changing them back to UTF7?
  • Jon Skeet
    Jon Skeet over 8 years
    @jp2code: I wouldn't - but you need to distinguish between "content that is sent back via HTTP from the web server" and "content that is sent via email". It's not the web page content that sends the email - it's the app behind it, presumably. The web content would be best in UTF-8; the mail content could be in UTF-7, although I suspect that it's fine to keep that in UTF-8 these days.
  • tripleee
    tripleee over 8 years
    As the question no longer mentions ASP.NET anywhere (after edits done quite some time ago), I refactored the answer to be similarly platform-agnostic. In particular, the comments above re: UTF-16 != Unicode no longer make a lot of sense.
  • tripleee
    tripleee over 8 years
    UTF-7 is mandated by e.g. IMAP as a protocol-level encoding for some things, but there is no reason to use it where you get to choose the encoding yourself. In email, more and more systems just use charset="utf-8" in the email body, possibly with Content-Transfer-Encoding: quoted-printable or even base64 to ensure that the encoded email is 7-bit clean. In limited systems where you know everything is 8-bit clean, there is no need for that, of course.
  • Ludovic Kuty
    Ludovic Kuty over 8 years
    For UTF-16, IMHO, I would say "2 bytes per code unit" since a code point outside the BMP will be encoded in surrogate pairs as 2 code units (4 bytes).
  • Maarten Bodewes
    Maarten Bodewes about 8 years
    Misses the differences between UTF-16LE (within .NET) and BE as well as the notion of the BOM.
  • Nick Sotiros
    Nick Sotiros over 7 years
    The U in UTF stands for Unicode. UTF stands for Unicode Transformation Format, so all UTF is some type (encoding) of unicode.
  • Dave Knise
    Dave Knise almost 7 years
    Answered here 6 years after the article was written. I read it 8 years after the post was written. 14 years later and it's still a good read. That's more than half my life ago. Incredible.
  • Andrew
    Andrew over 6 years
    Is there any difference between an ASCII and a WP-1252 encoded file if only ASCII chars are present? Once extended chars are introduced into the file that can't be displayed in ASCII, is a BOM added to the file to clearly identify it as WP-1252, or is just the MSB of extended chars relied on for identification?
  • Jon Skeet
    Jon Skeet over 6 years
    @Andrew: No, there's no (general) encoding marker. Windows 1252 can't represent the Unicode BOM, and it wouldn't make sense as it's only a one-byte-per-char encoding anyway.
  • MrWatson
    MrWatson over 4 years
    @JonSkeet : I think it is time to correct the comment that UTF-16 characters outside the BMP are "relatively rarely used" ... Thanks to the #Emojiplosion of recent years, we all need to get savvy how to deal with "multi-word" UTF-16!
  • Jon Skeet
    Jon Skeet over 4 years
    @MrWatson: Yup, will do.
  • MrWatson
    MrWatson over 4 years
    @JonSkeet - you get +100 from me for the enlightening comment about ANSI ... This term has been confounding me for YEARS and YEARS and multiple searches in the internet have not enlightened me - till now!
  • vulcan raven
    vulcan raven over 3 years
    Another similar useful resource: youtube.com/watch?v=MijmeoH9LT4
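
Several of the comments above (byte order and the BOM, Windows-1252 versus ISO 8859-1, ASCII-only files) can be checked with a short Python sketch. This is only an illustration using Python's standard codecs, not the .NET behaviour discussed in the answer. The variable names are arbitrary.

```python
import codecs

text = "hi"

# The explicit -le / -be codecs write no BOM; only the byte order differs.
little = text.encode("utf-16-le")  # b'h\x00i\x00'
big = text.encode("utf-16-be")     # b'\x00h\x00i'

# A BOM (U+FEFF) in front lets a generic UTF-16 decoder pick the byte order.
# The decoder consumes the BOM, so it does not appear in the decoded text.
from_le = (codecs.BOM_UTF16_LE + little).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + big).decode("utf-16")

# UTF-8 has no byte order, but "utf-8-sig" still writes the 3-byte signature.
utf8_with_signature = text.encode("utf-8-sig")

# Windows-1252 and ISO 8859-1 disagree in the range 0x80..0x9F.
as_cp1252 = b"\x80".decode("cp1252")   # euro sign, U+20AC
as_latin1 = b"\x80".decode("latin-1")  # control character, U+0080

# For ASCII-only text, the ASCII and Windows-1252 bytes are identical and
# nothing in the data marks which of the two encodings was intended.
same_bytes = "hello".encode("ascii") == "hello".encode("cp1252")

print(little, big, from_le, from_be)
print(utf8_with_signature, hex(ord(as_cp1252)), hex(ord(as_latin1)), same_bytes)
```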