Unicode, UTF, ASCII, ANSI format differences
Solution 1
Going down your list:
- "Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
- UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be relatively rarely used, but now many consumer applications will need to be aware of non-BMP characters in order to support emojis.
- UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
- UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
- UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
- ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
- ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system" (in .NET, this is obtained via Encoding.Default), which is often Windows-1252 but can be another code page.
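The byte counts in the list above can be checked with a short sketch. Python is used here purely for illustration (the answer itself is platform-agnostic), and the helper names encoded_sizes and utf16_code_units are made up for this example. The "-le" codec variants are chosen on the assumption that you want the raw character data only, with no byte order mark prepended.

```python
def encoded_sizes(text):
    """Return how many bytes `text` occupies in each Unicode encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # explicit byte order: no BOM
        "utf-32": len(text.encode("utf-32-le")),
    }


def utf16_code_units(text):
    """Split the UTF-16 form of `text` into its 16-bit code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


# "A" (ASCII), e-acute, euro sign, and a grinning-face emoji outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    units = [hex(u) for u in utf16_code_units(ch)]
    print(f"U+{ord(ch):04X}", encoded_sizes(ch), units)

# ASCII only covers code points 0-127, so anything accented is rejected.
try:
    "\u00e9".encode("ascii")
except UnicodeEncodeError as exc:
    print("ASCII cannot encode U+00E9:", exc.reason)

# An "ANSI" code page such as Windows-1252 stores the same character in one byte.
print("\u00e9".encode("cp1252"))
```

The emoji takes 4 bytes in all three UTF encodings; in UTF-16 those 4 bytes are the surrogate pair 0xD83D 0xDE00, which is why code that assumes "one UTF-16 code unit is one character" breaks on it.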
There's more on my Unicode page and tips for debugging Unicode problems.
The other big resource is of course unicode.org, which contains more information than you'll ever be able to work your way through - possibly the most useful bit is the code charts.
Solution 2
Some reading to get you started on character encodings: Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
By the way - ASP.NET has nothing to do with it. Encodings are universal.
Comments
-
web dunia almost 2 years
What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings? In what way are these helpful for programmers?
-
Doug Moore almost 12 years
I actually think of ANSI as Code Page 437, as that was what ANSI Art used. However, I don't think that is available in ASP.Net
-
Keith Thompson almost 9 years
The term "ANSI" when applied to Microsoft's 8-bit code pages is a misnomer. They were based on drafts submitted for ANSI standardization, but ANSI itself never standardized them. Windows-1252 (the code page most commonly referred to as "ANSI") is similar to ISO 8859-1 (Latin-1), except that Windows-1252 has printable characters in the range 0x80..0x9F, where ISO 8859-1 has control characters in that range. Unicode also has control characters in that range. en.wikipedia.org/wiki/Windows_code_page
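To illustrate the 0x80..0x9F difference described in the comment above, here is a small sketch. It assumes Python's built-in "cp1252" and "latin-1" codecs as stand-ins for Windows-1252 and ISO 8859-1; the helper name decode_both is made up for this example.

```python
import unicodedata


def decode_both(byte_value):
    """Decode one byte as Windows-1252 and as ISO 8859-1 (Latin-1)."""
    raw = bytes([byte_value])
    return raw.decode("cp1252"), raw.decode("latin-1")


# 0x80 and 0x93 fall in the disputed range; 0xE9 is outside it.
for value in (0x80, 0x93, 0xE9):
    win, iso = decode_both(value)
    # Category "Cc" means a control character; anything else is printable here.
    print(
        hex(value),
        f"cp1252=U+{ord(win):04X} ({unicodedata.category(win)})",
        f"latin-1=U+{ord(iso):04X} ({unicodedata.category(iso)})",
    )
```

Byte 0x80 comes out as the euro sign under Windows-1252 but as the control character U+0080 under Latin-1, while 0xE9 decodes to the same letter in both. A few bytes in that range (0x81, for instance) are unassigned in Python's cp1252 codec and raise an error, so they are avoided here.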
-
jp2code over 8 years
@JonSkeet, I have some web pages that send email messages. Currently they use UTF8. Should I be thinking about changing them back to UTF7?
-
Jon Skeet over 8 years
@jp2code: I wouldn't - but you need to distinguish between "content that is sent back via HTTP from the web server" and "content that is sent via email". It's not the web page content that sends the email - it's the app behind it, presumably. The web content would be best in UTF-8; the mail content could be in UTF-7, although I suspect that it's fine to keep that in UTF-8 these days.
-
tripleee over 8 years
As the question no longer mentions ASP.NET anywhere (after edits done quite some time ago), I refactored the answer to be similarly platform-agnostic. In particular, the comments above re: UTF-16 != Unicode no longer make a lot of sense.
-
tripleee over 8 years
UTF-7 is mandated by e.g. IMAP as a protocol-level encoding for some things, but there is no reason to use it where you get to choose the encoding yourself. In email, more and more systems just use charset="utf-8" in the email body, possibly with Content-Transfer-Encoding: quoted-printable or even base64 to ensure that the encoded email is 7-bit clean. In limited systems where you know everything is 8-bit clean, there is no need for that, of course.
-
Ludovic Kuty over 8 years
For UTF-16, IMHO, I would say "2 bytes per code unit" since a code point outside the BMP will be encoded in surrogate pairs as 2 code units (4 bytes).
-
Maarten Bodewes about 8 years
Misses the differences between UTF-16LE (within .NET) and BE as well as the notion of the BOM.
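As a rough illustration of the byte-order point raised in the comment above, the following sketch encodes the same character as UTF-16LE and UTF-16BE and shows where the BOM fits in. It assumes Python's standard codecs; the variable names are arbitrary.

```python
import codecs

text = "A"  # U+0041

little = text.encode("utf-16-le")  # low byte first
big = text.encode("utf-16-be")     # high byte first
print(little, big)

# The plain "utf-16" codec prepends a BOM (U+FEFF) in the machine's native
# byte order, so a decoder can tell which order follows.
with_bom = text.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# When decoding with "utf-16", the BOM selects the byte order and is dropped.
print((codecs.BOM_UTF16_BE + big).decode("utf-16"))
print((codecs.BOM_UTF16_LE + little).decode("utf-16"))

# UTF-8 has no byte-order ambiguity, but some tools still write a BOM
# signature; Python exposes that variant as "utf-8-sig".
print(text.encode("utf-8-sig"))
```

Both decode calls return the original "A". Without a BOM or some out-of-band knowledge of the byte order, the two byte sequences b"A\x00" and b"\x00A" cannot be told apart reliably.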
-
Nick Sotiros over 7 years
The U in UTF stands for Unicode. UTF stands for Unicode Transformation Format, so all UTF is some type (encoding) of Unicode.
-
Dave Knise almost 7 years
Answered here 6 years after the article was written. I read it 8 years after the post was written. 14 years later and it's still a good read. That's more than half my life ago. Incredible.
-
Andrew over 6 years
Is there any difference between an ASCII and a WP-1252 encoded file if only ASCII chars are present? Once extended chars are introduced into the file that can't be displayed in ASCII, is a BOM added to the file to clearly identify it as WP-1252, or is just the MSB of extended chars relied on for identification?
-
Jon Skeet over 6 years
@Andrew: No, there's no (general) encoding marker. Windows 1252 can't represent the Unicode BOM, and it wouldn't make sense as it's only a one-byte-per-char encoding anyway.
-
MrWatson over 4 years
@JonSkeet: I think it is time to correct the comment that UTF-16 characters outside the BMP are "relatively rarely used" ... Thanks to the #Emojiplosion of recent years, we all need to get savvy about how to deal with "multi-word" UTF-16!
-
Jon Skeet over 4 years
@MrWatson: Yup, will do.
-
MrWatson over 4 years
@JonSkeet - you get +100 from me for the enlightening comment about ANSI ... This term has been confounding me for YEARS and YEARS and multiple searches on the internet have not enlightened me - till now!
-
vulcan raven over 3 years
Another similar useful resource: youtube.com/watch?v=MijmeoH9LT4