Can we switch between ASCII and Unicode


Solution 1

Java uses Unicode internally. Always. Actually, it uses UTF-16 most of the time, but that's too much detail for now.

It cannot use ASCII internally (for a String, for example). Any String that can be represented in ASCII can also be represented in Unicode, so that should not be a problem.

The only place where the platform comes into play is when Java has to choose an encoding because you didn't specify one. For example, when you create a FileWriter to write String values to a file: at that point Java needs an encoding that specifies how each character should be mapped to bytes. If you don't specify one, the default encoding of the platform is used. That default encoding is almost never ASCII. Most Linux platforms use UTF-8, Windows often uses some ISO-8859-* derivative (or another culture-specific 8-bit encoding), but no current OS uses plain ASCII (simply because ASCII can't represent a lot of important characters).
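As a minimal sketch of that difference (the file names here are just placeholders): the first writer silently depends on whatever the platform default happens to be, while the second pins the encoding explicitly.

```java
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class EncodingChoice {
    public static void main(String[] args) throws IOException {
        // Uses the platform default encoding -- the bytes on disk depend on the OS/locale.
        try (Writer implicitEncoding = new FileWriter("default-encoding.txt")) {
            implicitEncoding.write("héllo");
        }

        // Explicit encoding -- the bytes on disk are well-defined on every platform.
        try (Writer explicitEncoding = new OutputStreamWriter(
                new FileOutputStream("utf8.txt"), StandardCharsets.UTF_8)) {
            explicitEncoding.write("héllo");
        }
    }
}
```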

In fact, pure ASCII is almost irrelevant these days: no one uses it. ASCII is only important as a common subset of the mapping of most 8-bit encodings (including UTF-8): the lower 128 Unicode codepoints map 1:1 to the numeric values 0-127 in many, many encodings. But pure ASCII (where the values 128-255 are undefined) is no longer in active use.
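That common-subset property is easy to demonstrate: for text that stays in the 0-127 range, several encodings produce byte-for-byte identical output. A small sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSubset {
    public static void main(String[] args) {
        String ascii = "Hello, world";  // only characters in the 0-127 range

        byte[] asAscii  = ascii.getBytes(StandardCharsets.US_ASCII);
        byte[] asLatin1 = ascii.getBytes(StandardCharsets.ISO_8859_1);
        byte[] asUtf8   = ascii.getBytes(StandardCharsets.UTF_8);

        // All three encodings agree on pure ASCII text.
        System.out.println(Arrays.equals(asAscii, asLatin1)); // true
        System.out.println(Arrays.equals(asAscii, asUtf8));   // true
    }
}
```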

As a side note, Java 9 has an internal optimization called "compact strings" where Strings that contain only characters representable in Latin-1 use a single byte per character instead of 2. This optimization is very useful for all kinds of "computer speak" like XML and similar protocols where the majority of the text is in the ASCII range. But it's also fully transparent to the developer, as all that handling is done internally in the String class and will not be visible from the outside.

Solution 2

Unicode is a strict superset of ASCII (and Latin-1, for that matter), at least regarding the character set. Not so much for the actual encodings on the byte level. So there cannot be a language/environment that supports Unicode but not ASCII. What the quoted sentence means is that if you only deal with ASCII text, everything works just fine because, as noted, Unicode is a superset of ASCII.
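The byte-level caveat shows up as soon as you encode the same ASCII text with different Unicode encodings. A quick sketch (the exact byte counts assume Java's standard charsets, where "UTF-16" prepends a byte order mark):

```java
import java.nio.charset.StandardCharsets;

public class SameCharactersDifferentBytes {
    public static void main(String[] args) {
        String text = "ABC"; // plain ASCII characters

        System.out.println(text.getBytes(StandardCharsets.US_ASCII).length); // 3
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);    // 3 -- UTF-8 is ASCII-compatible
        System.out.println(text.getBytes(StandardCharsets.UTF_16).length);   // 8 -- 2 bytes per char plus a BOM
    }
}
```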

Also, to clear up a few of your misconceptions:

  1. “ASCII is 1 byte and Unicode is 2” — ASCII is a 7-bit code that uses 1 byte per character. Bytes and characters are therefore the same in ASCII (which is unfortunate, because ideally bytes are just data and text is in characters, but I digress). Unicode is a 21-bit code that defines a mapping of code points (numbers) to characters. How these numbers are represented varies depending on the encoding. UTF-32 is a fixed-width encoding where each Unicode code point is represented as a 32-bit code unit. UTF-16 is what Java uses; it uses either two or four bytes (one or two code units) per code point. But that's 16 bits per code unit, not per code point or actual character (in the Unicode sense); see the sketch after this list. Then there is UTF-8, which uses 8-bit code units and represents code points as either one, two, three or four code units.

  2. For Java, at least, the platform has no say whatsoever in whether it supports only ASCII or Unicode. Java always uses Unicode, and chars represent UTF-16 code units (which can be half-characters), not code points (which would be characters), and are therefore a bit misleadingly named. What you're probably referring to is the Unix tradition of combining language, locale and preferred system encoding in a few environment variables. That is, you can have a system where that preferred encoding is a legacy encoding, and applications that blindly use it can have problems. That doesn't mean you cannot build an application that supports Unicode on such systems. iconv has to work somehow, after all.
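To make the code unit vs. code point distinction concrete, here is a small sketch (the emoji is just an arbitrary character outside the Basic Multilingual Plane, so it needs a surrogate pair in UTF-16):

```java
public class CodeUnitsVsCodePoints {
    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so Java stores it as two chars (a surrogate pair).
        String s = "A\uD83D\uDE00"; // 'A' followed by one emoji

        System.out.println(s.length());                          // 3 UTF-16 code units (chars)
        System.out.println(s.codePointCount(0, s.length()));     // 2 Unicode code points (characters)
        System.out.println(Character.isSurrogate(s.charAt(1)));  // true: one char is only half the emoji
    }
}
```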

Comments

  • shar (almost 2 years ago):

    I came across "char variable is in Unicode format, but adopts / maps well to ASCII also". What is the need to mention that? Of course ASCII is 1 byte and Unicode is 2. And Unicode itself contains ASCII code in it (by default - it's the standard). So are there some languages in which a char variable supports Unicode but not ASCII?

    Also, the character format (Unicode/ASCII) is decided by the platform we use, right? (UNIX, Linux, Windows, etc.) So suppose my platform used ASCII, is it not possible to switch to Unicode or vice versa?