What would break if the C locale was UTF-8 instead of ASCII?

character-encoding locale posix unicode compatibility

7,074

Solution 1

The C locale is not the default locale. It is a locale that is guaranteed not to cause any “surprising” behavior. A number of commands have output of a guaranteed form (e.g. ps or df headers, date format) in the C or POSIX locale. For encodings (LC_CTYPE), it is guaranteed that [:alpha:] only contains the ASCII letters, and so on. If the C locale were modified, this would cause many applications to misbehave. For example, they might reject input that is invalid UTF-8 instead of treating it as binary data.
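To make those two guarantees concrete, here is a minimal Python sketch (Python is used purely for illustration, and the helper names process_as_bytes and process_as_utf8 are made up for this example). The first part relies on re.LOCALE, which makes \w follow the C library's classification for the current LC_CTYPE locale. The second part contrasts byte-oriented processing with strict UTF-8 decoding. It only approximates what a real utility would do.

```python
import locale
import re

# Part 1: in the C locale, character classes cover ASCII only.
# re.LOCALE (bytes patterns only) makes \w depend on the current
# LC_CTYPE locale, so it reflects what the C library reports.
previous = locale.setlocale(locale.LC_CTYPE)    # remember current setting
locale.setlocale(locale.LC_CTYPE, "C")          # switch to the C locale
try:
    ascii_letter_is_word = re.fullmatch(rb"\w", b"A", re.LOCALE) is not None
    # 0xC4 is A-umlaut in ISO-8859-1, but just "some byte" in the C locale.
    high_byte_is_word = re.fullmatch(rb"\w", b"\xc4", re.LOCALE) is not None
finally:
    locale.setlocale(locale.LC_CTYPE, previous)  # restore the old setting

# Part 2: byte-oriented versus UTF-8-strict processing.
# 0xFF can never appear in well-formed UTF-8.
data = b"name=\xff\xfe\n"


def process_as_bytes(buf: bytes) -> bytes:
    """Hypothetical byte-oriented tool: every byte value is acceptable."""
    return buf.upper()  # bytes.upper() only changes ASCII a-z


def process_as_utf8(buf: bytes) -> str:
    """Hypothetical text-oriented tool: input must decode as UTF-8 first."""
    return buf.decode("utf-8").upper()


bytes_result = process_as_bytes(data)

try:
    process_as_utf8(data)
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True

print(ascii_letter_is_word, high_byte_is_word, bytes_result, utf8_rejected)
```

The byte-oriented version passes the odd bytes through untouched, while the strict version refuses the input outright. That is the kind of behavior change the paragraph above is warning about.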

If you want all programs on your system to use UTF-8, set the default locale to UTF-8. More precisely, this affects all programs that manipulate a single encoding. Some programs only manipulate byte streams and don't care about encodings. Some programs manipulate multiple encodings and don't care about the locale (for example, a web server or web client sets or reads the encoding for each connection in a header).
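As a rough illustration of that last case, the sketch below picks the encoding from a Content-Type style header value instead of from the locale. The function names charset_from_content_type and decode_body are hypothetical, and the "utf-8" fallback is just an assumption made for this example, not something a protocol mandates.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value.

    The fallback value is an assumption made for this sketch.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset() or default


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a message body using the encoding named in its own header."""
    return body.decode(charset_from_content_type(content_type))


# The same text ("caf" + e-acute) sent over two connections
# with two different encodings.
latin1_body = "caf\u00e9".encode("iso-8859-1")
utf8_body = "caf\u00e9".encode("utf-8")

text_a = decode_body(latin1_body, "text/plain; charset=ISO-8859-1")
text_b = decode_body(utf8_body, "text/plain; charset=utf-8")

print(text_a == text_b)
```

Nothing here consults the locale: each connection carries its own encoding label, which is why such programs are unaffected by what the C locale is defined to be.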

Solution 2

You are a bit confused, I think. The "C locale" is a locale like any other, which, as you point out, is conventionally a synonym for 7-bit ASCII.

It's built into the C library, I suppose so that the library has some kind of fallback -- there can't be no locale.

However, this does not have anything to do with how programs built from C code deal with input. The locale is used to translate input that is handed to an executable; if the system locale is UTF-8, then UTF-8 is what the program gets, regardless of whether its source was written in C or something else. So:

"I would be surprised to see code that can only deal with 7-bit clean input and cannot be easily adapted to accept a UTF-8-enabled C."

Does not really make sense. A minimal piece of standard C source that reads from standard input receives a stream of bytes from the system. If the system uses UTF-8 and it produced the stream from some HID hardware, then that stream may contain UTF-8 encoded characters. If it came from somewhere else (e.g., a network or a file), it might contain anything, which is what makes the assumption of a UTF-8 standard useful.
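The "stream of bytes" point can be sketched as follows (in Python rather than C, purely for brevity; count_bytes_and_chars is a made-up helper). The program receives bytes, and whether they are UTF-8 is an interpretation applied afterwards.

```python
import io


def count_bytes_and_chars(stream, encoding="utf-8"):
    """Read a binary stream and report its size in bytes and in characters.

    The stream itself is only bytes; 'encoding' is the assumption the
    program chooses to make about them.
    """
    raw = stream.read()
    text = raw.decode(encoding)
    return len(raw), len(text)


# Stand-in for standard input: "na" + i-diaeresis + "ve", then a newline,
# encoded as UTF-8.
stream = io.BytesIO("na\u00efve\n".encode("utf-8"))
n_bytes, n_chars = count_bytes_and_chars(stream)

print(n_bytes, n_chars)
```

With real input you would pass sys.stdin.buffer instead of the BytesIO object. The byte count and the character count differ because the one non-ASCII letter takes two bytes in UTF-8.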

The fact that the C locale is a much more restricted char set than the UTF-8 locale is unrelated. It's just called "the C locale", but in fact it has no more or less to do with composing C code than any other.

You can, in fact, hardcode UTF-8 characters into C strings in the source. Presuming the system is UTF-8, those strings will look correct when used by the resulting executable.
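A small sketch of why "presuming the system is UTF-8" matters: the hardcoded string is just a fixed sequence of bytes, and it only looks right when whatever displays it also interprets those bytes as UTF-8. Python bytes stand in for a C string literal here, and the variable names are illustrative.

```python
# The UTF-8 bytes for e-acute, "t", e-acute, as a compiler would store
# them for a literal typed into a UTF-8 encoded source file.
hardcoded = b"\xc3\xa9t\xc3\xa9"

# What a UTF-8 terminal would show for these bytes.
as_utf8 = hardcoded.decode("utf-8")

# What an ISO-8859-1 terminal would show for the very same bytes.
as_latin1 = hardcoded.decode("iso-8859-1")

print(len(hardcoded), len(as_utf8), len(as_latin1))
```

The executable never changes. Only the interpretation on the receiving side does, which is why the string looks correct on a UTF-8 system and garbled elsewhere.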

The "Roger Leigh" link you posted in a comment I believe refers to using an expanded set (UTF-8) as the C locale in a C library destined for an embedded environment, so that no other locale has to be loaded for the system to deal with UTF-8.

So the answer to the question, "What would break if the C locale was UTF-8 instead of ASCII?" is, I would guess, nothing; but outside of an embedded environment or the like, there is not much need to do this. Still, it will very likely become the norm at some point for libraries such as GNU C (it might as well be, I think).


Infinity James

Updated on September 18, 2022

Comments

Infinity James over 1 year

The C locale is defined to use the ASCII charset and POSIX does not provide a way to use a charset without changing the locale as well.

What would happen if the encoding of C were switched to UTF-8 instead?

The positive side would be that UTF-8 would become the default charset for any process, even system daemons. Obviously there would be applications that would break because they assume that C uses 7-bit ASCII. But do these applications really exist? Right now a lot of written code is locale- and charset-aware to a certain extent; I would be surprised to see code that can only deal with 7-bit clean input and cannot be easily adapted to accept a UTF-8-enabled C.
- Admin about 11 years
  
  This thread from 2009 discusses the need for a UTF-8-based C locale, but does not address the problem of breaking POSIX.
- Admin almost 6 years
  
  FWIW, OpenBSD has a C.UTF-8 locale, as well as POSIX.UTF-8.
Infinity James about 11 years

The behavior of various syscalls is influenced by the charset of the locale, for example «isupper() will not recognize an A-umlaut (Ä) as an uppercase letter in the default C locale.» (from man7.org/linux/man-pages/man3/isprint.3.html). isprint() is another syscall that is influenced as well by the fact that C is defined as ASCII-only.
goldilocks about 11 years

Yes, (in theory) those are influenced by the locale, but that locale is usually UTF-8, it is not necessarily 'C'. In GNU, they're broken in this regard, however: gnu.org/software/gnulib/manual/html_node/isupper.html Keep in mind that 100% of the fundamentals of a unix system are coded in C, so the idea that "C doesn't handle UTF-8" is well, just plain incorrect and obviously so. If a program written in C could not deal with UTF-8, there wouldn't be any UTF-8 on the system. Period.
goldilocks about 11 years

Qv. also the POSIX isupper() page pubs.opengroup.org/onlinepubs/9699919799/functions/isupper.html "in the current locale of the process", not "the C locale". This is also in the ISO standard, which refers to "in the C locale" and "in the current locale", usually in the form "if the current locale is the C locale", etc. Keep in mind, again, if you are on linux, GNU C's implementation of some of the ctype functions is broken.
Gilles 'SO- stop being evil' about 11 years

@gioele These are library functions, not syscalls. Syscalls are calls to the kernel and are not affected by locales: locales exist purely at user level.
user about 11 years

@goldilocks It's not quite true that "100% of the fundamentals of a unix system are coded in C". At some level, you pretty much have to have a bit of assembler, or possibly assembly-like C. Examples might include the boot loader loader (no typo), the actual process of task switching, and a few other similarly low-level features. On top of that, though, I agree, C (or higher-level languages) are likely used throughout the code base.
goldilocks about 11 years

@MichaelKjörling : Point taken. That was exasperation; not sure if I did a good job of dispelling anyone's confusion about locale. :/