Is it bad to have accented characters in C++ source code?


Solution 1

The main issue with using non-ASCII characters in C++ source is that the compiler must be aware of the encoding used for the source. If the source is 7-bit ASCII then it usually doesn't matter, since almost all compilers assume an ASCII-compatible encoding by default.

Also, not all compilers are configurable as to the source encoding, so two compilers might unconditionally use incompatible encodings, meaning that non-ASCII characters can result in source code that can't be used with both.

  • GCC: has command-line options for setting the source, execution, and wide execution encodings (example invocations after this list). The defaults are set by the locale, which usually means UTF-8 these days.
  • MSVC: uses a so-called 'BOM' to determine the source encoding (distinguishing UTF-16BE/LE, UTF-8, and the system locale encoding), and always uses the system locale as the execution encoding. Edit: As of VS 2015 Update 2, MSVC supports compiler switches to control the source and execution charsets, including support for UTF-8; see here.
  • Clang: always uses UTF-8 as the source and execution encodings
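
For example, a rough sketch of the relevant invocations (these are real GCC and MSVC options; the file names are just placeholders):

# GCC: declare the source as Latin-1, produce UTF-8 strings in the compiled binary
g++ -finput-charset=ISO-8859-1 -fexec-charset=UTF-8 main.cpp

# MSVC (VS 2015 Update 2 and later): UTF-8 for both source and execution charsets
cl /utf-8 main.cpp
# or each one individually:
cl /source-charset:utf-8 /execution-charset:utf-8 main.cpp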

So consider what happens to your code that searches for an accented character if the string being searched is UTF-8 (perhaps because the execution character set is UTF-8). Whether or not the character literal 'é' works as you expect, you will not find the accented character, because in UTF-8 an accented character is not represented by any single byte. Instead you'd have to search for the corresponding byte sequence.
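
As a minimal sketch (assuming both the string being searched and the needle are UTF-8), searching for 'é' means searching for its two-byte UTF-8 sequence 0xC3 0xA9:

#include <iostream>
#include <string>

int main()
{
    std::string text = "R\xC3\xA9sum\xC3\xA9";           // "Résumé" spelled out as UTF-8 bytes
    std::string::size_type pos = text.find("\xC3\xA9");  // the UTF-8 encoding of 'é'
    if (pos != std::string::npos)
        std::cout << "found at byte offset " << pos << '\n';
}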


There are different kinds of escapes which C++ allows in character and string literals. Universal Character Names allow you to designate a Unicode code point, and will be handled exactly as if that character appeared in the source. For example \u00E9 or \U000000E9.

(Some other languages have \u to support code points up to U+FFFF, but lack C++'s support for code points beyond that, or make you use surrogate code points. You cannot use surrogate code points in C++; instead it has the \U variant to support all code points directly.)
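
As a small sketch, both spellings designate the same code point when used, for example, in a char32_t literal:

// \u00E9 and \U000000E9 are two spellings of the same UCN, U+00E9 (é).
constexpr char32_t a = U'\u00E9';
constexpr char32_t b = U'\U000000E9';
static_assert(a == b && a == 0xE9, "both spellings designate U+00E9");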

UCNs are also supposed to work outside of character and string literals; there they are restricted to characters not in the basic source character set. Until recently, however, compilers didn't implement this feature (which dates back to C++98). Now Clang appears to have fairly complete support, MSVC seems to have at least partial support, and GCC purports to provide experimental support with the option -fextended-identifiers.

Recall that UCNs are supposed to be treated identically to the actual character appearing in the source; thus compilers with good UCN identifier support also allow you to simply write identifiers using the actual character, so long as the compiler's source encoding supports that character in the first place.
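
A small sketch of what that looks like, assuming a compiler with extended-identifier support (e.g. GCC with -fextended-identifiers or a recent Clang):

// Both spellings refer to the same identifier, so only one definition may appear.
int caf\u00E9 = 1;     // UCN spelling
// int café = 1;       // the same identifier with the character written directly
int copy = caf\u00E9;  // used like any other identifier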

C++ also supports hex escapes. These are \x followed by any number of hexadecimal digits. A hex escape represents a single integral value, as though it were a single code point with that value, and no conversion to the execution charset is done on the value. If you need to represent a specific byte (or char16_t, or char32_t, or wchar_t) value independent of any encoding, this is what you want.
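
For instance, a small sketch of pinning down exact code unit values with hex escapes (no execution charset conversion is applied):

char     latin1_e_acute = '\xE9';   // the raw byte 0xE9 (é in Latin-1 / Windows-1252)
char16_t utf16_e_acute  = u'\xE9';  // the 16-bit code unit 0x00E9
wchar_t  wide_e_acute   = L'\xE9';  // likewise for wchar_t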

There are also octal escapes but they aren't as commonly useful as UCNs or hex escapes.


Here's the diagnostic that Clang shows when you use 'é' in a source file encoded with ISO-8859-1 or cp1252:

warning: illegal character encoding in character literal [-Winvalid-source-encoding]
    std::printf("%c\n",'<E9>');
                       ^

Clang issues this only as a warning and will just directly output a char object with the source byte's value. This is done for backwards compatibility with non-UTF-8 source code.

If you use UTF-8 encoded source then you get this:

error: character too large for enclosing character literal type
    std::printf("%c\n",'<U+00E9>');
                       ^

Clang detects that the UTF-8 encoding corresponds to the Unicode code point U+00E9, that this code point is outside the range a single char can hold, and so reports an error. (Clang escapes the non-ASCII character in the diagnostic as well, because it determined that the console it was run on couldn't print the non-ASCII character.)

Solution 2

Formally C++ supports a pretty good subset of Unicode even in identifiers, so in theory one could write identifiers with e.g. Norwegian characters, such as antallBlåbærsyltetøyGlass.

In practice, C++ implementations only support A through Z, digits 0 through 9, and underscore in identifiers. Some implementations also allow the dollar sign $, but the standard does not.

To specify a Unicode character in a text literal, you can use a universal character name, which isn't a name at all but more like an escape sequence, e.g. \u20AC (the Euro sign €). You can also write such characters directly if you save your source code as UTF-8. Note that Visual C++ requires a BOM (Byte Order Mark) in order to recognize UTF-8 source code as such.
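
For example, a small sketch of both spellings of the Euro sign in a wide string literal:

wchar_t const* a = L"Price: 100 \u20AC";  // universal character name
wchar_t const* b = L"Price: 100 €";       // written directly; requires the source to be
                                          // saved as UTF-8 (with a BOM for Visual C++)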

If you treat strings as UTF-8 encoded (i.e. char type, as is common in *nix) then an "é", which is outside the ASCII range 0...127, will not be a single char value, and thus can't be used as a case label in a switch.

However, this particular character is part of Latin-1, which is a subset of Windows ANSI Western, which is a one-byte-per-character encoding. So in a Western installation of Windows, using the ANSI encoding for string values, it is a single value and can be so used. Latin-1 is also a subset of Unicode (comprising the first 256 code points of Unicode), so with wchar_t based strings, e.g. std::wstring, and with those wide strings as Unicode, "é" is also a single value, namely the same value as in Latin-1 and in Windows ANSI Western.

Still, using wchar_t to represent Unicode is no guarantee that any arbitrary character will be a single value.

For example, in Windows a wchar_t is just 16 bits and the standard encoding is UTF-16, where characters outside the so-called Basic Multilingual Plane (the original 16-bit Unicode) are represented with two values called a surrogate pair. Worse, even with UTF-32 Unicode allows accented characters to be represented with two or more values: first a value representing the basic character and then values that modify it by adding accent marks and so on. So for full generality you cannot rely on characters being single values even with a 32-bit wchar_t.
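
A small sketch of that last point, assuming char32_t strings holding UTF-32: the precomposed é and the decomposed e plus combining accent render the same but are different sequences of values:

#include <string>

std::u32string composed   = U"\u00E9";   // 1 code point: U+00E9
std::u32string decomposed = U"e\u0301";  // 2 code points: 'e' + U+0301 COMBINING ACUTE ACCENT
// composed.size() == 1, decomposed.size() == 2, and composed != decomposed
// unless the strings are normalized first.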

Solution 3

Edit: To use a macro in a switch statement requires two changes to my original solution. First, every character must fit in an integral type; the best way to ensure this is to use wide characters with wchar_t. Second, the macro must be a character literal instead of a string literal, e.g.:

#define E_GRAVE L'\u00E8'

wchar_t someChar = ...;
switch(someChar)
{
   case E_GRAVE :
        x = 1;
        break;
   ...
}


One totally portable way is to define macros for the accented characters and rely on string concatenation.

// è (U+00E8) in UTF-8 encoding
#define E_GRAVE "\xC3\xA8"

cout << "Resum" E_GRAVE << endl;

This of course assumes that you are working with UTF-8. You can support any character set you want this way. Here's how you'd do it on Windows with UTF-16:

#define E_GRAVE L"\u00E8"

wchar_t * resume = L"Resum" E_GRAVE;

Comments

  • Celeritas
    Celeritas almost 2 years

    I want my program to be as portable as possible. I search a string for accented characters, e.g. è. Could this be a problem? Is there a C++ equivalent of HTML entities?

    It would be used in a switch statement, for example:

    switch(someChar) //someChar is of type char
    {
       case 'é' :
            x = 1;
            break;
       case 'è' :
       ...
    }
    
    • PlasmaHH
      PlasmaHH over 11 years
      Is that really a character as in a char, or a character as in an utf8 sequence of chars?
    • Mooing Duck
      Mooing Duck over 11 years
      What is the type of someChar?
  • Mooing Duck
    Mooing Duck over 11 years
    I was unaware that MSVC limited the identifier names in that way, but I confirmed it: msdn.microsoft.com/en-us/library/565w213d.aspx How very sad :(
  • Mooing Duck
    Mooing Duck over 11 years
There's no assurance that all characters can be held by a wchar_t. Many characters can't on Windows.
  • Mark Ransom
    Mark Ransom over 11 years
    @MooingDuck, if that's your situation then the use of a switch statement is going to be impossible. But I think it holds for the characters people care the most about.
  • Mooing Duck
    Mooing Duck over 11 years
    nonsense, int32_t is guaranteed to hold any unicode codepoint.
  • Celeritas
    Celeritas over 11 years
According to the cited question, _t means type. So when you speak of int32 do you mean anything declared as an int on a 32-bit system? What does the w in wchar_t mean? stackoverflow.com/questions/231760/…
  • Mark Ransom
    Mark Ransom over 11 years
    @Celeritas the w stands for wide. It means a character type that's larger than the standard char, and is implementation-dependent. I think it's 16 bits on Windows and 32 bits on Linux.
  • edA-qa mort-ora-y
    edA-qa mort-ora-y over 11 years
One could use char32_t literals now, like U'é', though the file encoding might still be an issue. Perhaps the escape sequence would be safer (but still as a char32_t literal)
  • edA-qa mort-ora-y
    edA-qa mort-ora-y over 11 years
    How about char32_t now that we have unicode in the standard (compiler support notwithstanding)
  • bames53
    bames53 over 11 years
    @MooingDuck The C++ standard requires that a wchar_t can hold any character representable in any locale the implementation supports. But then as far as I know Windows doesn't support any locale that has any characters outside the BMP, so Windows' use of UTF-16 as the wchar_t encoding won't technically be non-conforming until they, for example, start supporting a UTF-8 locale. Or if they have some, say, east Asian locale that already uses a non-BMP character then they'd already be non-conforming.
  • Mooing Duck
    Mooing Duck over 11 years
    @edA-qamort-ora-y: As his last statement clarifies, the character might be encoded as two codepoints, in which case, not even a char32_t will hold it.
  • Mooing Duck
    Mooing Duck over 11 years
    @bames53: Windows has UTF8 and East Asian locales, so in this regard they are technically nonconforming. However, even if it were 32 bits, you'd still have problems with characters composed of multiple codepoints.
  • bames53
    bames53 over 11 years
    @MooingDuck Windows has some support for UTF-8, but no locale that uses it. wchar_t is only required to support characters that appear in locales, so it would only count if Windows supported, for example, English_United States.65001. Also the east-Asian locale was just a possibility; they could all be limited to just BMP Asian characters. I'd have to check the character charts to see.
  • Celeritas
    Celeritas over 11 years
    I googled but couldn't find a list of universal character names. Where did you find \U000000E9?
  • bames53
    bames53 over 11 years
    UCNs use the Unicode character's number (a.k.a. short name): LATIN SMALL LETTER E WITH ACUTE