Is there such a thing as a non-UTF-8 character?
Solution 1
Yes. 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF are invalid UTF-8 code units. A UTF-8 code unit is 8 bits. If by char you mean an 8-bit byte, then the invalid UTF-8 code units would be char values that do not appear in UTF-8 encoded text.
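As a rough illustration of the list above, the following sketch tests whether a byte is one of those never-valid values. The function names are hypothetical and chosen only for this example, and the check assumes 8-bit bytes.

```cpp
#include <string>

// Returns true for byte values that never occur anywhere in well-formed
// UTF-8: 0xC0, 0xC1, and 0xF5 through 0xFF.
constexpr bool is_never_valid_utf8_byte(unsigned char b) {
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}

// Hypothetical helper: true if `text` contains at least one such byte.
// Each char is converted to unsigned char first so the comparison also
// works on platforms where char is signed.
inline bool contains_never_valid_byte(const std::string& text) {
    for (char c : text) {
        if (is_never_valid_utf8_byte(static_cast<unsigned char>(c))) {
            return true;
        }
    }
    return false;
}
```

Note that this only detects bytes that are invalid everywhere. It is not a full UTF-8 validator, since sequences built from otherwise legal bytes can still be malformed.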
Solution 2
std::string only knows about raw char values; it knows nothing about the particular character encodings that use char to hold encoded values.
Many common UTF-8 implementations use char to hold encoded code units (though C++20 introduces char8_t and std::u8string for this purpose instead). But other character encodings (Windows-12##, ISO-8859-#, etc.) can fit their encoded values in char elements, too.
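To make the point about std::string being encoding-agnostic concrete, here is a minimal sketch. It assumes the string holds well-formed UTF-8, and the helper name is hypothetical. std::string::size() reports char elements, while the helper counts code points by skipping continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string that is ASSUMED to contain well-formed
// UTF-8, by counting every byte that is not a continuation byte (10xxxxxx).
inline std::size_t count_utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (char c : s) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "café" (the é encoded as 0xC3 0xA9), size() is 5 while the helper returns 4. std::string itself has no way to tell the difference.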
Any char value that falls within the ASCII range (0x00 .. 0x7F) will fit in 1 char and map to the same codepoint value in Unicode (U+0000 .. U+007F), but any char value in the ANSI range but not in the ASCII range (0x80 .. 0xFF) is subject to interpretation by whatever character encoding created the char values. Some encodings use 1 char per character, some use multiple chars.
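As one example of how the 0x80 .. 0xFF range depends on the encoding, the sketch below re-encodes text as UTF-8 under the assumption that the input is ISO-8859-1, where each byte value equals its Unicode codepoint. The function name is hypothetical. A byte such as 0xE9 is a complete character in ISO-8859-1 but becomes the two-byte sequence 0xC3 0xA9 in UTF-8.

```cpp
#include <string>

// Re-encodes text that is ASSUMED to be ISO-8859-1 as UTF-8.
// Bytes below 0x80 are copied unchanged; bytes 0x80..0xFF map to the
// codepoints U+0080..U+00FF, which need two UTF-8 code units each.
inline std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));  // continuation byte
        }
    }
    return out;
}
```

The same input bytes would need a different mapping if they came from another single-byte encoding such as a Windows code page, which is why the source encoding has to be known in advance.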
So yes, there is such a thing as a "non-UTF-8 char".
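For the delimiter use case in the question, one possible approach is sketched below. It assumes every field is well-formed UTF-8 and picks 0xFF as the separator, since that byte never appears in well-formed UTF-8. The names kDelimiter and split_on_delimiter are hypothetical.

```cpp
#include <string>
#include <vector>

// 0xFF never occurs in well-formed UTF-8, so it cannot collide with the
// payload as long as every field really is valid UTF-8 (an assumption here).
constexpr char kDelimiter = static_cast<char>(0xFF);

// Splits `joined` at every delimiter byte. An input without any delimiter
// yields a single field, and empty fields are preserved.
inline std::vector<std::string> split_on_delimiter(const std::string& joined) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type pos = joined.find(kDelimiter, start);
        if (pos == std::string::npos) {
            fields.push_back(joined.substr(start));
            break;
        }
        fields.push_back(joined.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

Keep in mind that the joined string as a whole is no longer valid UTF-8, so it should be split again before being handed to any code that validates or decodes UTF-8.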
user643605
Updated on June 04, 2022
Comments
-
user643605 almost 2 years
Trying to implement C++ code where we could use a non-UTF-8 char as a delimiter inside a std::string.
Is there such a thing as a non-UTF-8 char?
-
Anonymous Anonymous over 4 years
But the C++ standard still requires char to have a size of exactly 1 byte. Assuming the standard 8 bits = 1 byte, any UTF-8 char will always fit into char.
-
Remy Lebeau over 4 years
A UTF-8 encoded codeunit can be made to fit in a char, yes. UTF-8 is an 8-bit encoding, but char may be either signed or unsigned depending on compiler implementation. In case of signed, all codeunits of any Unicode codepoint above U+007F will occupy the sign bit of each char. Also note that although char is guaranteed to be 1 byte in size, a byte is not guaranteed to be 8 bits on all platforms (though on most, it is) - see CHAR_BIT in limits.h. UTF-7, on the other hand, would fit nicely in a char string without using the sign bit at all.
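The signedness point in the comment above can be sketched as follows. Where char is signed, a byte such as 0xC3 is stored as a negative value, so comparing it directly against the literal 0xC3 fails. Converting to unsigned char first avoids that. The helper names are hypothetical, and the static_assert encodes the assumption of 8-bit bytes.

```cpp
#include <climits>

// This sketch ASSUMES 8-bit bytes; CHAR_BIT may be larger on unusual platforms.
static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Compares a char against a byte value regardless of whether char is signed.
// With a signed char, `c == 0xC3` is false even for the byte 0xC3, because
// c promotes to a negative int while 0xC3 is the positive int 195.
inline bool byte_equals(char c, unsigned char value) {
    return static_cast<unsigned char>(c) == value;
}

// True if the sign bit (high bit) is set, which is the case for every code
// unit of a codepoint above U+007F in UTF-8.
inline bool has_high_bit(char c) {
    return (static_cast<unsigned char>(c) & 0x80) != 0;
}
```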