Does (w)ifstream support different encodings?

c++ unicode stl character-encoding wifstream

21,811

Solution 1

C++ supports character encodings by means of std::locale and the facet std::codecvt. The general idea is that a locale object describes the aspects of the system that might vary from culture to culture and from (human) language to language. These aspects are broken down into facets: classes whose instances are held by a locale and consulted by localization-dependent objects (including I/O streams). When you read from an istream or write to an ostream, the actual reading or writing of each character is filtered through the locale's facets. The facets cover not only encoding of Unicode types but such varied features as how large numbers are written (e.g. with commas or periods), currency, time, capitalization, and a slew of other details.
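
To make the facet idea concrete, here is a minimal sketch using the number-formatting facet std::numpunct rather than codecvt, since it is easier to observe. The names comma_grouping and format_with_commas are made up for this illustration; only the std:: pieces are standard.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: group digits in threes, separated by ','.
struct comma_grouping : std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats n through a stream whose locale carries the facet above.
std::string format_with_commas(long n) {
    std::ostringstream out;
    // The new locale copies the stream's current locale and replaces its
    // numpunct<char> facet; the locale takes ownership of the pointer.
    out.imbue(std::locale(out.getloc(), new comma_grouping));
    out << n;
    return out.str();
}
```

Calling format_with_commas(1234567) should yield "1,234,567". The stream code itself never changes; only the locale it was imbued with does, which is the same mechanism codecvt relies on for encodings.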

However, just because the facilities exist to do encodings doesn't mean the standard library actually handles all encodings, nor does it make such code simple to get right. Even something as basic as the size of the character type you should be reading into (let alone the encoding part) is difficult: wchar_t can be too small (mangling your data) or too large (wasting space), and the most common compilers (e.g. Visual C++ and GNU C++) differ on how big their wchar_t is (16 bits with Visual C++, typically 32 bits with GCC on Linux). So you generally need to find external libraries to do the actual encoding.

iconv is generally acknowledged to be correct, but examples of how to bind it to the C++ mechanism are hard to find.
jla3ep mentions libICU, which is very thorough, but its C++ API does not try to play nicely with the standard (as far as I can tell; you can scan the examples to see if you can do better).
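
Regarding the wchar_t size concern above, a quick way to see what your own compiler does is a sketch like the following. The helper names wchar_bits and wchar_holds_any_code_point are invented for this example; the threshold 0x10FFFF is the largest Unicode code point.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Number of bits in wchar_t on the current platform.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when one wchar_t can hold any Unicode code point (up to U+10FFFF).
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max())
           >= 0x10FFFFul;
}
```

If wchar_holds_any_code_point() is false on your platform, characters outside the Basic Multilingual Plane cannot be stored in a single wchar_t, which is the "too small" case described above.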

The most straightforward example I can find that covers all the bases is from Boost's UTF-8 codecvt facet, with an example that specifically tries to encode UTF-8 (UCS4) for use by IO streams. It looks like this, though I don't suggest just copying it verbatim. It takes a little more digging in the source to understand it (and I don't claim to):

typedef wchar_t ucs4_t;

std::locale old_locale;
std::locale utf8_locale(old_locale,new utf8_codecvt_facet<ucs4_t>);

...

std::wifstream input_file("data.utf8");
input_file.imbue(utf8_locale);
ucs4_t item = 0;
while (input_file >> item) { ... }
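
For comparison, here is a minimal sketch of the same imbue pattern without Boost. It assumes a C++11 or later standard library that still ships <codecvt>; std::codecvt_utf8 was deprecated in C++17, so treat this as illustrative rather than future-proof. The helper name read_utf8_file is made up for this example.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Reads a whole UTF-8 encoded file into a wide string.
// Assumption: the file is valid UTF-8 and has no byte order mark.
std::wstring read_utf8_file(const std::string& path) {
    std::wifstream in(path, std::ios::binary);
    // Imbue before the first read so every byte passes through the facet.
    // The locale takes ownership of the facet pointer.
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream buffer;
    buffer << in.rdbuf();  // pulls the entire file through the stream buffer
    return buffer.str();
}
```

On a platform with a 16-bit wchar_t this facet produces UCS-2, so code points above U+FFFF are not representable there.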

To understand more about locales, and how they use facets (including codecvt), take a look at the following:

Nathan Myers has a thorough explanation of locales and facets. Myers was one of the designers of the locale concept. He has more formal documentation if you want to wade through it.
Apache's Standard Library implementation (formerly RogueWave's) has a full list of facets.
Nicolai Josuttis' The C++ Standard Library Chapter 14 is devoted to the subject.
Angelika Langer and Klaus Kreft's Standard C++ IOStreams and Locales devotes a whole book to it.

Solution 2

ifstream does not care about the encoding of the file. It just reads chars (bytes) from the file. wifstream reads wide characters (wchar_t), but it still doesn't know anything about the file's encoding. wifstream is good enough for UCS-2, a fixed-length character encoding for Unicode (each character is represented with two bytes).
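
A small sketch of the "just reads bytes" point: the helper below (read_all_bytes is a name invented for this example) slurps a file through a plain ifstream and returns exactly the bytes on disk, with no decoding.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Returns the raw bytes of the file; no encoding conversion takes place.
// An unreadable file yields an empty string in this sketch.
std::string read_all_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

If the file holds the UTF-8 text "hé", the returned string has three chars, because 'é' is stored as the two bytes 0xC3 0xA9. Turning those bytes into characters is left to you or to a library.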

You could use IBM's ICU library to deal with Unicode files.

The International Components for Unicode (ICU) is a mature, portable set of C/C++ and Java libraries for Unicode support, software internationalization (I18N) and globalization (G11N), giving applications the same results on all platforms.

ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.

peterchen

Updated on July 09, 2022

Comments

peterchen almost 2 years

When I read a text file into a wide character string (std::wstring) using a wifstream, does the stream implementation support different encodings, i.e. can it be used to read e.g. ASCII, UTF-8, and UTF-16 files?

If not, what would I have to do?

(I need to read the entire file, if that makes a difference)
Martin York over 14 years

How to imbue streams with locale: stackoverflow.com/questions/207662/…
quark over 14 years

I think it's slightly more correct to say that ifstream abstracts over the encoding. It makes use of it through lower-level facilities: locales (for standard C++), and OS- or library-specific i18n functions. That is, ifstream may not care, but you do care what it calls in this case.
Kirill V. Lyadvinsky over 14 years

Locales have nothing to do with the encodings of Unicode. When you set the locale, you just give iostream a hint about how it should represent symbols on the console. But you cannot detect the encoding of the file, and it is impossible to distinguish ANSI from UTF-8 by using ifstream.
sbi over 14 years

Nice summary. You might want to add amazon.com/dp/0201183951 to your book list. It's the most thorough treatment of the issue I know.
quark over 14 years

sbi: Added the book to the list. Thanks for the nice link.