Correctly reading a utf-16 text file into a string without external libraries?

Solution 1

When you open a file for UTF-16, you must open it in binary mode. This is because in text mode, certain bytes are interpreted specially: 0x0d is filtered out completely and 0x1a marks the end of the file. Some UTF-16 characters will have one of those bytes as half of their code unit, and that will corrupt the reading of the file. This is not a bug; it is intentional behavior, and it is the sole reason for having separate text and binary modes.

For the reason why 0x1a is considered the end of a file, see this blog post from Raymond Chen tracing the history of Ctrl-Z. It's basically backwards compatibility run amok.

Solution 2

The C++11 solution (supported on your platform by Visual Studio since 2010, as far as I know) would be:

#include <fstream>
#include <iostream>
#include <locale>
#include <codecvt>

int main()
{
    // open as a byte stream
    std::wifstream fin("text.txt", std::ios::binary);
    // apply BOM-sensitive UTF-16 facet
    fin.imbue(std::locale(fin.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    // read; prints each character's numeric code point in hex
    for (wchar_t c; fin.get(c); )
        std::cout << std::showbase << std::hex << c << '\n';
}

Solution 3

Edit:

So it appears that the issue was that Windows treats certain magic byte sequences as the end of the file in text mode. This is solved by opening the file in binary mode, std::ifstream fin("filename", std::ios::binary);, and then copying the data into a wstring as you already do.



The simplest, non-portable solution would be to just copy the file data into a wchar_t array. This relies on the fact that wchar_t on Windows is 2 bytes and uses UTF-16 as its encoding.


You'll have a bit of difficulty converting UTF-16 to the locale specific wchar_t encoding in a completely portable fashion.

Here's the Unicode conversion functionality available in the standard C++ library (though VS 10 and 11 implement only items 3, 4, and 5):

  1. codecvt<char32_t,char,mbstate_t>
  2. codecvt<char16_t,char,mbstate_t>
  3. codecvt_utf8
  4. codecvt_utf16
  5. codecvt_utf8_utf16
  6. c32rtomb/mbrtoc32
  7. c16rtomb/mbrtoc16

And what each one does:

  1. A codecvt facet that always converts between UTF-8 and UTF-32
  2. converts between UTF-8 and UTF-16
  3. converts between UTF-8 and UCS-2 or UCS-4 depending on the size of target element (characters outside BMP are probably truncated)
  4. converts between a sequence of chars using a UTF-16 encoding scheme and UCS-2 or UCS-4
  5. converts between UTF-8 and UTF-16
  6. If the macro __STDC_UTF_32__ is defined these functions convert between the current locale's char encoding and UTF-32
  7. If the macro __STDC_UTF_16__ is defined these functions convert between the current locale's char encoding and UTF-16

If __STDC_ISO_10646__ is defined, then converting directly using codecvt_utf16&lt;wchar_t&gt; should be fine, since that macro indicates that wchar_t values in all locales correspond to the short names of Unicode characters (and so implies that wchar_t is large enough to hold any such value).

Unfortunately there's nothing defined that goes directly from UTF-16 to wchar_t. It's possible to go UTF-16 -> UCS-4 -> mb (if __STDC_UTF_32__) -> wc, but you'll lose anything that's not representable in the locale's multibyte encoding. And of course, no matter what, converting from UTF-16 to wchar_t will lose anything not representable in the locale's wchar_t encoding.


So it's probably not worth trying to be portable; instead you can just read the data into a wchar_t array, or use some other Windows-specific facility, such as the _O_U16TEXT mode on files.

This should build and run anywhere, but makes a bunch of assumptions to actually work:

#include <cstring>
#include <fstream>
#include <sstream>
#include <iostream>

int main()
{
    std::stringstream ss;
    std::ifstream fin("filename", std::ios::binary); // binary mode, per the edit above
    ss << fin.rdbuf(); // dump file contents into a stringstream
    std::string const &s = ss.str();
    if (s.size() % sizeof(wchar_t) != 0)
    {
        std::cerr << "file not the right size\n"; // must be even: two bytes per code unit
        return 1;
    }
    std::wstring ws;
    ws.resize(s.size() / sizeof(wchar_t));
    std::memcpy(&ws[0], s.c_str(), s.size()); // copy the raw bytes into the wstring
}

You should probably at least add code to handle endianness and the BOM. Also, Windows newlines don't get converted automatically, so you need to do that manually.


Updated on July 10, 2022

Comments

  • neminem
    neminem almost 2 years

    I've been using StackOverflow since the beginning, and have on occasion been tempted to post questions, but I've always either figured them out myself or found answers posted eventually... until now. This feels like it should be fairly simple, but I've been wandering around the internet for hours with no success, so I turn here:

    I have a pretty standard utf-16 text file, with a mixture of English and Chinese characters. I would like those characters to end up in a string (technically, a wstring). I've seen a lot of related questions answered (here and elsewhere), but they're either looking to solve the much harder problem of reading arbitrary files without knowing the encoding, or converting between encodings, or are just generally confused about "Unicode" being a range of encodings. I know the source of the text file I'm trying to read, it will always be UTF16, it has a BOM and everything, and it can stay that way.

I had been using the solution described here, which worked for text files that were all English, but after encountering certain characters, it stopped reading the file. The only other suggestion I found was to use ICU, which would probably work, but I'd really rather not include a whole large library in an application for distribution, just to read one text file in one place. I don't care about system independence, though - I only need it to compile and work in Windows. A solution that didn't rely on that fact would be prettier, of course, but I would be just as happy for a solution that used the STL while relying on assumptions about Windows architecture, or even solutions that involved Win32 functions, or ATL; I just don't want to have to include another large 3rd-party library like ICU. Am I still totally out of luck unless I want to reimplement it all myself?

    edit: I'm stuck using VS2008 for this particular project, so C++11 code sadly won't help.

    edit 2: I realized that the code I had been borrowing before didn't fail on non-English characters like I thought it was doing. Rather, it fails on specific characters in my test document, among them ':' (FULLWIDTH COLON, U+FF1A) and ')' (FULLWIDTH RIGHT PARENTHESIS, U+FF09). bames53's posted solution also mostly works, but is stumped by those same characters?

    edit 3 (and the answer!): the original code I had been using -did- mostly work - as bames53 helped me discover, the ifstream just needed to be opened in binary mode for it to work.

  • bames53
    bames53 about 12 years
    On platforms with a two byte wchar_t like Windows this will convert from UTF-16 to UCS-2. Specifically the VS2010 implementation truncates characters outside the BMP.
  • Cubbi
    Cubbi about 12 years
    @bames53 Indeed.. VS2010 reads those characters into char32_t correctly, but there's not a lot that can be done with a UCS4 string on Windows. It's probably too early to get rid of compiler-dependent stuff like _O_U16TEXT.
  • neminem
    neminem about 12 years
    Annoying, I tried your snippet, and while at first I thought it wasn't working (when I saw it print integers rather than unicode characters), then I noticed that was what it was supposed to be doing. I replaced the cout with appending to a wstring, and saw the unicode string I was expecting to see. I say "annoyingly" because I hadn't thought it was important to mention I'm stuck at vs2008 for this particular project, until now. (I have so edited my question.) This is still a correct answer, though, assuming you're allowed to use C++11. Or barring characters outside the BMP it is, anyway.
  • neminem
    neminem about 12 years
    Well, turns out, your code helped me debug - it stopped reading in exactly the same place in my sample text file as the code I linked to - (cfc.kizzx2.com/index.php/… - did. Turns out it wasn't stopping at a Chinese character, it stopped reading at the first instance of a : (FULLWIDTH COLON, U+FF1A) character. Removing that, it then stops at ) (FULLWIDTH RIGHT PARENTHESIS, U+FF09). I'm sensing a theme...
  • bames53
    bames53 about 12 years
    @neminem I guess I should have looked more closely at that link, it's just doing the same thing as I show. I'm guessing that for whatever reason, the VS 2008 implementation of fstream does not like reading the byte 0xFF. That byte represents 'delete'. Try opening the file in binary mode std::ifstream fin("...",std::ios::binary);
  • neminem
    neminem about 12 years
    Oh my frelling god. I spent over a day trying to figure it out, and it was that obvious? I tried -other- things that involved opening the file in binary mode, but I never tried the -original- solution only opening it in binary mode? You win so much. You should edit that into your solution, in case other people stumble on this question later (I can't imagine I'm the only person who's ever had this issue) :).
  • Mark Ransom
    Mark Ransom about 12 years
    It's not a bug - see my answer.
  • bames53
    bames53 about 12 years
    @MarkRansom That makes sense, though I'd have expected it to only have an effect on Windows when 0x0D and 0x0A appear together. The 0x1A seems like a bug by design, but since none of this stuff is standardized it's probably best to never use text mode anywhere.
  • NoSenseEtAl
    NoSenseEtAl almost 12 years
    do you know how writing back to file goes? I try : std::wofstream wofs("/utf16dump.txt"); wofs.imbue(std::locale(wofs.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>)); wofs << ws; and I get garbage
  • Cubbi
    Cubbi almost 12 years
    @NoSenseEtAl works for me, produces UTF-16be, as requested (using clang++/libcxx). Perhaps you needed std::little_endian?
  • Eugene
    Eugene about 11 years
    std::consume_header doesn't seem to work in VS2010 -- BOM is consumed, but byte order is not affected. I had to explicitly use std::little_endian too.
  • hkBattousai
    hkBattousai about 8 years
    Why do you open the file in the binary mode?
  • Cubbi
    Cubbi about 8 years
    @hkBattousai because I don't want the read to terminate if it runs into \x1a. Windows is crazy like that.
  • zar
    zar over 4 years
    For readers, replace the last line with std::wcout << c << '\n'; to see Unicode characters output.
  • bfx
    bfx almost 4 years
    Note that on macOS I had to explicitly set std::little_endian instead of std::consume_header for a file encoded as UTF-16 LE that included the respective BOM. Otherwise I would receive big endian output.
  • Chris Guzak
    Chris Guzak almost 3 years
MSVC's version says this use of std::codecvt is deprecated in C++17, see _CXX17_DEPRECATE_CODECVT_HEADER. I don't see this mentioned here: en.cppreference.com/w/cpp/locale/codecvt
  • Cubbi
    Cubbi almost 3 years
    @ChrisGuzak std::codecvt was not deprecated. The codecvt header and its contents were - cppreference notes that on en.cppreference.com/w/cpp/… and individual pages