Read Unicode UTF-8 file into wstring

c++ file unicode utf-8 wstring

Solution 1

With C++11 support, you can use the std::codecvt_utf8 facet, which converts between a UTF-8 encoded byte string and a UCS-2 or UCS-4 character string. It can be used to read and write UTF-8 files, both text and binary. Note that the <codecvt> header is deprecated as of C++17.

To use a facet, you usually create a locale object, which encapsulates culture-specific information as a set of facets that collectively define a specific localized environment. Once you have a locale object, you can imbue your stream with it:

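#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    // std::locale::empty() is a Microsoft extension; a default-constructed
    // std::locale (a copy of the global locale) is the portable base.
    wif.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}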

which can be used like this:

std::wstring wstr = readFile("a.txt");

Alternatively, you can set the global C++ locale before you create any streams. All later calls to the std::locale default constructor then return a copy of the global C++ locale, and newly constructed streams pick it up, so you don't need to imbue each stream explicitly:

std::locale::global(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
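Since <codecvt> is deprecated as of C++17, it can be useful to have a fallback that does not depend on it. The following is only a minimal sketch, not part of the original answer: it reads the raw bytes with a narrow stream and decodes them by hand. The names decodeUtf8 and readUtf8File are hypothetical. The decoder replaces malformed sequences with U+FFFD, does not reject overlong encodings, and does not strip a leading byte order mark. Where wchar_t is 16 bits wide (as on Windows), code points above U+FFFF are stored as surrogate pairs.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into a wstring.
// Malformed input is replaced with U+FFFD instead of being rejected.
std::wstring decodeUtf8(const std::string& bytes)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else                            { cp = 0xFFFD;      extra = 0; }
        ++i;
        for (; extra > 0; --extra, ++i) {
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;  // truncated or invalid continuation byte
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        if (cp > 0x10FFFF)
            cp = 0xFFFD;      // outside the Unicode range
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t: encode as a UTF-16 surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}

// Hypothetical helper: read a whole file as bytes, then decode it.
// Returns an empty string if the file cannot be opened.
std::wstring readUtf8File(const char* filename)
{
    std::ifstream in(filename, std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return decodeUtf8(bytes);
}
```

It would be called the same way as readFile above, for example std::wstring wstr = readUtf8File("a.txt");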

Solution 2

According to a comment by @Hans Passant, the simplest way is to use _wfopen_s. Open the file with mode rt, ccs=UTF-8.

Here is another pure C++ solution that works at least with VC++ 2010:

#include <locale>
#include <codecvt>
#include <string>
#include <fstream>
#include <cstdlib>

int main() {
    const std::locale empty_locale = std::locale::empty();
    typedef std::codecvt_utf8<wchar_t> converter_type;
    const converter_type* converter = new converter_type;
    const std::locale utf8_locale = std::locale(empty_locale, converter);
    std::wifstream stream(L"test.txt");
    stream.imbue(utf8_locale);
    std::wstring line;
    std::getline(stream, line);
    std::system("pause");
}

Except for locale::empty() (a Microsoft extension; a default-constructed std::locale, which is a copy of the global locale, should work as well) and the wchar_t* overload of the basic_ifstream constructor, this should be fairly standard-compliant (where "standard" means C++11).
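The example above reads only the first line of the file. As a rough sketch of how the same idea extends to the whole file, the function below loops over std::getline and collects every line. The name readLines is hypothetical, and the sketch assumes a default-constructed std::locale as the base locale and a narrow filename so that it does not rely on Microsoft extensions. It still uses std::codecvt_utf8, which is deprecated as of C++17.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: read every line of a UTF-8 file into wide strings.
// Returns an empty vector if the file cannot be opened.
std::vector<std::wstring> readLines(const char* filename)
{
    std::wifstream stream(filename);
    // The locale takes ownership of the facet and deletes it when the
    // last locale referring to it is destroyed, so no delete is needed.
    stream.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(stream, line))
        lines.push_back(line);
    return lines;
}
```

The loop stops at end of file or on the first conversion error, so a caller that needs to tell the two apart would have to inspect the stream state inside the function.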

Solution 3

Here's a platform-specific function for Windows only:

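#include <cstdio>       // FILE, fread, fclose
#include <string>
#include <sys/stat.h>   // struct _stat
#include <wchar.h>      // _wstat, _wfopen (Microsoft-specific)

// Returns the file size in bytes, or 0 if the file cannot be queried.
size_t GetSizeOfFile(const std::wstring& path)
{
    struct _stat fileinfo;
    if (_wstat(path.c_str(), &fileinfo) != 0)
        return 0;
    return static_cast<size_t>(fileinfo.st_size);
}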

std::wstring LoadUtf8FileToString(const std::wstring& filename)
{
    std::wstring buffer;            // stores file contents
    FILE* f = _wfopen(filename.c_str(), L"rtS, ccs=UTF-8");

    // Failed to open file
    if (f == NULL)
    {
        // ...handle some error...
        return buffer;
    }

    size_t filesize = GetSizeOfFile(filename);

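    // Read the entire file contents into memory. The size in bytes is an
    // upper bound on the number of wchar_t values, because a UTF-8 sequence
    // never decodes to more UTF-16 code units than it has bytes.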
    if (filesize > 0)
    {
        buffer.resize(filesize);
        size_t wchars_read = fread(&(buffer.front()), sizeof(wchar_t), filesize, f);
        buffer.resize(wchars_read);
        buffer.shrink_to_fit();
    }

    fclose(f);

    return buffer;
}

Use like so:

std::wstring mytext = LoadUtf8FileToString(L"C:\\MyUtf8File.txt");

Note that the entire file is loaded into memory, so you might not want to use it for very large files.

Solution 4

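#include <iostream>
#include <fstream>
#include <string>
#include <locale>

int main()
{
    // std::locale throws std::runtime_error if the named locale is not
    // installed; any available UTF-8 locale (for example "en_US.UTF-8")
    // can be used instead of "zh_CN.UTF-8".
    const std::locale utf8_locale("zh_CN.UTF-8");

    std::wifstream wif("filename.txt");
    wif.imbue(utf8_locale);

    std::wcout.imbue(utf8_locale);
    std::wcout << wif.rdbuf();
}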
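The named-locale approach depends on which locales are installed, and constructing std::locale with an unknown name throws std::runtime_error. As a hedged sketch (the helper name firstAvailableLocale and the candidate names in the usage note are assumptions, not part of the original answer), one way to cope is to try several names in order and fall back to the classic "C" locale:

```cpp
#include <initializer_list>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: return the first named locale that the system
// provides, or the classic "C" locale if none can be constructed.
std::locale firstAvailableLocale(std::initializer_list<const char*> names)
{
    for (const char* name : names) {
        try {
            return std::locale(name);
        } catch (const std::runtime_error&) {
            // This locale is not installed; try the next candidate.
        }
    }
    return std::locale::classic();
}
```

A call such as wif.imbue(firstAvailableLocale({"zh_CN.UTF-8", "en_US.UTF-8", "C.UTF-8"})) would then avoid an uncaught exception, although the classic fallback does not decode UTF-8, so the caller should check the returned locale's name() before relying on it.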

Abdelwahed

Updated on July 09, 2022

Comments

Abdelwahed almost 2 years

How can I read a Unicode (UTF-8) file into wstring(s) on the Windows platform?
- dan04 over 13 years
  
  By "Unicode" do you mean UTF-8 or UTF-16? And what platform are you using?
- Nawaz over 13 years
  
  Read this article: Reading UTF-8 with C++ streams
- Nawaz over 13 years
  
  Another good article: UTF-8 with C++ in a Portable Way
- anno over 13 years
  
  On Windows, you should use std::string for UTF-8 and std::wstring for UTF-16.
user1703401 over 13 years

Might as well go the whole way: _wfopen(filename.c_str(), L"rt, ccs=UTF-8"); Conversion is now automatic.
David Heffernan over 13 years

I think you can use wstring with UTF-16
AshleysBrain over 13 years

Actually, I rolled it back: the docs on _wfopen say it converts to wide characters automatically, and this code doesn't take that into account.
ThomasMcLeod over 13 years

@David: Actually you are incorrect, and this is a common misunderstanding. UTF-16 covers 1,112,064 code points from 0 to 0x10FFFF. The scheme requires variable-length storage of either one or two 16-bit words, whereas UCS-2 was strictly one 16-bit word. If you trace back the definition of wchar_t, you will find that it has at its root a primitive 16-bit type (usually a short).
user1703401 over 13 years

Only the filename. Quote: Simply using _wfopen has no effect on the coded character set used in the file stream.
Philipp over 13 years

@David: Technically, a wstring is just an array of 16-bit integers on Windows. You can store UCS-2 or UTF-16 data or whatever you like in it. Most Windows APIs do accept UTF-16 strings nowadays.
David Heffernan over 13 years

@Philipp I thought all Windows APIs are UTF-16 now. Which ones take UCS-2?
David Heffernan over 13 years

@Thomas I'm afraid the misunderstanding is on you. I know about variable length of UTF-16 and surrogate pairs. But that is perfectly compatible with wstring. A surrogate pair takes 2 wchar_t elements.
ThomasMcLeod over 13 years

@Philipp: you can store a subset of UTF-16 characters in a wstring. For example, you cannot store the Balinese script characters in a wstring, but there are valid UTF-16 encodings for these characters. en.wikipedia.org/wiki/Balinese_script
David Heffernan over 13 years

@Thomas that's not correct. UTF-16 uses 16 bit code units, i.e. a wchar_t on Windows.
Philipp over 13 years

@Thomas I have to agree with David. You can store any Unicode code point in a wstring if you treat it as a UTF-16 string. Non-BMP code points will need two code units, but there's nothing wrong with that.
ThomasMcLeod over 13 years

@Philipp: scratch my previous comment. I meant to refer to the Brāhmī script, which is even more obscure.
Philipp over 13 years

@David: I think (but I'm not sure, I'm not using Windows right now) that the console still doesn't handle non-BMP characters. It is debatable whether that has something to do with the API itself.
David Heffernan over 13 years

@Thomas anything with a defined Unicode code point can be represented in UTF-16
David Heffernan over 13 years

@Philipp the console is a whole world of pain! Even getting it to display non-ANSI code points is an exercise in extreme masochism!
Philipp over 13 years

@David: No, it's two lines, see blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx
AshleysBrain over 13 years

Are you sure? The way I interpreted the docs, specifying t in the mode as well as ccs=UTF-8 causes characters to be converted as they are read to and from the stream.
David Heffernan over 13 years

@Philipp Very interesting! I'm used to Python on Windows which has rubbish console support.
ThomasMcLeod over 13 years

@David: We seem to be arguing about semantics. You said "I think you can use wstring with UTF-16." That means more than store. It means store and have it interpreted correctly by at least stdio. I just tried using SMP characters with wcout and a wstring on Windows 7 Pro 64-bit, and got a whole lot of gibberish.
David Heffernan over 13 years

@Thomas That doesn't mean the problem is with wstring.
Philipp over 13 years

@Ashley: Yes, the quote refers to using _wfopen without the ccs= mode specifier. You need both _wfopen (according to the manual _wfopen_s is to be preferred) and ccs=UTF-8.
Philipp over 13 years

@David I think that's a Python problem, not a Windows problem. I know the Python devs try hard to get Unicode support everywhere, but I think it's hard to bring the actual Windows semantics to a model that assumes that operating system streams are always byte-based and encoding-agnostic (that is true for Unix file and console streams and for Windows file streams, but not for the Windows console). I haven't studied the Python source code, but I think that at least some time in the past they assumed this model to hold.
David Heffernan over 13 years

@Philipp It's just a real shame that the Windows console feels a little neglected.
Philipp over 13 years

@Thomas: I don't think the MSVC++ iostreams library does any kind of Unicode except allowing Unicode file names. All solutions for using Unicode in C++ are effectively pure C solutions, either using the Windows API directly or using nonstandard extensions to the C library.
ThomasMcLeod over 13 years

@Philipp, I agree. That's why I say that wstring is UCS-2 and not UTF-16.
ThomasMcLeod over 13 years

@David: the problem is not with wstring storage, it's with typical wstring usage and UTF-16. You can store UTF-16 in a bitset if you want, but is that using it with UTF-16? Not really.
David Heffernan over 13 years

@thomas what would you use instead of wstring?
Philipp over 13 years

@Thomas: The MSVC++ standard library doesn't support UCS-2 either. Last time I checked, the C++ locales didn't support any Unicode locale, making Unicode output essentially impossible.
Philipp over 13 years

Correction: The MSVC++ library does support UTF-16 and UTF-32 for the types char16_t and char32_t, that would essentially solve the issue for file I/O.
ThomasMcLeod over 13 years

@David: There's no good answer. What to use, I guess, depends on framework, platform, specific I/O requirements, etc. In general, if one must support non-BMP, char32_t and UTF-32 seem safer.
David Heffernan over 13 years

@Thomas No the question is what you use instead of wstring for UTF-16
ThomasMcLeod over 13 years

@David, convert it to UTF-32, then use std::basic_string<char32_t> (std::u32string). Or, in .NET, use System.Text.UTF32Encoding.
ThomasMcLeod over 13 years

@David, unless, of course, you can guarantee BMP; then there's no issue.
David Heffernan over 13 years

@Thomas have you heard of surrogate pairs? UTF-16 is designed to be used with 16-bit code units. Outside the BMP is fine. Are you aware that UTF-16 can encode all Unicode code points?
ThomasMcLeod over 13 years

@David, yes I'm aware. The problem is that many APIs that use wstrings don't know the difference. They interpret surrogate pairs as two 16-bit code points. But since the surrogate pairs are in the invalid range of the BMP, they are ignored.
David Heffernan over 13 years

@Thomas that would be a criticism of the API, but your original point was that wstring is no good for storing UTF-16. Anyway, which APIs are you referring to? I'm curious to know which ones don't support Unicode.
AshleysBrain over 12 years

Late edit in August: turns out @Hans Passant's way is better - edited the answer to use that instead!
Mikhail over 10 years

Why don't you delete converter?
sven over 8 years

"Overload 7 is typically called with its second argument, f, obtained directly from a new-expression: the locale is responsible for calling the matching delete from its own destructor." link
Dmitri Nesteruk over 7 years

Does that new codecvt_utf8 require a corresponding delete?
adprocas over 7 years

This works well. Curious, as I can't find a lot of info on it, and mine works fine without it, what is stream.imbue doing exactly? It seems as though it is setting some type of default type, but is this needed? Also, for first line remark, put your getline in a while(getline(stream, line)) loop to see more than the first line.
MrTux over 7 years

No need to explicitly delete codecvt_utf8. This is done in the destructor of std::locale when the reference count of codecvt_utf8 becomes zero (see en.cppreference.com/w/cpp/locale/locale/%7Elocale)
wp78de over 6 years

Hi. Thanks for sharing. Appreciated. Can you add a bit more context? Why add this answer to a 6-year-old question? Thanks.
Shen Yu over 6 years

I had the same question recently, but I have solved it now, and I want to share my solution to help others.
wp78de over 6 years

That's nice. But how is your answer different from @LihO's answer? You just use a different locale, right?
Felipe Valdes about 5 years

For those using this answer, std::locale::empty() has a problem on clang: error: no member named 'empty' in 'std::__1::locale'.
Peter L over 4 years

Didn't work for me. Ended up using <codecvt> from @LihO
Bob Kline over 3 years

Sadly, all of the useful parts of codecvt have been deprecated since C++17.
ChrisW almost 3 years

I think it won't work -- the file contains UTF-8 not a sequence of wchar_t.