Read Unicode file with special characters using std::wifstream

11,228

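Use the imbue() method to install a std::codecvt_utf16 facet, which tells wifstream that the file is encoded as UTF-16. With the std::consume_header mode the facet reads the BOM itself and takes the byte order from it, so you do not have to seekg() past the BOM manually. Open the file in binary mode so that the raw bytes reach the facet untranslated. Note that the <codecvt> header is deprecated since C++17. For example: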

#include <fstream>
#include <string>
#include <locale>
#include <codecvt>

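int main()
{
    // Open in binary mode so the raw UTF-16 bytes reach the facet untranslated
    std::wifstream wif("myfile.txt", std::ios::binary);
    if (!wif.is_open())
        return 1;

    // Apply a BOM-sensitive UTF-16 facet: consume_header reads the BOM
    // and picks the byte order from it
    wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));

    std::wstring wline;
    // Test getline() itself so the loop stops as soon as a read fails
    while (std::getline(wif, wline))
    {
        // Kept from the question: once the line is decoded correctly it contains
        // no L'\0' characters, so this simply copies the line
        std::wstring convert;
        for (auto c : wline)
        {
            if (c != L'\0')
                convert += c;
        }
    }

    wif.close();
    return 0;
}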
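
To check the approach end to end, the sketch below writes the sample line from the question as UTF-16 and reads it back with the same facet. The helper names `read_utf16_lines` and `write_sample` are hypothetical, and the sample assumes a little-endian file with a BOM and Windows-style "\r\n" line endings, which is a guess based on the a\000b\000c\000 bytes seen in the debugger, not something the question confirms.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: returns every line of a BOM-prefixed UTF-16 file.
std::vector<std::wstring> read_utf16_lines(const std::string& path)
{
    std::vector<std::wstring> lines;

    // Binary mode keeps the byte stream intact for the facet
    std::wifstream wif(path, std::ios::binary);
    if (!wif.is_open())
        return lines;

    // consume_header makes the facet read the BOM and follow its byte order
    wif.imbue(std::locale(wif.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));

    std::wstring wline;
    while (std::getline(wif, wline))
    {
        // Assumption: lines may end in "\r\n"; getline() only removes the '\n'
        if (!wline.empty() && wline.back() == L'\r')
            wline.pop_back();
        lines.push_back(wline);
    }
    return lines;
}

// Hypothetical helper: writes "abcæøåabc" as UTF-16LE with a BOM.
void write_sample(const std::string& path)
{
    const unsigned char bytes[] = {
        0xFF, 0xFE,                  // little-endian BOM
        'a', 0, 'b', 0, 'c', 0,
        0xE6, 0, 0xF8, 0, 0xE5, 0,   // æ, ø, å
        'a', 0, 'b', 0, 'c', 0,
        '\r', 0, '\n', 0
    };
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(bytes), sizeof bytes);
}
```

Calling `write_sample("sample.txt")` followed by `read_utf16_lines("sample.txt")` should give back a single line containing all nine characters, including æ, ø and å, with no embedded L'\0' values to filter out.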
Author: Jon Helt-Hansen

Updated on June 04, 2022

Comments

  • Jon Helt-Hansen, almost 2 years ago

    In a Linux environment, I have a piece of code for reading Unicode files, similar to the one shown below.

    However, special characters (like the Danish letters æ, ø and å) are not handled correctly. For the line 'abcæøåabc', the output is simply 'abc'. Using a debugger, I can see that the content of wline is also only a\000b\000c\000.

    #include <fstream>
    #include <string>
    
    std::wifstream wif("myfile.txt");
    if (wif.is_open())
    {
        //set proper position compared to byteorder
        wif.seekg(2, std::ios::beg);
        std::wstring wline;
    
        while (wif.good())
        {
            std::getline(wif, wline);
            if (!wif.eof())
            {
                std::wstring convert;
                for (auto c : wline)
                {
                    if (c != '\0')
                        convert += c;
                }
            }
        }
    }
    wif.close();
    

    Can anyone tell me how I can get it to read the whole line?

    Thanks and regards