How to read UTF-8 encoded text file using std::ifstream?
Encoding "ABC가나다" using UTF-8 should give you
"\x41\x42\x43\xEA\xB0\x80\xEB\x82\x98\xEB\x8B\xA4"
so the content of the file you read is correct. The problem is with the encoding of your source file. Non-ASCII characters in an ordinary string literal are not portable: the bytes you get depend on how the compiler interprets the source file and on its execution character set. Prefix the literal with u8 to get a UTF-8 literal:
u8"ABC가나다"
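If you would rather verify the byte sequence yourself than trust an online tool, here is a minimal sketch of a UTF-8 encoder. The helper names (encode_code_point, encode_utf8, to_hex) are made up for illustration, and the input is spelled with universal character names so that the result does not depend on how the source file is saved.

```cpp
#include <cassert>
#include <string>

// Encode a single Unicode code point as UTF-8.
// Hypothetical helper: it rejects values above U+10FFFF but does not
// check for surrogate code points.
std::string encode_code_point(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode a whole UTF-32 string as UTF-8.
std::string encode_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) out += encode_code_point(cp);
    return out;
}

// Format bytes as space-separated uppercase hex, e.g. "41 42 43".
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

Calling to_hex(encode_utf8(U"ABC\uAC00\uB098\uB2E4")) yields 41 42 43 EA B0 80 EB 82 98 EB 8B A4, which matches the bytes read from the file.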
At this point I assume you are using Windows, since encoding problems like this are much less common on other systems. You will have to change your terminal's character set to UTF-8:
chcp 65001
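The same switch can probably be made from inside the program instead of typing chcp each time. The sketch below assumes the Win32 function SetConsoleOutputCP from windows.h; the wrapper name enable_utf8_console is made up for illustration, and on other platforms it does nothing. Whether the glyphs actually render also depends on the console font.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Switch the console's output code page to UTF-8 (code page 65001) on Windows.
// On other platforms this is a no-op that reports success.
// Returns true if the switch succeeded or was not needed.
bool enable_utf8_console() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}

// UTF-8 bytes of the test string, written with escapes so the literal
// does not depend on the encoding of this source file.
const std::string kUtf8Sample =
    "\x41\x42\x43\xEA\xB0\x80\xEB\x82\x98\xEB\x8B\xA4";
```

Call enable_utf8_console() once at the start of main, before the first std::cout << kUtf8Sample.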
What is happening in your case is that you are reading UTF-8 text from a file into a string, then printing it to a non-Unicode terminal, which is unable to show it as you expect. When you print your string literal, you are printing a non-Unicode byte sequence (B0 A1 B3 AA B4 D9 is the EUC-KR/CP949 encoding of 가나다), but that sequence's encoding matches your terminal's encoding, so you see what you expected.
PS: I used https://mothereff.in/utf-8 to get the UTF-8 representation of your string in hex.
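To confirm that std::ifstream itself is not at fault, you can compare the bytes it reads against a reference written with hex escapes. This is only a sketch: read_bytes, kExpectedUtf8 and file_holds_expected_utf8 are names made up for illustration, and the file is opened in binary mode so that no newline translation interferes.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file into a std::string without any character conversion.
// Returns an empty string if the file cannot be opened.
std::string read_bytes(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}

// UTF-8 bytes of the test string, spelled with escapes so the value
// does not depend on how this source file is saved.
const std::string kExpectedUtf8 =
    "\x41\x42\x43\xEA\xB0\x80\xEB\x82\x98\xEB\x8B\xA4";

// True if the file contains exactly the expected UTF-8 bytes.
bool file_holds_expected_utf8(const std::string& path) {
    return read_bytes(path) == kExpectedUtf8;
}
```

If file_holds_expected_utf8("test.txt") returns true, the string already holds valid UTF-8 and can be passed as const char* to libraries that expect UTF-8. Only the terminal output needs fixing.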
JaeJun LEE
Updated on June 12, 2022
I'm having a hard time parsing an XML file.
The file was saved with UTF-8 encoding.
Plain ASCII characters are read correctly, but Korean characters are not.
So I made a simple program to read a UTF-8 text file and print the content.
Text File(test.txt)
ABC가나다
Test Program
#include <fstream>
#include <iostream>
#include <string>
#include <iterator>
#include <streambuf>

const char* hex(char c)
{
    const char REF[] = "0123456789ABCDEF";
    static char output[3] = "XX";
    output[0] = REF[0x0f & c>>4];
    output[1] = REF[0x0f & c];
    return output;
}

int main()
{
    std::cout << "File(ifstream) : ";
    std::ifstream file("test.txt");
    std::string buffer((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
    for (auto c : buffer)
    {
        std::cout << hex(c) << " ";
    }
    std::cout << std::endl;
    std::cout << buffer << std::endl;

    //String literal
    std::string str = "ABC가나다";
    std::cout << "String literal : ";
    for (auto c : str)
    {
        std::cout << hex(c) << " ";
    }
    std::cout << std::endl;
    std::cout << str << std::endl;
    return 0;
}
Output
File(ifstream) : 41 42 43 EA B0 80 EB 82 98 EB 8B A4
ABC媛?섎떎
String literal : 41 42 43 B0 A1 B3 AA B4 D9
ABC가나다
The output shows that the characters are encoded differently in the string literal and in the file.
As far as I know, in C++ char strings are encoded in UTF-8, so we can see them through printf or cout. So their bytes were supposed to be the same, but they were actually different. Is there any way to read a UTF-8 text file using std::ifstream?
I succeeded in parsing the XML file using std::wifstream, following this article. But most of the libraries I'm using support only const char* strings, so I'm looking for another way that uses std::ifstream.
I've also read this article, which says not to use wchar_t and that treating char strings as multi-byte characters is sufficient.
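Treating a char string as multi-byte UTF-8 mostly means remembering that one character may span several bytes. As a rough illustration (count_code_points is a name made up here, and the input is assumed to be valid UTF-8), code points can be counted by skipping continuation bytes:

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new code point. Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the 12 bytes read from test.txt this gives 6, while std::string::size() still reports 12.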