Properly print UTF-8 characters in the Windows console
Solution 1
By default the wide print functions on Windows do not handle characters outside the ASCII range.
There are a few ways to get Unicode data to the Windows console:
1. Use the console API directly, WriteConsoleW. You'll have to ensure you're actually writing to a console and use other means when the output is to something else.
2. Set the mode of the standard output file descriptors to one of the 'Unicode' modes, _O_U16TEXT or _O_U8TEXT. This causes the wide character output functions to correctly output Unicode data to the Windows console. If they're used on file descriptors that don't represent a console, they cause the output stream of bytes to be UTF-16 or UTF-8, respectively. N.B. after setting these modes, the non-wide character functions on the corresponding stream are unusable and result in a crash; you must use only the wide character functions.
3. Print UTF-8 text directly to the console by setting the console output codepage to CP_UTF8, if you use the right functions. Most of the higher-level functions, such as basic_ostream<char>::operator<<(char*), don't work this way, but you can either use lower-level functions or implement your own ostream that works around the problem the standard functions have.
The problem with the third method is this:
putchar('\302'); putchar('\260'); // doesn't work with CP_UTF8
puts("\302\260"); // correctly writes UTF-8 data to Windows console with CP_UTF8
Unlike most operating systems, the console on Windows is not simply another file that accepts a stream of bytes. It's a special device created and owned by the program and accessed via its own unique WIN32 API. The issue is that when the console is written to, the API sees exactly the extent of the data passed in that use of its API, and the conversion from narrow characters to wide characters occurs without considering that the data may be incomplete. When a multibyte character is passed using more than one call to the console API, each separately passed piece is seen as an illegal encoding, and is treated as such.
It ought to be easy enough to work around this, but the CRT team at Microsoft views it as not their problem whereas whatever team works on the console probably doesn't care.
You might solve it by implementing your own streambuf subclass which handles the conversion to wchar_t correctly, i.e. one that accounts for the fact that the bytes of a multibyte character may arrive separately, maintaining conversion state (e.g., std::mbstate_t) between writes.
Solution 2
Another trick, instead of SetConsoleOutputCP, would be using _setmode on stdout:
// Includes needed for _setmode()
#include <io.h>
#include <fcntl.h>
int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    const wchar_t *unicode_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
    wprintf(L"%ls", unicode_text); // %ls is the portable conversion for wchar_t*
    return 0;
}
Don't forget to remove the call to SetConsoleOutputCP(CP_UTF8);
Solution 3
// Save as UTF-8 without signature (BOM)
#include <stdio.h>
#include <windows.h>

int main() {
    SetConsoleOutputCP(65001); // CP_UTF8
    const char unicode_text[] = "aäbcdefghijklmnoöpqrsßtuüvwxyz";
    printf("%s\n", unicode_text);
}
Result:
aäbcdefghijklmnoöpqrsßtuüvwxyz
Solution 4
I had similar problems, but none of the existing answers worked for me. Something else I observed is that, if I stick UTF-8 characters in a plain string literal, they print properly, but if I use a UTF-8 literal (u8"text"), the characters get butchered by the compiler (proved by printing out their numeric values one byte at a time; the plain literal had the correct UTF-8 bytes, as verified on a Linux machine, but the u8 literal was garbage).
After some poking around, I found the solution: the /utf-8 compiler flag. With that, everything just works: my sources are UTF-8, I can use explicit UTF-8 literals, and output works with no other changes needed.
Solution 5
The console can be set to display UTF-8 characters: as @vladasimovic's answer shows, SetConsoleOutputCP(CP_UTF8) can be used for that. Alternatively, you can prepare your console with the DOS command chcp 65001, or with the system call system("chcp 65001 > nul") in the main program. Don't forget to save the source code as UTF-8 as well.
To check the UTF-8 support, run
#include <stdio.h>
#include <windows.h>

BOOL CALLBACK showCPs(LPTSTR cp) {
    puts(cp);
    return TRUE;
}

int main() {
    EnumSystemCodePages(showCPs, CP_SUPPORTED);
}
65001
should appear in the list.
The Windows console uses OEM codepages by default, and most default raster fonts support only national characters. Windows XP and newer also support TrueType fonts, which should display the missing characters (@Devenec suggests Lucida Console in his answer).
Why printf fails
As @bames53 points out in his answer, the Windows console is not a stream device; you need to write all bytes of a multibyte character at once. Sometimes printf messes up the job, putting the bytes into the output buffer one by one. Try using sprintf and then puts on the result, or force fflush only on the accumulated output buffer.
If everything fails
Note the UTF-8 format: one character is encoded as 1-4 bytes. Use this function to shift to the next character in the string:
const char* ucshift(const char* str, int len = 1) {
    for(int i = 0; i < len; ++i) {
        if(*str == 0) return str;
        if(*str < 0) { // lead byte of a multibyte character (assumes signed char)
            unsigned char c = *str;
            while((c <<= 1) & 128) ++str; // skip one position per continuation byte
        }
        ++str;
    }
    return str;
}
...and this function to transform the bytes into a Unicode code point:
int ucchar(const char* str) {
    if(!(*str & 128)) return *str; // ASCII fast path
    unsigned char c = *str, bytes = 0;
    while((c <<= 1) & 128) ++bytes; // count continuation bytes
    int result = 0;
    // collect 6 data bits from each continuation byte
    for(int i = bytes; i > 0; --i) result |= (*(str + i) & 127) << (6 * (bytes - i));
    // build a mask for the data bits of the lead byte
    int mask = 1;
    for(int i = bytes; i < 6; ++i) mask <<= 1, mask |= 1;
    result |= (*str & mask) << (6 * bytes);
    return result;
}
Then you can try a conversion function such as MultiByteToWideChar (a regular WinAPI call that ignores the CRT locale) or the CRT's mbstowcs (don't forget to call setlocale() before that one!), or you can use your own mapping from the Unicode table to your active working codepage. Example:
int main() {
    system("chcp 65001 > nul");
    char str[] = "příšerně"; // file saved as UTF-8
    for(const char* p = str; *p != 0; p = ucshift(p)) {
        int c = ucchar(p);
        if(c < 128) printf("%c\n", c);
        else printf("%d\n", c);
    }
}
This should print
p
345
237
353
e
r
n
283
If your codepage doesn't support those Czech characters, you could map 345=>r, 237=>i, 353=>s, 283=>e. There are at least 5(!) different charsets just for Czech. Displaying readable characters under different Windows locales is a horror.

rsk82
Updated on July 09, 2022

Comments
-
rsk82 6 months
This is the way I try to do it:
#include <stdio.h>
#include <windows.h>
using namespace std;
int main() {
    SetConsoleOutputCP(CP_UTF8); // german chars won't appear
    char const* text = "aäbcdefghijklmnoöpqrsßtuüvwxyz";
    int len = MultiByteToWideChar(CP_UTF8, 0, text, -1, 0, 0);
    wchar_t *unicode_text = new wchar_t[len];
    MultiByteToWideChar(CP_UTF8, 0, text, -1, unicode_text, len);
    wprintf(L"%s", unicode_text);
}
And the effect is that only US-ASCII chars are displayed. No errors are shown. The source file is encoded in UTF-8.
So, what am I doing wrong here?
to WouterH:
int main() {
    SetConsoleOutputCP(CP_UTF8);
    const wchar_t *unicode_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
    wprintf(L"%s", unicode_text);
}
- this also doesn't work. The effect is just the same. My font is of course Lucida Console.
third take:
#include <stdio.h>
#define _WIN32_WINNT 0x05010300
#include <windows.h>
#define _O_U16TEXT 0x20000
#include <fcntl.h>
using namespace std;
int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    const wchar_t *u_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
    wprintf(L"%s", u_text);
}
ok, something begins to work, but the output is: ańbcdefghijklmno÷pqrs▀tuŘvwxyz. -
rsk82 over 10 years
'_O_U16TEXT' was not declared in this scope :( -
huysentruitw over 10 years: it is defined like this:
#define _O_U16TEXT 0x20000 /* file mode is UTF16 no BOM (translated) */
in fcntl.h. But you should tell us more about which compiler you are using. -
rsk82 over 10 years: I use MinGW (it is in the tags to my question) and indeed, there is
#define _O_U16TEXT 0x20000
in that file. -
rsk82 over 10 years: Ok, I managed to compile this but the program returns:
ańbcdefghijklmno÷pqrs▀tuŘvwxyz
-
bames53 over 10 years: @rsk82 to get this to work, the wide character string must be correctly encoded. Verify that the non-ASCII characters in the wide character string are getting the correct values at runtime. It looks like something is being misinterpreted as CP852.
-
ariestav over 7 years: This worked in my case after hassling with Windows for about two hours. Thank you!
-
jfs about 7 years: cp65001 has bugs, e.g.
putchar('\302'); putchar('\260');
fails but
puts("\302\260");
works. -
Slaus almost 7 years: It's not working. I get something like: "aäbcdefghijklmnoöpqrsßtuüvwxyz".
-
Nicolas Raoul almost 7 years: I get "identifier UTF_8 is undefined" even with the same includes as you.
-
John Leidegren over 4 years: @Slav did you save the file with UTF-8 encoding and not the default ANSI encoding? It matters; you can ensure that the string is properly UTF-8 encoded regardless by using UTF-8 byte sequences, like "\xc3\x85". -
Slaus over 4 years: @JohnLeidegren yes, it was UTF-8. And I tried with BOM and without BOM.
-
John Leidegren over 4 years: @Slav something is up on your end. I can reproduce the issue and I can break it in any number of ways, but I can also get it to work.
-
KeyC0de over 4 years: This is the only thing I've found on the internet that outputs Unicode text on Windows (outputs Greek as well). Doesn't even need system("chcp 65001"). After ~3 hours of searching this works. Thanks! Now I need to also learn how to output Unicode UTF-8 text from a file. The Torture never stops. -
ExpertNoob1 almost 4 years: This is the only solution that actually worked for me. Great answer.
-
Miscreant over 3 years: "You'll have to ensure you're actually writing to a console and use other means when the output is to something else." Can you elaborate on that?