Printing UTF-8 strings with printf - wide vs. multibyte string literals

printf("ο Δικαιοπολις εν αγρω εστιν\n");

prints the string literal (a plain char array; the non-ASCII characters are stored as multibyte UTF-8 sequences). Although you might see the correct output, there are other problems you may run into when working with non-ASCII characters like these. For example:

char str[] = "αγρω";
printf("%zu %zu\n", sizeof(str), strlen(str));

outputs 9 8, since each of these Greek characters occupies two bytes in UTF-8 (the extra byte counted by sizeof is the terminating null).

With the L prefix, the literal consists of wide characters (wchar_t), and the %ls format specifier causes those wide characters to be converted to multibyte characters in the encoding of the current locale (UTF-8 in a UTF-8 locale). Note that the locale must be set appropriately, otherwise this conversion can fail and make the output invalid:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");
    printf("%ls", L"ο Δικαιοπολις εν αγρω εστιν");
    return 0;
}

While some things get more complicated when working with wide characters, others become much simpler and more straightforward. For example:

wchar_t str[] = L"αγρω";
printf("%zu %zu\n", sizeof(str) / sizeof(wchar_t), wcslen(str));

will output 5 4 as one would naturally expect.

Once you decide to work with wide strings exclusively, wprintf can be used to print wide characters directly. It's also worth noting that in the case of the Windows console, the translation mode of stdout should be explicitly set to one of the Unicode modes by calling _setmode:

#include <stdio.h>
#include <wchar.h>

#include <io.h>
#include <fcntl.h>
#ifndef _O_U16TEXT
  #define _O_U16TEXT 0x20000
#endif

int main(void)
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"%ls\n", L"ο Δικαιοπολις εν αγρω εστιν");
    return 0;
}
Author: teppic

I used to write the Linux and Unix column in Personal Computer World, until it closed a couple of years ago. My interest is mainly Linux, open source software and development (mainly C).

Updated on March 23, 2020

Comments

  • teppic
    teppic about 4 years

    In statements like these, where both are entered into the source code with the same encoding (UTF-8) and the locale is set up properly, is there any practical difference between them?

    printf("ο Δικαιοπολις εν αγρω εστιν\n");
    printf("%ls", L"ο Δικαιοπολις εν αγρω εστιν\n");
    

    And consequently is there any reason to prefer one over the other when doing output? I imagine the second performs a fair bit worse, but does it have any advantage (or disadvantage) over a multibyte literal?

    EDIT: There are no issues with these strings printing. But I'm not using the wide string functions, because I want to be able to use printf etc. as well. So the question is are these ways of printing any different (given the situation outlined above), and if so, does the second one have any advantage?

    EDIT2: Following the comments below, I now know that this program works -- which I thought wasn't possible:

    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        wprintf(L"ο Δικαιοπολις εν αγρω εστιν\n");  // wide output
        freopen(NULL, "w", stdout);                 // lets me switch
        printf("ο Δικαιοπολις εν αγρω εστιν\n");    // byte output
    }
    

    EDIT3: I've done some further research by looking at what's going on with the two types. Take a simpler string:

    wchar_t *wides = L"£100 π";
    char *mbs = "£100 π";
    

    The compiler is generating different code. The wide string is:

    .string "\243"
    .string ""
    .string ""
    .string "1"
    .string ""
    .string ""
    .string "0"
    .string ""
    .string ""
    .string "0"
    .string ""
    .string ""
    .string " "
    .string ""
    .string ""
    .string "\300\003"
    .string ""
    .string ""
    .string ""
    .string ""
    .string ""
    

    While the second is:

    .string "\302\243100 \317\200"
    

    And looking at the Unicode encodings, the second is plain UTF-8. The wide character representation is UTF-32. I realise this is going to be implementation-dependent.

    So perhaps the wide character representation of literals is more portable? My system will not print UTF-16/UTF-32 encodings directly, so it is being automatically converted to UTF-8 for output.

    • Adrian McCarthy
      Adrian McCarthy about 11 years
      You said both examples are entered with UTF-8. In the second sample line, if that text is actually UTF-8 rather than a wide encoding, then you probably shouldn't have the L prefix, and therefore you'd just use %s rather than %ls. Or I'm still misunderstanding the question.
    • teppic
      teppic about 11 years
      @AdrianMcCarthy - both strings in the source code are UTF-8, yes. But a string literal is always multibyte -- "A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz". A wide string literal is the same, except prefixed by the letter L." from the standard.
    • DevSolar
      DevSolar about 11 years
      AFAIR, any characters not in the Basic Source Character Set (which is a subset of US-ASCII-7) invoke implementation-defined behaviour, i.e. everything discussed here is effectively depending on the compiler used. If you really want to play it safe (and portable), you would have to resort to \u... and \U...
    • teppic
      teppic about 11 years
      It might well be in the area of implementation. What I'm trying to do is switch to wide character representation all the time, but stick to the regular stdio functions for output, so as not to break compatibility with all the stuff that expects them to work. I'm really just wondering if I should stick with multibyte literals alone (as above) or if there's a reason to use wide literals. It's hard to explain and I'm not doing a very good job!
    • Pavel Radzivilovsky
      Pavel Radzivilovsky about 11 years
      utf8everywhere.org pretty much convinces that usage of L"" should be discouraged, especially on platform Windows.
  • teppic
    teppic about 11 years
    That's me :) wprintf converts to multibyte too, but I'm interested in the standard functions.
  • LihO
    LihO about 11 years
    @teppic: See my answer now. It should finally be more satisfying, I guess :)
  • DevSolar
    DevSolar about 11 years
    UTF-16 is not "wide", and it's really a shame that this bit of myth is still around. There are more than 2^16 Unicode characters, and UTF-16 encodes them with a variable width of either one or two 16-bit code units. If you want "wide", you have to resort to UTF-32. Let's not get into that trap of thinking that n bit should be enough for everybody, again.
  • LihO
    LihO about 11 years
    @DevSolar: I removed the confusing "UTF-16".
  • DevSolar
    DevSolar about 11 years
    Thanks. I'm working on strongly Unicode related stuff professionally, and it's just so sad to see how much half-baked knowledge on the subject is around. UTF-16 is a perfect example: Effectively a multibyte encoding, with embedded zero bytes. It's astonishing how much "Unicode-aware" software can be made to barf with a bit of ancient Greek, some extended CJK or one or two hieroglyphs. Not to mention combining characters and other such niceties. ;-)
  • teppic
    teppic about 11 years
    @DevSolar - I'm impressed you recognised that as ancient Greek (unless it was coincidence) :)
  • teppic
    teppic about 11 years
    @LihO - I agree with the things you said. I ran into problems with functions like strlen a while ago before I knew about wide characters. For anything internal I'd use the wide string functions, but the moment you use a wide string output function on stdout, you can't use any regular ones again -- that's why I'm not using wprintf. I expect the answer is essentially there is no difference, as long as the locale is set and you don't need to process the literal in any way.
  • DevSolar
    DevSolar about 11 years
    @teppic: Coincidence, I admit. I just named a couple of alphabets outside the 16-bit range. As for stdout being "tainted" by wide output, be advised that you can reset the wide orientation via fwide( stdout, -1 ).
  • teppic
    teppic about 11 years
    @DevSolar - fwide can only be used to set the stream's orientation initially; it can't change it once the stream is oriented, unfortunately.
  • DevSolar
    DevSolar about 11 years
    @teppic: Dang... Footnote 287, missed that. Well, you could still use freopen... although that seems a bit heavy-handed.
  • DevSolar
    DevSolar about 11 years
    @teppic: So I missed footnote 287 of the C99 standard, and you missed footnote 232 of it. ;-) I quote: "The primary use of the freopen function is to change the file associated with a standard text stream (stderr, stdin, or stdout), as those identifiers need not be modifiable lvalues to which the value returned by the fopen function may be assigned." With something like freopen( "test", "r", stdin ) you get stdin to read from a file, which is useful for e.g. testing stdin-reading functions.
  • teppic
    teppic about 11 years
    @DevSolar - that's for redirecting the file descriptors to a filename though? You'd call it something like freopen("/tmp/output", "w", stdout); (I want to keep stdout as stdout)
  • DevSolar
    DevSolar about 11 years
    @teppic: "If filename is a null pointer, the freopen function attempts to change the mode of the stream to that specified by mode, as if the name of the file currently associated with the stream had been used. It is implementation-defined which changes of mode are permitted (if any), and under what circumstances." I.e., implementation-defined, but worth a try.
  • teppic
    teppic about 11 years
    @DevSolar: I'm sure I tried that, but I'll try it now - thanks. If it doesn't work I'll post a new question on this specifically. I obviously didn't - it works! On Linux, that is.