Handling special characters in C (UTF-8 encoding)

Solution 1

First things first:

  1. Read in the buffer
  2. Use libiconv or similar to convert the UTF-8 bytes to wchar_t, then use the wide character handling functions such as wprintf()
  3. Stick to the wide character functions in C! Most file/output handling functions have a wide-character variant
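
Step 2 can be sketched roughly like this. Note that this uses the standard mbstowcs() rather than libiconv, and to_wide is just a name made up for the example. It assumes the program has already called setlocale(LC_CTYPE, "") and that the environment's locale is a UTF-8 one; in the default "C" locale only ASCII input is likely to convert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated multibyte string (UTF-8 when
 * the active LC_CTYPE locale is a UTF-8 one) into wide characters.
 * Returns the number of wide characters written, not counting the
 * terminator, or (size_t)-1 if the input is invalid for the current locale
 * or does not fit into dst. */
size_t to_wide(const char *src, wchar_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL && cap > 0);

    size_t n = mbstowcs(dst, src, cap);
    if (n == (size_t)-1)
        return n;              /* invalid byte sequence */
    if (n == cap)
        return (size_t)-1;     /* no room left for the terminating L'\0' */
    return n;
}
```

After a successful call the result can be printed with wprintf(L"%ls\n", dst).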

Ensure that your terminal can handle UTF-8 output. Having the correct locale set up and using the locale data can automate a lot of the file opening and conversion for you ... depending on what you are doing.

Remember that the width of a code point or character in UTF-8 is variable. This means you can't just seek to a byte and begin reading like with ASCII ... because you might land in the middle of a code point. A good library can detect this and skip ahead to the next code point boundary for you.
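
To make the variable width concrete, here is a small sketch (function names are made up for illustration). It relies only on the UTF-8 bit patterns: continuation bytes always look like 10xxxxxx, so after seeking to an arbitrary byte you can skip forward until you hit something else. It does not validate the data beyond that.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by `lead`, or 0 if
 * `lead` cannot start a sequence (continuation byte or invalid pattern). */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Move `pos` forward past any continuation bytes (10xxxxxx) so that it
 * points at the start of a code point, or at `len` if none remains. */
size_t utf8_sync_forward(const unsigned char *buf, size_t len, size_t pos)
{
    assert(buf != NULL && pos <= len);

    while (pos < len && (buf[pos] & 0xC0) == 0x80)
        pos++;
    return pos;
}
```

For the bytes 'a', C3, 86, 'b' (that is, "aÆb"), landing on index 2 puts you inside Æ, and utf8_sync_forward moves you on to index 3.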

Here is some code (not mine) that demonstrates some usage of UTF-8 file reading and wide character handling in C.

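#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* "ccs=UTF-8" is a non-standard fopen() extension (documented for
       Microsoft's C runtime); support and exact spelling vary elsewhere. */
    FILE *f = fopen("data.txt", "r, ccs=UTF-8");
    if (!f)
        return 1;

    /* Print each decoded character as a hexadecimal code point. */
    for (wint_t c; (c = fgetwc(f)) != WEOF;)
        printf("%04lX\n", (unsigned long)c);

    fclose(f);
    return 0;
}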

Links

  1. libiconv
  2. Locale data in C/GNU libc
  3. Some handy info
  4. Another good Unicode/UTF-8 in C resource

Solution 2

Make sure you're not accidentally dropping any bytes; some UTF-8 characters are more than one byte in length (that's sort of the point), and you need to keep them all.

It can be useful to print the contents of the buffer as hex, so you can inspect which bytes are actually read:

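/* Print each byte of the buffer as two hex digits. */
static void print_buffer(const char *buffer, size_t length)
{
  size_t i;

  /* Go through unsigned char first: a plain (signed) char above 0x7f would
     otherwise be sign-extended and print as ffffffc3 instead of c3. */
  for(i = 0; i < length; i++)
    printf("%02x ", (unsigned int) (unsigned char) buffer[i]);
  putchar('\n');
}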

You can do this after loading a very short file, containing just a few characters.

Also make sure the terminal is set to the proper encoding, so it interprets your characters as UTF-8.

Solution 3

Probably your text file is ISO-8859-1 encoded but your terminal is UTF-8. This kind of mismatch is a standard problem when dealing with byte-oriented text handling; other C programs (such as the standard ‘cat’ and ‘more’ commands) will do the same thing and it isn't generally considered an error or something that needs to be fixed.
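
If you do decide to convert rather than leave the bytes alone, ISO-8859-1 is the easy case, because its byte values are the same as the first 256 Unicode code points. A rough sketch, assuming the input really is ISO-8859-1 (latin1_to_utf8 is a made-up name):

```c
#include <assert.h>
#include <stddef.h>

/* Convert `len` bytes of ISO-8859-1 into UTF-8.
 * `dst` must have room for 2 * len bytes (the worst case); no terminator
 * is written. Returns the number of bytes stored in `dst`. */
size_t latin1_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    assert(src != NULL && dst != NULL);

    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[out++] = c;                                   /* ASCII: unchanged */
        } else {
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte: 110000xx */
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation: 10xxxxxx */
        }
    }
    return out;
}
```

With this, Æ (C6 in ISO-8859-1) becomes the two bytes C3 86, which is what a UTF-8 terminal expects.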

If you want to operate on a Unicode character level instead of bytes that's fine, but you'll need to use wchar_t as your character type instead of char throughout your program, and provide switches for the user to specify what the incoming file encoding actually is. (Whilst it is sometimes possible to guess, it's not very reliable.)
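
As an illustration of what "guessing" can look like, here is a sketch of a structural UTF-8 check (looks_like_utf8 is a made-up name). It is a heuristic only: plain ASCII passes, since that is valid in both encodings, some ill-formed sequences (certain overlong forms, surrogates) are not rejected, and a file that fails the check is merely "not UTF-8", not necessarily ISO-8859-1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if buf[0..len) has the byte structure of UTF-8:
 * every lead byte is followed by the right number of 10xxxxxx bytes. */
bool looks_like_utf8(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;                       /* continuation bytes expected */

        if (c < 0x80)                    extra = 0;
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if ((c & 0xF0) == 0xE0)     extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return false;                  /* stray continuation or invalid lead */

        if (extra > len - i - 1)
            return false;                   /* sequence cut off at end of buffer */
        for (size_t k = 1; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;               /* expected 10xxxxxx */
        i += extra + 1;
    }
    return true;
}
```

A user-supplied switch should still win over this kind of check whenever one is given.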

Solution 4

I don't know if it will help, but if you're sure that the terminal and the input file use the same encoding, you can try calling setlocale() at the start of your program:

#include <locale.h>
…
setlocale(LC_CTYPE, "");
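
A slightly fuller sketch of the same idea, which also reports whether the locale picked up from the environment is a UTF-8 one. It assumes a POSIX system (nl_langinfo() comes from <langinfo.h>, not from standard C), and locale_is_utf8 is a made-up name.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Adopt the locale named by the environment (LANG, LC_ALL, ...) for
 * character handling. Returns 1 if its encoding is UTF-8, 0 if it is
 * something else, and -1 if the requested locale could not be set. */
int locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return -1;

    const char *codeset = nl_langinfo(CODESET);
    assert(codeset != NULL);
    return strcmp(codeset, "UTF-8") == 0;
}
```

If this returns 0 or -1, fixing the environment (for example the LANG variable) is probably a better first step than changing the program.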
Author by

o01

Updated on July 09, 2022

Comments

  • o01
    o01 almost 2 years

    I'm writing a small application in C that reads a simple text file and then outputs the lines one by one. The problem is that the text file contains special characters like Æ, Ø and Å among others. When I run the program in a terminal, those characters are printed as "?".

    Is there an easy fix?