Convert QString into QByteArray with either UTF-8 or Latin1 encoding

Things to know:

  • execution character set

The C++ standard defines the execution character set: the encoding the compiler uses for string and character literals in the binary it produces. You can read about it in subsection 1.1 Character sets of section 1 Overview in The C Preprocessor's Manual on the http://gcc.gnu.org site.

Question:
What will be produced as a result of "\u00fc" string literal?

Answer:
It depends on what the execution character set is. With gcc (which is what you're using) it is UTF-8 by default, unless you specify something different with the -fexec-charset option. You can read about this and other options controlling the preprocessing phase in subsection 3.11 Options Controlling the Preprocessor of section 3 GCC Command Options in GCC's Manual on the http://gcc.gnu.org site. Now that we know the execution character set is UTF-8, we know that "\u00fc" is translated to the UTF-8 encoding of the Unicode code point U+00FC, which is the two-byte sequence 0xc3 0xbc.
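To make the two bytes concrete, here is a minimal sketch of the UTF-8 encoding rule in plain C++ (no Qt). The helpers `encodeUtf8` and `toHex` are hypothetical names introduced only for this illustration; they are not part of Qt or GCC, and the encoder assumes its argument is a valid code point.

```cpp
#include <cassert>   // only needed for the assert-based check in the note below
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (no surrogate or range checks, for brevity).
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: render bytes as lowercase hex, e.g. "c3bc".
std::string toHex(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string hex;
    for (unsigned char b : bytes) {
        hex += digits[b >> 4];
        hex += digits[b & 0x0F];
    }
    return hex;
}
```

With these helpers, `assert(toHex(encodeUtf8(0x00FC)) == "c3bc");` holds, which matches the byte sequence the compiler emits for "\u00fc" when the execution character set is UTF-8.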

The QString constructor taking char * calls QString QString::fromAscii ( const char * str, int size = -1 ). That function uses the codec set with void QTextCodec::setCodecForCStrings ( QTextCodec * codec ) if one has been set; otherwise it does the same thing as QString QString::fromLatin1 ( const char * str, int size = -1 ).

Question:
What codec will QString's constructor use to decode the two-byte sequence (0xc3 0xbc) it gets?

Answer:
By default no codec is set with QTextCodec::setCodecForCStrings(), so Latin1 is used to decode the byte sequence. As 0xc3 and 0xbc are both valid in Latin1, representing Ã and ¼ respectively (this should already be familiar to you, as it was taken directly from this answer to your earlier question), we get a QString with these two characters.
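The Latin1 decoding step can be sketched in a few lines of plain C++. `decodeLatin1` is a hypothetical helper written for this illustration, not a Qt function; it only relies on the fact that in Latin1 each byte value equals the code point it stands for.

```cpp
#include <cassert>   // only needed for the assert-based check in the note below
#include <string>
#include <vector>

// Hypothetical helper: decode bytes as Latin1 (ISO 8859-1).
// Every byte value is also the code point value, so decoding never fails.
std::vector<char32_t> decodeLatin1(const std::string& bytes) {
    std::vector<char32_t> codepoints;
    for (unsigned char b : bytes) {
        codepoints.push_back(b);
    }
    return codepoints;
}
```

Feeding it the two bytes from the literal gives two code points, not one: `assert((decodeLatin1("\xc3\xbc") == std::vector<char32_t>{0xC3, 0xBC}));`. That is the "c3 bc" pair the test program prints for QString::QString(char const *) when no codec is set.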

You shouldn't use the QDebug class to output anything outside of ASCII; there is no guarantee about what you will get.

Test program:

#include <QtCore>

void dbg(char const * rawInput, QString s) {

    QString codepoints;
    foreach(QChar chr, s) {
        codepoints.append(QString::number(chr.unicode(), 16)).append(" ");
    }

    qDebug() << "Input: " << rawInput
             << ", "
             << "Unicode codepoints: " << codepoints;
}

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    qDebug() << "system name:"
             << QLocale::system().name();

    for (int i = 1; i <= 5; ++i) {

        switch(i) {

        case 1:
            qDebug() << "\nWithout codecForCStrings (default is Latin1)\n";
            break;
        case 2:
            qDebug() << "\nWith codecForCStrings set to UTF-8\n";
            QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
            break;
        case 3:
            qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to UTF-8\n";
            QTextCodec::setCodecForCStrings(0);
            QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
            break;
        case 4:
            qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to Latin1\n";
            QTextCodec::setCodecForCStrings(0);
            QTextCodec::setCodecForLocale(QTextCodec::codecForName("Latin1"));
            break;
        }

        qDebug() << "codecForCStrings:" << (QTextCodec::codecForCStrings()
                                           ? QTextCodec::codecForCStrings()->name()
                                           : "NOT SET");
        qDebug() << "codecForLocale:"   << (QTextCodec::codecForLocale()
                                           ? QTextCodec::codecForLocale()->name()
                                           : "NOT SET");

        qDebug() << "\n1. Using QString::QString(char const *)";
        dbg("\\u00fc", QString("\u00fc"));
        dbg("\\xc3\\xbc", QString("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString("ü"));

        qDebug() << "\n2. Using QString::fromUtf8(char const *)";
        dbg("\\u00fc", QString::fromUtf8("\u00fc"));
        dbg("\\xc3\\xbc", QString::fromUtf8("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromUtf8("ü"));

        qDebug() << "\n3. Using QString::fromLocal8Bit(char const *)";
        dbg("\\u00fc", QString::fromLocal8Bit("\u00fc"));
        dbg("\\xc3\\xbc", QString::fromLocal8Bit("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromLocal8Bit("ü"));
    }

    return app.exec();
}

Output on mingw 4.4.0 on Windows XP:

system name: "pl_PL"

Without codecForCStrings (default is Latin1)

codecForCStrings: "NOT SET"
codecForLocale: "System"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "102 13d "
Input:  \xc3\xbc ,  Unicode codepoints:  "102 13d "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

With codecForCStrings set to UTF-8

codecForCStrings: "UTF-8"
codecForLocale: "System"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "102 13d "
Input:  \xc3\xbc ,  Unicode codepoints:  "102 13d "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

Without codecForCStrings (default is Latin1), with codecForLocale set to UTF-8

codecForCStrings: "NOT SET"
codecForLocale: "UTF-8"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

Without codecForCStrings (default is Latin1), with codecForLocale set to Latin1

codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
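The "fffd" entries in the output are consistent with the literal ü in the source file having been stored as the single byte 0xfc. This is my reading of the output (Latin1 decoding of the same literal gives "fc"), not something the program prints directly. A lone 0xfc is not valid UTF-8, so a UTF-8 decoder substitutes U+FFFD. Below is a simplified sketch of such a decoder in plain C++; `decodeUtf8` is a hypothetical helper, it does not reject overlong forms or surrogates, and it is not how Qt implements fromUtf8.

```cpp
#include <cassert>   // only needed for the assert-based checks in the note below
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified UTF-8 decoder: malformed or truncated sequences
// yield U+FFFD (REPLACEMENT CHARACTER). Overlong forms are not rejected.
std::vector<char32_t> decodeUtf8(const std::string& bytes) {
    std::vector<char32_t> codepoints;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        char32_t cp = 0;
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else { codepoints.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        bool ok = (i + extra < bytes.size());  // enough bytes left?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80) ok = false;          // not 10xxxxxx
            else cp = (cp << 6) | (cont & 0x3F);
        }
        if (ok) { codepoints.push_back(cp); i += extra + 1; }
        else    { codepoints.push_back(0xFFFD); ++i; }
    }
    return codepoints;
}
```

With this sketch, `decodeUtf8("\xc3\xbc")` yields the single code point 0xFC, while `decodeUtf8("\xfc")` yields 0xFFFD, mirroring the "fc" and "fffd" lines under QString::fromUtf8 above.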

I'd like to thank thiago, cbreak, peppe and heinz from the #qt IRC channel on freenode.org for pointing out the issues involved here and helping me understand them.

Author: Johan

Updated on July 13, 2022

Comments

  • Johan

    I would like to convert a QString into either a UTF-8 or a Latin1 QByteArray, but at the moment I get everything as UTF-8.

    I am testing this with a character from the upper range of Latin1 (above 0x7f), where the German ü is a good example.

    If I do this:

    QString name("\u00fc"); // U+00FC = ü
    QByteArray utf8;
    utf8.append(name);
    qDebug() << "utf8" << name << utf8.toHex();
    
    QByteArray latin1;
    latin1.append(name.toLatin1());
    qDebug() << "Latin1" << name << latin1.toHex();
    
    QTextCodec *codec = QTextCodec::codecForName("ISO 8859-1");
    QByteArray encodedString = codec->fromUnicode(name);
    qDebug() << "ISO 8859-1" << name << encodedString.toHex();
    

    I get the following output.

    utf8 "ü" "c3bc" 
    Latin1 "ü" "c3bc" 
    ISO 8859-1 "ü" "c3bc" 
    

    As you can see, I get the UTF-8 bytes 0xc3bc everywhere, whereas I would expect the Latin1 byte 0xfc for steps 2 and 3.

    My guess is that I should get something like this:

    utf8 "ü" "c3bc" 
    Latin1 "ü" "fc" 
    ISO 8859-1 "ü" "fc" 
    

    What is going on here?

    /Thanks


    Links to some character tables:


    This code was built and executed on an Ubuntu 10.04 based system.

    $> uname -a
    Linux frog 2.6.32-28-generic-pae #55-Ubuntu SMP Mon Jan 10 22:34:08 UTC 2011 i686 GNU/Linux
    $> env | grep LANG
    LANG=en_US.utf8
    

    And if I try to use

    utf8.append(name.toUtf8());
    

    I get this output

    utf8 "ü" "c383c2bc" 
    Latin1 "ü" "c3bc" 
    ISO 8859-1 "ü" "c3bc" 
    

    So the Latin1 output is UTF-8 and the UTF-8 output is double encoded...

    This must depend on some system settings?
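The doubled bytes can be reproduced without Qt. The sketch below, in plain C++, decodes the UTF-8 bytes of ü as Latin1 and then encodes the resulting code points as UTF-8 again. `doubleEncode` and `toHex` are hypothetical helpers written for this illustration only; the encoder handles just the one- and two-byte cases that Latin1 code points need.

```cpp
#include <cassert>   // only needed for the assert-based check in the note below
#include <string>

// Hypothetical helper: treat each input byte as a Latin1 code point
// (0x00..0xFF) and re-encode it as UTF-8. Applied to bytes that are
// already UTF-8, this produces the "double encoded" result.
std::string doubleEncode(const std::string& utf8Bytes) {
    std::string out;
    for (unsigned char b : utf8Bytes) {
        if (b < 0x80) {
            out += static_cast<char>(b);                    // ASCII stays as is
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));      // 110xxxxx
            out += static_cast<char>(0x80 | (b & 0x3F));    // 10xxxxxx
        }
    }
    return out;
}

// Hypothetical helper: render bytes as lowercase hex.
std::string toHex(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string hex;
    for (unsigned char b : bytes) {
        hex += digits[b >> 4];
        hex += digits[b & 0x0F];
    }
    return hex;
}
```

Here `toHex(doubleEncode("\xc3\xbc"))` is "c383c2bc", the same bytes as in the utf8 line above: 0xc3 becomes c3 83 and 0xbc becomes c2 bc.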


    If I run this (could not get the .name() to build)

    qDebug() << "system name:"      << QLocale::system().name();
    qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings();
    qDebug() << "codecForLocale:"   << QTextCodec::codecForLocale()->name();
    

    Then I get this:

    system name: "en_US" 
    codecForCStrings: 0x0 
    codecForLocale: "System" 
    

    Solution

    If I specify that I am using UTF-8, so that the different classes know about it, then it works.

    QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
    QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
    
    qDebug() << "system name:"      << QLocale::system().name();
    qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings()->name();
    qDebug() << "codecForLocale:"   << QTextCodec::codecForLocale()->name();
    
    QString name("\u00fc"); 
    QByteArray utf8;
    utf8.append(name);
    qDebug() << "utf8" << name << utf8.toHex();
    
    QByteArray latin1;
    latin1.append(name.toLatin1());
    qDebug() << "Latin1" << name << latin1.toHex();
    
    QTextCodec *codec = QTextCodec::codecForName("latin1");
    QByteArray encodedString = codec->fromUnicode(name);
    qDebug() << "ISO 8859-1" << name << encodedString.toHex();
    

    Then I get this output:

    system name: "en_US" 
    codecForCStrings: "UTF-8" 
    codecForLocale: "UTF-8" 
    utf8 "ü" "c3bc" 
    Latin1 "ü" "fc" 
    ISO 8859-1 "ü" "fc" 
    

    And that looks as it should.
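Once the string really holds the single code point U+00FC, the Latin1 result is the single byte 0xfc. A minimal sketch of that encoding step in plain C++ follows. `encodeLatin1` is a hypothetical helper, and replacing code points above 0xFF with '?' is a choice made for this example, not a statement about what QString::toLatin1 guarantees.

```cpp
#include <cassert>   // only needed for the assert-based check in the note below
#include <string>
#include <vector>

// Hypothetical helper: encode code points as Latin1 (ISO 8859-1).
// Code points above 0xFF have no Latin1 byte; this sketch substitutes '?'.
std::string encodeLatin1(const std::vector<char32_t>& codepoints) {
    std::string out;
    for (char32_t cp : codepoints) {
        out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
    }
    return out;
}
```

For example, `assert(encodeLatin1(std::vector<char32_t>{0xFC}) == "\xfc");` holds, which corresponds to the "fc" shown on the Latin1 and ISO 8859-1 lines above.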