QString to unicode std::string

11,184

Solution 1

The below applies to Qt 5. Qt 4's behavior was different and, in practice, broken.

You need to choose:

- Which string type you want: the 8-bit wide std::string, std::wstring (16 or 32 bits wide, depending on the platform), or some other type.
- Which encoding the target string should use.

Internally, QString stores UTF-16 encoded data, so any Unicode code point may be represented in one or two QChars.
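To make the "one or two QChars" point concrete, here is a minimal standalone sketch of how a single code point maps to UTF-16 code units. It uses only the C++ standard library; the names `encodeUtf16` and `exampleEncodeUtf16` are made up for illustration and are not part of Qt.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a Qt API): encode one Unicode code point as UTF-16.
// Assumes cp is a valid scalar value (at most 0x10FFFF and not a surrogate).
std::u16string encodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: a high surrogate followed by a low surrogate.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}

// Example checks; call from your own main().
void exampleEncodeUtf16()
{
    assert(encodeUtf16(U'A').size() == 1);    // U+0041 fits in one code unit
    assert(encodeUtf16(0x1F600).size() == 2); // U+1F600 needs a surrogate pair
}
```

This is why a QString's size() counts code units, which is not necessarily the number of code points.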

Common cases:

Locally encoded 8-bit std::string (as in: system locale):
```
std::string(str.toLocal8Bit().constData())
```

UTF-8 encoded 8-bit std::string:

str.toStdString()

This is equivalent to:

std::string(str.toUtf8().constData())

UTF-16 or UCS-4 encoded std::wstring, 16- or 32 bits wide, respectively. The selection of 16- vs. 32-bit encoding is done by Qt to match the platform's width of wchar_t.
```
str.toStdWString()
```
U16 or U32 strings of C++11 - from Qt 5.5 onwards:
```
str.toStdU16String()
str.toStdU32String()
```
UTF-16 encoded 16-bit std::u16string - this hack is only needed up to Qt 5.4:
```
std::u16string(reinterpret_cast<const char16_t*>(str.constData()))
```
This encoding does not include byte order marks (BOMs).

It's easy to prepend BOMs to the QString itself before converting it:

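QString src = ...;
src.prepend(QChar::ByteOrderMark);
#if QT_VERSION < QT_VERSION_CHECK(5,5,0)
// Parentheses rather than braces: src.size() is a signed int, and
// brace-initialization would reject the narrowing conversion to size_t.
auto dst = std::u16string(reinterpret_cast<const char16_t*>(src.constData()),
                          static_cast<std::size_t>(src.size()));
#else
auto dst = src.toStdU16String();
#endif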

If you expect the strings to be large, you can skip one copy:

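const QString src = ...;
std::u16string dst;
dst.reserve(src.size() + 1); // room for the BOM
// append() has no single-character overload, so use push_back()
dst.push_back(char16_t(QChar::ByteOrderMark));
// pass size() rather than size()+1: the terminator must not become part of
// the string, since std::u16string maintains its own
dst.append(reinterpret_cast<const char16_t*>(src.constData()),
           src.size());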

In both cases, dst is now portable to systems with either endianness.
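The portability comes from the reader checking the BOM. A minimal sketch of the receiving side, assuming the data arrives as a std::u16string whose code units may be byte-swapped, could look like this. The names `stripBomAndFixOrder` and `exampleStripBom` are hypothetical and nothing here is a Qt API.

```cpp
#include <cassert>
#include <string>

// Hypothetical reader-side helper: use a leading BOM to detect the writer's
// byte order, then return the text in native order with the BOM removed.
// Text without a BOM is returned unchanged.
std::u16string stripBomAndFixOrder(std::u16string text)
{
    if (text.empty())
        return text;
    const char16_t first = text[0];
    if (first == 0xFFFE) {
        // The BOM reads byte-swapped, so the writer had the opposite
        // endianness: swap the two bytes of every code unit.
        for (char16_t& unit : text)
            unit = static_cast<char16_t>((unit << 8) | (unit >> 8));
        text.erase(0, 1);
    } else if (first == 0xFEFF) {
        // The BOM reads correctly: just drop it.
        text.erase(0, 1);
    }
    return text;
}

// Example checks; call from your own main().
void exampleStripBom()
{
    std::u16string swapped;
    swapped.push_back(0xFFFE); // byte-swapped BOM
    swapped.push_back(0x4100); // byte-swapped 'A' (U+0041)
    assert(stripBomAndFixOrder(swapped) == u"A");
}
```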

Solution 2

Use this:

QString Widen(const std::string &stdStr)
{
    return QString::fromUtf8(stdStr.data(), stdStr.size());
}
std::string Narrow(const QString &qtStr)
{
    QByteArray utf8 = qtStr.toUtf8();
    return std::string(utf8.data(), utf8.size());
}

Both functions assume that the std::string holds UTF-8 encoded text.
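Passing the size explicitly matters when the data may contain embedded null characters: a conversion that receives only a pointer has to stop at the first '\0'. The standalone sketch below shows the same effect with std::string's own constructors. It involves no Qt, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Pointer only: the constructor reads up to the first '\0'.
std::string fromPointerOnly(const char* data)
{
    return std::string(data);
}

// Pointer plus length: every byte is kept, embedded '\0' included.
std::string fromPointerAndSize(const char* data, std::size_t size)
{
    return std::string(data, size);
}

// Example checks; call from your own main().
void exampleEmbeddedNull()
{
    const char raw[] = {'a', 'b', '\0', 'c', 'd'};
    assert(fromPointerOnly(raw).size() == 2);
    assert(fromPointerAndSize(raw, sizeof raw).size() == 5);
}
```

The same caveat presumably applies to any conversion that goes through constData() alone.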


Oleg Andriyanov

Updated on June 04, 2022

Comments

Oleg Andriyanov 7 months
I know there is plenty of information about converting QString to char*, but I still need some clarification in this question.

Qt provides QTextCodecs to convert QString (which internally stores characters in unicode) to QByteArray, allowing me to retrieve char* which represents the string in some non-unicode encoding. But what should I do when I want to get a unicode QByteArray?
```
QTextCodec* codec = QTextCodec::codecForName("UTF-8");
QString qstr = codec->toUnicode("Юникод");
std::string stdstr(reinterpret_cast<const char*>(qstr.constData()), qstr.size() * 2 );  // * 2 since unicode character is twice longer than char
qDebug() << QString(reinterpret_cast<const QChar*>(stdstr.c_str()), stdstr.size() / 2); // same
```
The above code prints "Юникод", as I expected. But I'd like to know whether that is the right way to get at the unicode char* of the QString. In particular, the reinterpret_casts and size arithmetic in this technique look pretty ugly.
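For comparison, the same raw-byte round trip can be sketched without reinterpret_cast by copying through memcpy. This is a standalone illustration that uses std::u16string as a stand-in for the QString's UTF-16 data. The helper names are hypothetical, and the bytes end up in the machine's native byte order, exactly as in the snippet above.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Copy UTF-16 code units into a byte string, in native byte order.
std::string toNativeBytes(const std::u16string& text)
{
    std::string bytes(text.size() * sizeof(char16_t), '\0');
    if (!text.empty())
        std::memcpy(&bytes[0], text.data(), bytes.size());
    return bytes;
}

// Inverse of toNativeBytes(); a trailing odd byte, if any, is ignored.
std::u16string fromNativeBytes(const std::string& bytes)
{
    std::u16string text(bytes.size() / sizeof(char16_t), u'\0');
    if (!text.empty())
        std::memcpy(&text[0], bytes.data(), text.size() * sizeof(char16_t));
    return text;
}

// Example checks; call from your own main().
void exampleByteRoundTrip()
{
    const std::u16string original = u"\u042E\u043D\u0438"; // "Юни"
    assert(toNativeBytes(original).size() == 6);
    assert(fromNativeBytes(toNativeBytes(original)) == original);
}
```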
- Kuba hasn't forgotten Monica over 8 years
  
  "you mean UTF8 and Unicode are equal" No. Your use of the word Unicode is wrong. Unicode is not an encoding, it's a standard, so talking of a "Unicode std::string" doesn't mean anything. A string by itself can't be unicode compliant. An std::string will have a particular "character" type (usually either 8 or 16 bits wide), and it will have a particular encoding (UCS-2 or UTF-16 for 16 bit characters, usually). The big difference between UCS-2 and UTF-16 is that UCS-2 is fixed-width: one code point per "character". In UTF-16, there may be multiple "characters" per code point.
- Kuba hasn't forgotten Monica over 8 years
  
  The phrase "unicode QByteArray" is meaningless. It is equivalent to saying "wakalixes QByteArray". A byte array can carry text data in some 8-bit encoding, such as Latin1 (ISO/IEC 8859-1), or UTF-8, etc. If you want an 8-bit encoded byte array as a representation of a string, you need to know what encoding is expected by the user of such an array. Only then can you decide how to encode the string.
- Kuba hasn't forgotten Monica over 8 years
  
  Please edit your question's title to indicate what encoding is desired in the std::string, and whether the string is 8- or 16-bits wide.
- Kuba hasn't forgotten Monica over 8 years
  
  OK, presuming that it is indeed std::string and not std::wstring, the string is 8 bit wide, but the encoding question still remains.
Kuba hasn't forgotten Monica over 8 years

There is no such thing as a "unicode byte array" - please stop using this term, it confuses everyone. Unicode is a standard, not an encoding. There's UTF-16 and UCS-2, and the latter is what QString is internally encoded as. UCS-2 is a subset of UTF-16 for code points 0-0xFFFF. Since a QString can't carry code points outside of that range, you don't need to do anything special to get UTF-16 out of a QString. Just use the string's constData().
Nejat over 8 years

@KubaOber Using constData() also gets you the BOM at the beginning, which is a mess. Using the mentioned approach, you can get the QByteArray for the string and also use different encoding options.
Kuba hasn't forgotten Monica over 8 years

Are you sure that QString stores the embedded BOM?
Nejat over 8 years

Yeah definitely. You can see stackoverflow.com/questions/3602548/…
Kuba hasn't forgotten Monica over 8 years

The first answer in your link seems to contradict you.
Kuba hasn't forgotten Monica over 8 years

In fact, I've just checked, and QString does not carry an embedded BOM. It'd be a waste of space. This code would dump out the BOM; it doesn't: QString str1(QStringLiteral("A")); const QChar * p = str1.constData(); while (p->unicode()) qDebug() << *p++;
Len almost 5 years

Why is stdStr.size() necessary when calling fromUtf8? Does that result in storing the terminating null in the QString? Otherwise, it appears fromUtf8 defaults to reading up to the terminating null...