Do C++11 regular expressions work with UTF-8 strings?


Solution 1

You would need to test your compiler and standard library on the system you are using, but in theory it will be supported if your system has a UTF-8 locale. The following test returned true for me with Clang on OS X.

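#include <locale>
#include <regex>
#include <string>

bool test_unicode()
{
    // Save the current global locale so it can be restored afterwards.
    std::locale old;
    // Throws std::runtime_error if the system has no en_US.UTF-8 locale.
    std::locale::global(std::locale("en_US.UTF-8"));

    // The regex picks up the global locale at construction time.
    std::regex pattern("[[:alpha:]]+", std::regex_constants::extended);
    bool result = std::regex_match(std::string("abcdéfg"), pattern);

    std::locale::global(old);

    return result;
}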

NOTE: This was compiled from a source file that was UTF-8 encoded.


Just to be safe, I also tried a string with the é written as explicit hex escapes. It also worked.

bool test_unicode2()
{
    std::locale old;
    std::locale::global(std::locale("en_US.UTF-8"));

    std::regex pattern("[[:alpha:]]+", std::regex_constants::extended);
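    // The literal is split so that "\xA9" is not parsed together with the
    // following 'f' as one longer hex escape.
    bool result = std::regex_match(std::string("abcd\xC3\xA9""fg"), pattern);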

    std::locale::global(old);

    return result;
}
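
As a variation on the two tests above, the locale can be attached to the regex object itself instead of swapping the global locale, and a missing locale can be reported instead of escaping as an exception. This is only a sketch: the function name match_alpha_in_locale and its int return convention are made up for illustration, and whether "é" matches [[:alpha:]] still depends on your standard library, as the comments below show.

```cpp
#include <locale>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of the original answer.
// Returns 1 on match, 0 on no match, -1 if the named locale is unavailable.
int match_alpha_in_locale(const std::string& text, const char* locale_name)
{
    try {
        std::regex pattern;
        // imbue() must come before assign(): imbuing resets the pattern.
        pattern.imbue(std::locale(locale_name));
        pattern.assign("[[:alpha:]]+", std::regex_constants::extended);
        return std::regex_match(text, pattern) ? 1 : 0;
    } catch (const std::runtime_error&) {
        // std::locale's constructor throws std::runtime_error for an unknown
        // locale name. std::regex_error derives from it too, so an invalid
        // pattern would also end up here.
        return -1;
    }
}
```

Calling match_alpha_in_locale("abcd\xC3\xA9""fg", "en_US.UTF-8") then tells you in one step whether the locale exists and whether the library treats the multibyte character as alphabetic, without touching the global locale.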

Update: test_unicode() still works for me with the following setup.

$ file regex-test.cpp 
regex-test.cpp: UTF-8 Unicode c program text

$ g++ --version
Configured with: --prefix=/Applications/Xcode-8.2.1.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode-8.2.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Solution 2

C++11 regular expressions will "work with" UTF-8 just fine, for a minimal definition of "work". If you want "complete" Unicode regular expression support for UTF-8 strings, you will be better off with a library that supports it directly, such as PCRE (http://www.pcre.org/).
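
To make the "minimal definition" concrete, here is a small sketch of what std::regex does with UTF-8 when no special locale is set: it sees a sequence of char values (bytes), not code points. The function names are invented for this illustration. A literal UTF-8 sequence in the pattern is found because its bytes compare equal, but "." consumes a single byte, so ".." matches the single two-byte character "é".

```cpp
#include <regex>
#include <string>

// "é" is the two bytes 0xC3 0xA9 in UTF-8. Each "." consumes one char,
// so the two-dot pattern matches this single character.
bool two_dots_match_one_character()
{
    return std::regex_match(std::string("\xC3\xA9"), std::regex(".."));
}

// A literal UTF-8 sequence in the pattern is compared byte by byte,
// so searching for it inside a UTF-8 string succeeds.
bool finds_literal_utf8()
{
    return std::regex_search(std::string("abcd\xC3\xA9""fg"),
                             std::regex("\xC3\xA9"));
}

// A single "." cannot match the whole two-byte character.
bool one_dot_matches_one_character()
{
    return std::regex_match(std::string("\xC3\xA9"), std::regex("."));
}
```

So byte-exact searching is usable, while anything that counts or classifies characters is where a Unicode-aware library starts to matter.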


Author: Mark

I'm just another guy interested in programming

Updated on July 11, 2022

Comments

  • Mark
    Mark almost 2 years

    If I want to use C++11's regular expressions with unicode strings, will they work with char* as UTF-8 or do I have to convert them to a wchar_t* string?

  • R. Martinho Fernandes
    R. Martinho Fernandes almost 12 years
    You don't need to save the source code in UTF-8 if you use u8"abcdéfg".
  • Viet
    Viet over 11 years
    Is the locale really that important? What if you ignore the locale altogether?
  • Jeffery Thomas
    Jeffery Thomas over 11 years
    @Viet There is always a locale. If you don't explicitly set the locale you need, then regex will process with the preexisting locale. I would not expect the regex to work with UTF-8 strings if the locale is not compatible with UTF-8.
  • Viet
    Viet over 11 years
    @Jeffery Thomas: Thanks. I googled a bit and found that this is applicable to Windows as well.
  • DevSolar
    DevSolar about 11 years
    @ildjarn: ...which needs ICU support compiled in, which unfortunately is not the rule on all platforms, and can be quite a b**** to get to work. ICU, however, has RegEx support of its own...
  • R. Martinho Fernandes
    R. Martinho Fernandes about 11 years
    Regex matching is not a "substring operation".
  • R. Martinho Fernandes
    R. Martinho Fernandes about 11 years
    "abcd\0xC3\0xA9fg" is a string with two embedded null bytes. What you want is probably "abcd\xC3\xA9""fg". Now, I tried this with clang on my Linux box and it quite clearly doesn't work :( gist.github.com/rmartinho/5349044
  • R. Martinho Fernandes
    R. Martinho Fernandes about 11 years
    And then I did some tests on a MacOS box and learned that while [[:alpha:]] can deal with multibyte characters fine, something as basic as . cannot: the regex ".." matches the string u8"é" (or "\xC3\xA9"), which is just unacceptable.
  • jfs
    jfs about 7 years
    std::regex_match(u8"abcdéfg", std::regex("[[:alpha:]]+")) fails for me (g++ 5.4.0 on Ubuntu). But std::regex_match(L"abcdéfg", std::wregex(L"[[:alpha:]]+")) works. (utf-8 locale is enabled in both cases)
  • Jeffery Thomas
    Jeffery Thomas about 7 years
    @J.F.Sebastian I posted my stats. Ensure that the C++ source file is UTF-8 encoded.
  • jfs
    jfs about 7 years
    @JefferyThomas: yes, I'm sure that the source code is utf-8 (though it is not necessary with u8""). Both test_unicode() and test_unicode2() return false (g++ -std=c++11 *.cc && ./a.out). Whatever ideone uses produces the same result.
  • John Greene
    John Greene over 6 years
    GNU C++ libc++ regex library would not work for Japanese (or other multi-byte) characters. For that, you would have to use ICU library.
  • Jeffery Thomas
    Jeffery Thomas over 6 years
    @EgbertS The code I presented is for UTF-8 (which is a multi-byte encoding). If the Japanese text is encoded in a UTF-8 string, the code will work. If you are using another encoding (like Shift-JIS) you would need to convert it to UTF-8.
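
On the original question of char* versus wchar_t*: one route mentioned in the comments is std::wregex on wide strings. The sketch below decodes UTF-8 into a std::wstring by hand and matches with std::wregex, so that "." counts whole characters rather than bytes. utf8_to_wide and wide_match are hypothetical helpers written for this example. They assume well-formed UTF-8 and that every code point fits in one wchar_t (always true where wchar_t is 32 bits, only for the Basic Multilingual Plane where it is 16 bits). Whether [[:alpha:]] accepts non-ASCII letters still depends on the locale.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: decode well-formed UTF-8 into one wchar_t per code point.
// No validation is done; malformed input gives unspecified characters.
std::wstring utf8_to_wide(const std::string& in)
{
    std::wstring out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        // The lead byte's high bits encode the sequence length.
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        unsigned long cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Hypothetical helper: match a UTF-8 string against a wide pattern.
bool wide_match(const std::string& utf8, const std::wstring& pattern)
{
    return std::regex_match(utf8_to_wide(utf8), std::wregex(pattern));
}
```

With this, wide_match("\xC3\xA9", L".") succeeds and wide_match("\xC3\xA9", L"..") fails, which is the opposite of the byte-level behaviour of std::regex on the same input.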