How to uppercase/lowercase UTF-8 characters in C++?

Solution 1

There is no standard way to do Unicode case conversion in C++. There are ways that work on some C++ implementations, but the standard doesn't require them to.

If you want guaranteed Unicode case conversion, you will need to use a library such as ICU or Boost.Locale (essentially ICU behind a more C++-like interface).
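
To illustrate the limitation, here is a minimal sketch of what plain byte-wise conversion can safely do on UTF-8 text. The helper name AsciiToUpper is hypothetical (it is not part of the standard library, ICU, or Boost.Locale); it only touches the ASCII letters a-z and passes every byte of a multibyte sequence through unchanged.

```cpp
#include <string>

// Hypothetical helper for illustration: uppercases only ASCII a-z.
// Bytes >= 0x80 (lead and continuation bytes of multibyte UTF-8
// sequences) are deliberately left untouched.
std::string AsciiToUpper(std::string s)
{
    for (char& c : s) {
        if (c >= 'a' && c <= 'z')
            c = static_cast<char>(c - 'a' + 'A');
    }
    return s;
}
```

Called on the UTF-8 bytes of "grüß" ("gr\xc3\xbc\xc3\x9f"), this returns "GR\xc3\xbc\xc3\x9f": the ASCII letters are converted, but ü and ß stay as they were. Handling those correctly is exactly what a Unicode-aware library is for.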

Solution 2

This code is a carefully tested UTF-8 case conversion and case-insensitive comparison.

It is believed to be correct (if you find any bugs, please report them).

The functions below cover the case-sensitive character sets in UTF-8 and show how to use the conversion for comparison.

unsigned char* StrToUprExt(unsigned char* pString) (posted as a separate answer below, for lack of space in this one)
unsigned char* StrToLwrExt(unsigned char* pString)
int StrnCiCmp(const char* s1, const char* s2, size_t ztCount)
int StrCiCmp(const char* s1, const char* s2)
char* StrCiStr(const char* s1, const char* s2)
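
The compare functions take const char*, so they cannot lowercase their arguments in place. One plausible structure, sketched here as an assumption rather than as the answer's actual implementation, is to lowercase private copies and compare those. LowerInPlaceStandIn is a hypothetical ASCII-only stand-in for StrToLwrExt() so that the sketch is self-contained, and CompareIgnoringCase is likewise a made-up name.

```cpp
#include <cstring>
#include <string>

// Hypothetical stand-in for StrToLwrExt(): lowers ASCII A-Z only.
// The real function also converts multibyte UTF-8 characters.
static unsigned char* LowerInPlaceStandIn(unsigned char* s)
{
    for (unsigned char* p = s; p && *p; ++p) {
        if (*p >= 0x41 && *p <= 0x5a)
            *p = static_cast<unsigned char>(*p + 0x20);
    }
    return s;
}

// Sketch of a case-insensitive compare: lowercase copies of both
// inputs, then compare the copies byte by byte with strcmp().
int CompareIgnoringCase(const char* s1, const char* s2)
{
    std::string a(s1);
    std::string b(s2);
    LowerInPlaceStandIn(reinterpret_cast<unsigned char*>(&a[0]));
    LowerInPlaceStandIn(reinterpret_cast<unsigned char*>(&b[0]));
    return std::strcmp(a.c_str(), b.c_str());
}
```

With the stand-in, CompareIgnoringCase("Hello", "hELLO") returns 0. Copying costs an allocation per call; it keeps the sketch short and leaves the caller's strings unmodified.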

These characters are to be converted:

ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİIJĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽƁƂƄƆƇƊƋƎƏƐƑƓƔƖƗƘƜƝƠƢƤƧƩƬƮƯƱƲƳƵƷƸƼDŽDžLJLjNJNjǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮDZDzǴǶǷǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺȻȽȾɁɃɄɅɆɈɊɌɎͰͲͶͿΆΈΉΊΌΎΏΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫϏϘϚϜϞϠϢϤϦϨϪϬϮϴϷϹϺϽϾϿЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯѠѢѤѦѨѪѬѮѰѲѴѶѸѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼҾӀӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸӺӼӾԀԂԄԆԈԊԌԎԐԒԔԖԘԚԜԞԠԢԤԦԨԪԬԮԱԲԳԴԵԶԷԸԹԺԻԼԽԾԿՀՁՂՃՄՅՆՇՈՉՊՋՌՍՎՏՐՑՒՓՔՕՖႠႡႢႣႤႥႦႧႨႩႪႫႬႭႮႯႰႱႲႳႴႵႶႷႸႹႺႻႼႽႾႿჀჁჂჃჄჅჇჍᎠᎡᎢᎣᎤᎥᎦᎧᎨᎩᎪᎫᎬᎭᎮᎯᎰᎱᎲᎳᎴᎵᎶᎷᎸᎹᎺᎻᎼᎽᎾᎿᏀᏁᏂᏃᏄᏅᏆᏇᏈᏉᏊᏋᏌᏍᏎᏏᏐᏑᏒᏓᏔᏕᏖᏗᏘᏙᏚᏛᏜᏝᏞᏟᏠᏡᏢᏣᏤᏥᏦᏧᏨᏩᏪᏫᏬᏭᏮᏯᏰᏱᏲᏳᏴᏵᲐᲑᲒᲓᲔᲕᲖᲗᲘᲙᲚᲛᲜᲝᲞᲟᲠᲡᲢᲣᲤᲥᲦᲧᲨᲩᲪᲫᲬᲭᲮᲯᲰᲱᲲᲳᲴᲵᲶᲷᲸᲹᲺᲽᲾᲿḀḂḄḆḈḊḌḎḐḒḔḖḘḚḜḞḠḢḤḦḨḪḬḮḰḲḴḶḸḺḼḾṀṂṄṆṈṊṌṎṐṒṔṖṘṚṜṞṠṢṤṦṨṪṬṮṰṲṴṶṸṺṼṾẀẂẄẆẈẊẌẎẐẒẔẞẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼẾỀỂỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪỬỮỰỲỴỶỸỺỼỾἈἉἊἋἌἍἎἏἘἙἚἛἜἝἨἩἪἫἬἭἮἯἸἹἺἻἼἽἾἿὈὉὊὋὌὍὙὛὝὟὨὩὪὫὬὭὮὯᾈᾉᾊᾋᾌᾍᾎᾏᾘᾙᾚᾛᾜᾝᾞᾟᾨᾩᾪᾫᾬᾭᾮᾯᾸᾹᾺΆᾼῈΈῊΉῌῘῙῚΊῨῩῪΎῬῸΌῺΏῼⰀⰁⰂⰃⰄⰅⰆⰇⰈⰉⰊⰋⰌⰍⰎⰏⰐⰑⰒⰓⰔⰕⰖⰗⰘⰙⰚⰛⰜⰝⰞⰟⰠⰡⰢⰣⰤⰥⰦⰧⰨⰩⰪⰫⰬⰭⰮⱠⱢⱣⱤⱧⱩⱫⱭⱮⱯⱰⱲⱵⱾⱿⲀⲂⲄⲆⲈⲊⲌⲎⲐⲒⲔⲖⲘⲚⲜⲞⲠⲢⲤⲦⲨⲪⲬⲮⲰⲲⲴⲶⲸⲺⲼⲾⳀⳂⳄⳆⳈⳊⳌⳎⳐⳒⳔⳖⳘⳚⳜⳞⳠⳢⳫⳭⳲⴀⴁⴂⴃⴄⴅⴆⴇⴈⴉⴊⴋⴌⴍⴎⴏⴐⴑⴒⴓⴔⴕⴖⴗⴘⴙⴚⴛⴜⴝⴞⴟⴠⴡⴢⴣⴤⴥⴧⴭꙀꙂꙄꙆꙈꙊꙌꙎꙐꙒꙔꙖꙘꙚꙜꙞꙠꙢꙤꙦꙨꙪꙬꚀꚂꚄꚆꚈꚊꚌꚎꚐꚒꚔꚖꚘꚚꜢꜤꜦꜨꜪꜬꜮꜲꜴꜶꜸꜺꜼꜾꝀꝂꝄꝆꝈꝊꝌꝎꝐꝒꝔꝖꝘꝚꝜꝞꝠꝢꝤꝦꝨꝪꝬꝮꝹꝻꝽꝾꞀꞂꞄꞆꞋꞍꞐꞒꞖꞘꞚꞜꞞꞠꞢꞤꞦꞨꞪꞫꞬꞭꞮꞰꞱꞲꞳꞴꞶꞸꞺꞼꞾꟂꟄꟅꟆꟇꟉꟵABCDEFGHIJKLMNOPQRSTUVWXYZ𐐀𐐁𐐂𐐃𐐄𐐅𐐆𐐇𐐈𐐉𐐊𐐋𐐌𐐍𐐎𐐏𐐐𐐑𐐒𐐓𐐔𐐕𐐖𐐗𐐘𐐙𐐚𐐛𐐜𐐝𐐞𐐟𐐠𐐡𐐢𐐣𐐤𐐥𐐦𐐧𐒰𐒱𐒲𐒳𐒴𐒵𐒶𐒷𐒸𐒹𐒺𐒻𐒼𐒽𐒾𐒿𐓀𐓁𐓂𐓃𐓄𐓅𐓆𐓇𐓈𐓉𐓊𐓋𐓌𐓍𐓎𐓏𐓐𐓑𐓒𐓓𐲀𐲁𐲂𐲃𐲄𐲅𐲆𐲇𐲈𐲉𐲊𐲋𐲌𐲍𐲎𐲏𐲐𐲑𐲒𐲓𐲔𐲕𐲖𐲗𐲘𐲙𐲚𐲛𐲜𐲝𐲞𐲟𐲠𐲡𐲢𐲣𐲤𐲥𐲦𐲧𐲨𐲩𐲪𐲫𐲬𐲭𐲮𐲯𐲰𐲱𐲲𑢠𑢡𑢢𑢣𑢤𑢥𑢦𑢧𑢨𑢩𑢪𑢫𑢬𑢭𑢮𑢯𑢰𑢱𑢲𑢳𑢴𑢵𑢶𑢷𑢸𑢹𑢺𑢻𑢼𑢽𑢾𑢿𖹀𖹁𖹂𖹃𖹄𖹅𖹆𖹇𖹈𖹉𖹊𖹋𖹌𖹍𖹎𖹏𖹐𖹑𖹒𖹓𖹔𖹕𖹖𖹗𖹘𖹙𖹚𖹛𖹜𖹝𖹞𖹟𞤀𞤁𞤂𞤃𞤄𞤅𞤆𞤇𞤈𞤉𞤊𞤋𞤌𞤍𞤎𞤏𞤐𞤑𞤒𞤓𞤔𞤕𞤖𞤗𞤘𞤙𞤚𞤛𞤜𞤝𞤞𞤟𞤠𞤡

Remarks

It treats umlaut and other accented letters as distinct characters: a and á are different. Treating them as equal in comparisons would require a far more complicated solution. Some accented characters exist only in lowercase or only in uppercase, and these are left unchanged.

  • There may be UTF-8 characters with lower/upper case conversions that I have not discovered.
  • About a hundred lowercase and uppercase characters have no partner and therefore cannot be converted.
  • Of the four Georgian scripts, Asomtavruli, Mtavruli, and Nuskhuri are all converted to Mkhedruli by StrToLwrExt(), so that texts in the same language with the same letters and content compare as equal. StrToUprExt() converts Mkhedruli to Mtavruli.
  • There are 21 pairs of characters in which one side is a two-byte and the other a three-byte UTF-8 sequence. They are not converted, because changing the byte length would risk a sync failure in the strstr() functions.

Capital = Small

0xc8 0xba      = 0xe2 0xb1 0xa5
0xc8 0xbe      = 0xe2 0xb1 0xa6
0xe1 0xba 0x9e = 0xc3 0x9f
0xe2 0xb1 0xa2 = 0xc9 0xab
0xe2 0xb1 0xa4 = 0xc9 0xbd
0xe2 0xb1 0xad = 0xc9 0x91
0xe2 0xb1 0xae = 0xc9 0xb1
0xe2 0xb1 0xaf = 0xc9 0x90
0xe2 0xb1 0xb0 = 0xc9 0x92
0xe2 0xb1 0xbe = 0xc8 0xbf
0xe2 0xb1 0xbf = 0xc9 0x80
0xea 0x9e 0x8d = 0xc9 0xa5
0xea 0x9e 0xaa = 0xc9 0xa6
0xea 0x9e 0xab = 0xc9 0x9c
0xea 0x9e 0xac = 0xc9 0xa1
0xea 0x9e 0xad = 0xc9 0xac
0xea 0x9e 0xae = 0xc9 0xaa
0xea 0x9e 0xb0 = 0xca 0x9e
0xea 0x9e 0xb1 = 0xca 0x87
0xea 0x9e 0xb2 = 0xca 0x9d
0xea 0x9f 0x85 = 0xca 0x82
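
The length mismatch follows directly from the UTF-8 encoding rules: code points below U+0800 take two bytes, those from U+0800 up take three. A minimal sketch of an encoder makes this visible; Utf8Encode is a hypothetical helper written for this illustration, and it does not reject surrogates or otherwise validate its input.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one code point as UTF-8.
// Returns an empty string for values above U+10FFFF.
// Sketch only: surrogate code points are not rejected.
std::string Utf8Encode(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xc0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xe0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else if (cp < 0x110000) {            // 4 bytes
        out += static_cast<char>(0xf0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    }
    return out;
}
```

For the first pair in the list, Utf8Encode(0x023A) yields the two bytes 0xc8 0xba, while Utf8Encode(0x2C65) yields the three bytes 0xe2 0xb1 0xa5. Converting one to the other in place would therefore shift every following byte of the string.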

The code does not handle recovery from incorrect multibyte characters in the strings (a rare problem, but one specific to multibyte strings); it will simply resync. That is not a topic for this answer.

A buffer overrun at runtime is possible, though unlikely. It happens when a string ends with an incomplete multibyte character, which may occur if strings are cut during processing in the middle of a multibyte character. With today's abundant memory, the simplest remedy is to allocate memory for the complete string. Otherwise, if you need the code to be safe against buffer overruns, you must handle that case yourself. It is not a topic for this answer.
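
If you do want to guard against that case, one option is to check the tail of the buffer before converting. The following is a minimal sketch under that assumption; EndsWithTruncatedUtf8 is a hypothetical helper, not part of the answer's code, and it inspects only the final sequence without validating the rest of the string.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if the last UTF-8 sequence in
// s[0..len) is cut short, i.e. its lead byte announces more
// continuation bytes than are actually present.
bool EndsWithTruncatedUtf8(const unsigned char* s, std::size_t len)
{
    // Walk back over at most three continuation bytes (10xxxxxx).
    std::size_t i = len;
    std::size_t cont = 0;
    while (i > 0 && cont < 3 && (s[i - 1] & 0xc0) == 0x80) {
        --i;
        ++cont;
    }
    if (i == 0)
        return false; // empty, or nothing but continuation bytes

    const unsigned char lead = s[i - 1];
    std::size_t need;
    if (lead < 0x80)
        need = 0;                      // ASCII
    else if ((lead & 0xe0) == 0xc0)
        need = 1;                      // two-byte sequence
    else if ((lead & 0xf0) == 0xe0)
        need = 2;                      // three-byte sequence
    else if ((lead & 0xf8) == 0xf0)
        need = 3;                      // four-byte sequence
    else
        return false;                  // malformed rather than truncated

    return cont < need;
}
```

A caller could drop the dangling bytes or refuse the string when this returns true, before handing the buffer to the conversion functions.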

unsigned char* StrToLwrExt(unsigned char* pString)
{
    unsigned char* p = pString;
    unsigned char* pExtChar = 0;

    if (pString && *pString) {
        while (*p) {
            if ((*p >= 0x41) && (*p <= 0x5a)) /* US ASCII */
                (*p) += 0x20;
            else if (*p > 0xc0) {
                pExtChar = p;
                p++;
                switch (*pExtChar) {
                case 0xc3: /* Latin 1 */
                    if ((*p >= 0x80)
                        && (*p <= 0x9e)
                        && (*p != 0x97))
                        (*p) += 0x20; /* US ASCII shift */
                    break;
                case 0xc4: /* Latin ext */
                    if (((*p >= 0x80)
                        && (*p <= 0xb7)
                        && (*p != 0xb0))
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0xb9)
                        && (*p <= 0xbe)
                        && (*p % 2)) /* Odd */
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xbf) {
                        *pExtChar = 0xc5;
                        (*p) = 0x80;
                    }
                    break;
                case 0xc5: /* Latin ext */
                    if ((*p >= 0x81)
                        && (*p <= 0x88)
                        && (*p % 2)) /* Odd */
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0x8a)
                        && (*p <= 0xb7)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xb8) {
                        *pExtChar = 0xc3;
                        (*p) = 0xbf;
                    }
                    else if ((*p >= 0xb9)
                        && (*p <= 0xbe)
                        && (*p % 2)) /* Odd */
                        (*p)++; /* Next char is lwr */
                    break;
                case 0xc6: /* Latin ext */
                    switch (*p) {
                    case 0x81:
                        *pExtChar = 0xc9;
                        (*p) = 0x93;
                        break;
                    case 0x86:
                        *pExtChar = 0xc9;
                        (*p) = 0x94;
                        break;
                    case 0x89:
                        *pExtChar = 0xc9;
                        (*p) = 0x96;
                        break;
                    case 0x8a:
                        *pExtChar = 0xc9;
                        (*p) = 0x97;
                        break;
                    case 0x8e:
                        *pExtChar = 0xc9;
                        (*p) = 0x98;
                        break;
                    case 0x8f:
                        *pExtChar = 0xc9;
                        (*p) = 0x99;
                        break;
                    case 0x90:
                        *pExtChar = 0xc9;
                        (*p) = 0x9b;
                        break;
                    case 0x93:
                        *pExtChar = 0xc9;
                        (*p) = 0xa0;
                        break;
                    case 0x94:
                        *pExtChar = 0xc9;
                        (*p) = 0xa3;
                        break;
                    case 0x96:
                        *pExtChar = 0xc9;
                        (*p) = 0xa9;
                        break;
                    case 0x97:
                        *pExtChar = 0xc9;
                        (*p) = 0xa8;
                        break;
                    case 0x9c:
                        *pExtChar = 0xc9;
                        (*p) = 0xaf;
                        break;
                    case 0x9d:
                        *pExtChar = 0xc9;
                        (*p) = 0xb2;
                        break;
                    case 0x9f:
                        *pExtChar = 0xc9;
                        (*p) = 0xb5;
                        break;
                    case 0xa9:
                        *pExtChar = 0xca;
                        (*p) = 0x83;
                        break;
                    case 0xae:
                        *pExtChar = 0xca;
                        (*p) = 0x88;
                        break;
                    case 0xb1:
                        *pExtChar = 0xca;
                        (*p) = 0x8a;
                        break;
                    case 0xb2:
                        *pExtChar = 0xca;
                        (*p) = 0x8b;
                        break;
                    case 0xb7:
                        *pExtChar = 0xca;
                        (*p) = 0x92;
                        break;
                    case 0x82:
                    case 0x84:
                    case 0x87:
                    case 0x8b:
                    case 0x91:
                    case 0x98:
                    case 0xa0:
                    case 0xa2:
                    case 0xa4:
                    case 0xa7:
                    case 0xac:
                    case 0xaf:
                    case 0xb3:
                    case 0xb5:
                    case 0xb8:
                    case 0xbc:
                        (*p)++; /* Next char is lwr */
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xc7: /* Latin ext */
                    if (*p == 0x84)
                        (*p) = 0x86;
                    else if (*p == 0x85)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0x87)
                        (*p) = 0x89;
                    else if (*p == 0x88)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0x8a)
                        (*p) = 0x8c;
                    else if (*p == 0x8b)
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0x8d)
                        && (*p <= 0x9c)
                        && (*p % 2)) /* Odd */
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0x9e)
                        && (*p <= 0xaf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xb1)
                        (*p) = 0xb3;
                    else if (*p == 0xb2)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xb4)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xb6) {
                        *pExtChar = 0xc6;
                        (*p) = 0x95;
                    }
                    else if (*p == 0xb7) {
                        *pExtChar = 0xc6;
                        (*p) = 0xbf;
                    }
                    else if ((*p >= 0xb8)
                        && (*p <= 0xbf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    break;
                case 0xc8: /* Latin ext */
                    if ((*p >= 0x80)
                        && (*p <= 0x9f)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xa0) {
                        *pExtChar = 0xc6;
                        (*p) = 0x9e;
                    }
                    else if ((*p >= 0xa2)
                        && (*p <= 0xb3)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xbb)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xbd) {
                        *pExtChar = 0xc6;
                        (*p) = 0x9a;
                    }
                    /* 0xba three byte small 0xe2 0xb1 0xa5 */
                    /* 0xbe three byte small 0xe2 0xb1 0xa6 */
                    break;
                case 0xc9: /* Latin ext */
                    if (*p == 0x81)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0x83) {
                        *pExtChar = 0xc6;
                        (*p) = 0x80;
                    }
                    else if (*p == 0x84) {
                        *pExtChar = 0xca;
                        (*p) = 0x89;
                    }
                    else if (*p == 0x85) {
                        *pExtChar = 0xca;
                        (*p) = 0x8c;
                    }
                    else if ((*p >= 0x86)
                        && (*p <= 0x8f)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    break;
                case 0xcd: /* Greek & Coptic */
                    switch (*p) {
                    case 0xb0:
                    case 0xb2:
                    case 0xb6:
                        (*p)++; /* Next char is lwr */
                        break;
                    case 0xbf:
                        *pExtChar = 0xcf;
                        (*p) = 0xb3;
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xce: /* Greek & Coptic */
                    if (*p == 0x86)
                        (*p) = 0xac;
                    else if (*p == 0x88)
                        (*p) = 0xad;
                    else if (*p == 0x89)
                        (*p) = 0xae;
                    else if (*p == 0x8a)
                        (*p) = 0xaf;
                    else if (*p == 0x8c) {
                        *pExtChar = 0xcf;
                        (*p) = 0x8c;
                    }
                    else if (*p == 0x8e) {
                        *pExtChar = 0xcf;
                        (*p) = 0x8d;
                    }
                    else if (*p == 0x8f) {
                        *pExtChar = 0xcf;
                        (*p) = 0x8e;
                    }
                    else if ((*p >= 0x91)
                        && (*p <= 0x9f))
                        (*p) += 0x20; /* US ASCII shift */
                    else if ((*p >= 0xa0)
                        && (*p <= 0xab)
                        && (*p != 0xa2)) {
                        *pExtChar = 0xcf;
                        (*p) -= 0x20;
                    }
                    break;
                case 0xcf: /* Greek & Coptic */
                    if (*p == 0x8f)
                        (*p) = 0x97;
                    else if ((*p >= 0x98)
                        && (*p <= 0xaf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xb4) {
                        (*p) = 0x91;
                    }
                    else if (*p == 0xb7)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xb9)
                        (*p) = 0xb2;
                    else if (*p == 0xba)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xbd) {
                        *pExtChar = 0xcd;
                        (*p) = 0xbb;
                    }
                    else if (*p == 0xbe) {
                        *pExtChar = 0xcd;
                        (*p) = 0xbc;
                    }
                    else if (*p == 0xbf) {
                        *pExtChar = 0xcd;
                        (*p) = 0xbd;
                    }
                    break;
                case 0xd0: /* Cyrillic */
                    if ((*p >= 0x80)
                        && (*p <= 0x8f)) {
                        *pExtChar = 0xd1;
                        (*p) += 0x10;
                    }
                    else if ((*p >= 0x90)
                        && (*p <= 0x9f))
                        (*p) += 0x20; /* US ASCII shift */
                    else if ((*p >= 0xa0)
                        && (*p <= 0xaf)) {
                        *pExtChar = 0xd1;
                        (*p) -= 0x20;
                    }
                    break;
                case 0xd1: /* Cyrillic supplement */
                    if ((*p >= 0xa0)
                        && (*p <= 0xbf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    break;
                case 0xd2: /* Cyrillic supplement */
                    if (*p == 0x80)
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0x8a)
                        && (*p <= 0xbf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    break;
                case 0xd3: /* Cyrillic supplement */
                    if (*p == 0x80)
                        (*p) = 0x8f;
                    else if ((*p >= 0x81)
                        && (*p <= 0x8e)
                        && (*p % 2)) /* Odd */
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0x90)
                        && (*p <= 0xbf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    break;
                case 0xd4: /* Cyrillic supplement & Armenian */
                    if ((*p >= 0x80)
                        && (*p <= 0xaf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0xb1)
                        && (*p <= 0xbf)) {
                        *pExtChar = 0xd5;
                        (*p) -= 0x10;
                    }
                    break;
                case 0xd5: /* Armenian */
                    if ((*p >= 0x80)
                        && (*p <= 0x8f)) {
                        (*p) += 0x30;
                    }
                    else if ((*p >= 0x90)
                        && (*p <= 0x96)) {
                        *pExtChar = 0xd6;
                        (*p) -= 0x10;
                    }
                    break;
                case 0xe1: /* Three byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0x82: /* Georgian asomtavruli */
                        if ((*p >= 0xa0)
                            && (*p <= 0xbf)) {
                            *pExtChar = 0x83;
                            (*p) -= 0x10;
                        }
                        break;
                    case 0x83: /* Georgian asomtavruli */
                        if (((*p >= 0x80)
                            && (*p <= 0x85))
                            || (*p == 0x87)
                            || (*p == 0x8d))
                            (*p) += 0x30;
                        break;
                    case 0x8e: /* Cherokee */
                        if ((*p >= 0xa0)
                            && (*p <= 0xaf)) {
                            *(p - 2) = 0xea;
                            *pExtChar = 0xad;
                            (*p) += 0x10;
                        }
                        else if ((*p >= 0xb0)
                            && (*p <= 0xbf)) {
                            *(p - 2) = 0xea;
                            *pExtChar = 0xae;
                            (*p) -= 0x30;
                        }
                        break;
                    case 0x8f: /* Cherokee */
                        if ((*p >= 0x80)
                            && (*p <= 0xaf)) {
                            *(p - 2) = 0xea;
                            *pExtChar = 0xae;
                            (*p) += 0x10;
                        }
                        else if ((*p >= 0xb0)
                            && (*p <= 0xb5)) {
                            (*p) += 0x08;
                        }
                        break;
                    case 0xb2: /* Georgian mtavruli */
                        if (((*p >= 0x90)
                            && (*p <= 0xba))
                            || (*p == 0xbd)
                            || (*p == 0xbe)
                            || (*p == 0xbf))
                            *pExtChar = 0x83;
                        break;
                    case 0xb8: /* Latin ext */
                        if ((*p >= 0x80)
                            && (*p <= 0xbf)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0xb9: /* Latin ext */
                        if ((*p >= 0x80)
                            && (*p <= 0xbf)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0xba: /* Latin ext */
                        if ((*p >= 0x80)
                            && (*p <= 0x94)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        else if ((*p >= 0xa0)
                            && (*p <= 0xbf)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        /* 0x9e Two byte small 0xc3 0x9f */
                        break;
                    case 0xbb: /* Latin ext */
                        if ((*p >= 0x80)
                            && (*p <= 0xbf)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0xbc: /* Greek ex */
                        if ((*p >= 0x88)
                            && (*p <= 0x8f))
                            (*p) -= 0x08;
                        else if ((*p >= 0x98)
                            && (*p <= 0x9d))
                            (*p) -= 0x08;
                        else if ((*p >= 0xa8)
                            && (*p <= 0xaf))
                            (*p) -= 0x08;
                        else if ((*p >= 0xb8)
                            && (*p <= 0xbf))
                            (*p) -= 0x08;
                        break;
                    case 0xbd: /* Greek ex */
                        if ((*p >= 0x88)
                            && (*p <= 0x8d))
                            (*p) -= 0x08;
                        else if ((*p == 0x99)
                            || (*p == 0x9b)
                            || (*p == 0x9d)
                            || (*p == 0x9f))
                            (*p) -= 0x08;
                        else if ((*p >= 0xa8)
                            && (*p <= 0xaf))
                            (*p) -= 0x08;
                        break;
                    case 0xbe: /* Greek ex */
                        if ((*p >= 0x88)
                            && (*p <= 0x8f))
                            (*p) -= 0x08;
                        else if ((*p >= 0x98)
                            && (*p <= 0x9f))
                            (*p) -= 0x08;
                        else if ((*p >= 0xa8)
                            && (*p <= 0xaf))
                            (*p) -= 0x08;
                        else if ((*p >= 0xb8)
                            && (*p <= 0xb9))
                            (*p) -= 0x08;
                        else if ((*p >= 0xba)
                            && (*p <= 0xbb)) {
                            *(p - 1) = 0xbd;
                            (*p) -= 0x0a;
                        }
                        else if (*p == 0xbc)
                            (*p) -= 0x09;
                        break;
                    case 0xbf: /* Greek ex */
                        if ((*p >= 0x88)
                            && (*p <= 0x8b)) {
                            *(p - 1) = 0xbd;
                            (*p) += 0x2a;
                        }
                        else if (*p == 0x8c)
                            (*p) -= 0x09;
                        else if ((*p >= 0x98)
                            && (*p <= 0x99))
                            (*p) -= 0x08;
                        else if ((*p >= 0x9a)
                            && (*p <= 0x9b)) {
                            *(p - 1) = 0xbd;
                            (*p) += 0x1c;
                        }
                        else if ((*p >= 0xa8)
                            && (*p <= 0xa9))
                            (*p) -= 0x08;
                        else if ((*p >= 0xaa)
                            && (*p <= 0xab)) {
                            *(p - 1) = 0xbd;
                            (*p) += 0x10;
                        }
                        else if (*p == 0xac)
                            (*p) -= 0x07;
                        else if ((*p >= 0xb8)
                            && (*p <= 0xb9)) {
                            *(p - 1) = 0xbd;
                        }
                        else if ((*p >= 0xba)
                            && (*p <= 0xbb)) {
                            *(p - 1) = 0xbd;
                            (*p) += 0x02;
                        }
                        else if (*p == 0xbc)
                            (*p) -= 0x09;
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xe2: /* Three byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0xb0: /* Glagolitic */
                        if ((*p >= 0x80)
                            && (*p <= 0x8f)) {
                            (*p) += 0x30;
                        }
                        else if ((*p >= 0x90)
                            && (*p <= 0xae)) {
                            *pExtChar = 0xb1;
                            (*p) -= 0x10;
                        }
                        break;
                    case 0xb1: /* Latin ext */
                        switch (*p) {
                        case 0xa0:
                        case 0xa7:
                        case 0xa9:
                        case 0xab:
                        case 0xb2:
                        case 0xb5:
                            (*p)++; /* Next char is lwr */
                            break;
                        case 0xa2: /* Two byte small 0xc9 0xab */
                        case 0xa4: /* Two byte small 0xc9 0xbd */
                        case 0xad: /* Two byte small 0xc9 0x91 */
                        case 0xae: /* Two byte small 0xc9 0xb1 */
                        case 0xaf: /* Two byte small 0xc9 0x90 */
                        case 0xb0: /* Two byte small 0xc9 0x92 */
                        case 0xbe: /* Two byte small 0xc8 0xbf */
                        case 0xbf: /* Two byte small 0xc9 0x80 */
                            break;
                        case 0xa3:
                            *(p - 2) = 0xe1;
                            *(p - 1) = 0xb5;
                            *(p) = 0xbd;
                            break;
                        default:
                            break;
                        }
                        break;
                    case 0xb2: /* Coptic */
                        if ((*p >= 0x80)
                            && (*p <= 0xbf)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0xb3: /* Coptic */
                        if (((*p >= 0x80)
                            && (*p <= 0xa3)
                            && (!(*p % 2))) /* Even */
                            || (*p == 0xab)
                            || (*p == 0xad)
                            || (*p == 0xb2))
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0xb4: /* Georgian nuskhuri */
                        if (((*p >= 0x80)
                            && (*p <= 0xa5))
                            || (*p == 0xa7)
                            || (*p == 0xad)) {
                            *(p - 2) = 0xe1;
                            *(p - 1) = 0x83;
                            (*p) += 0x10;
                        }
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xea: /* Three byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0x99: /* Cyrillic */
                        if ((*p >= 0x80)
                            && (*p <= 0xad)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0x9a: /* Cyrillic */
                        if ((*p >= 0x80)
                            && (*p <= 0x9b)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0x9c: /* Latin ext */
                        if ((((*p >= 0xa2)
                            && (*p <= 0xaf))
                            || ((*p >= 0xb2)
                                && (*p <= 0xbf)))
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0x9d: /* Latin ext */
                        if ((((*p >= 0x80)
                            && (*p <= 0xaf))
                            && (!(*p % 2))) /* Even */
                            || (*p == 0xb9)
                            || (*p == 0xbb)
                            || (*p == 0xbe))
                            (*p)++; /* Next char is lwr */
                        else if (*p == 0xbd) {
                            *(p - 2) = 0xe1;
                            *(p - 1) = 0xb5;
                            *(p) = 0xb9;
                        }
                        break;
                    case 0x9e: /* Latin ext */
                        if (((((*p >= 0x80)
                            && (*p <= 0x87))
                            || ((*p >= 0x96)
                                && (*p <= 0xa9))
                            || ((*p >= 0xb4)
                                && (*p <= 0xbf)))
                            && (!(*p % 2))) /* Even */
                            || (*p == 0x8b)
                            || (*p == 0x90)
                            || (*p == 0x92))
                            (*p)++; /* Next char is lwr */
                        else if (*p == 0xb3) {
                            *(p - 2) = 0xea;
                            *(p - 1) = 0xad;
                            *(p) = 0x93;
                        }
                        /* case 0x8d: // Two byte small 0xc9 0xa5 */
                        /* case 0xaa: // Two byte small 0xc9 0xa6 */
                        /* case 0xab: // Two byte small 0xc9 0x9c */
                        /* case 0xac: // Two byte small 0xc9 0xa1 */
                        /* case 0xad: // Two byte small 0xc9 0xac */
                        /* case 0xae: // Two byte small 0xc9 0xaa */
                        /* case 0xb0: // Two byte small 0xca 0x9e */
                        /* case 0xb1: // Two byte small 0xca 0x87 */
                        /* case 0xb2: // Two byte small 0xca 0x9d */
                        break;
                    case 0x9f: /* Latin ext */
                        if ((*p == 0x82)
                            || (*p == 0x87)
                            || (*p == 0x89)
                            || (*p == 0xb5))
                            (*p)++; /* Next char is lwr */
                        else if (*p == 0x84) {
                            *(p - 2) = 0xea;
                            *(p - 1) = 0x9e;
                            *(p) = 0x94;
                        }
                        else if (*p == 0x86) {
                            *(p - 2) = 0xe1;
                            *(p - 1) = 0xb6;
                            *(p) = 0x8e;
                        }
                        /* case 0x85: // Two byte small 0xca 0x82 */
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xef: /* Three byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0xbc: /* Latin fullwidth */
                        if ((*p >= 0xa1)
                            && (*p <= 0xba)) {
                            *pExtChar = 0xbd;
                            (*p) -= 0x20;
                        }
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xf0: /* Four byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0x90:
                        pExtChar = p;
                        p++;
                        switch (*pExtChar) {
                        case 0x90: /* Deseret */
                            if ((*p >= 0x80)
                                && (*p <= 0x97)) {
                                (*p) += 0x28;
                            }
                            else if ((*p >= 0x98)
                                && (*p <= 0xa7)) {
                                *pExtChar = 0x91;
                                (*p) -= 0x18;
                            }
                            break;
                        case 0x92: /* Osage  */
                            if ((*p >= 0xb0)
                                && (*p <= 0xbf)) {
                                *pExtChar = 0x93;
                                (*p) -= 0x18;
                            }
                            break;
                        case 0x93: /* Osage  */
                            if ((*p >= 0x80)
                                && (*p <= 0x93))
                                (*p) += 0x28;
                            break;
                        case 0xb2: /* Old hungarian */
                            if ((*p >= 0x80)
                                && (*p <= 0xb2))
                                *pExtChar = 0xb3;
                            break;
                        default:
                            break;
                        }
                        break;
                    case 0x91:
                        pExtChar = p;
                        p++;
                        switch (*pExtChar) {
                        case 0xa2: /* Warang citi */
                            if ((*p >= 0xa0)
                                && (*p <= 0xbf)) {
                                *pExtChar = 0xa3;
                                (*p) -= 0x20;
                            }
                            break;
                        default:
                            break;
                        }
                        break;
                    case 0x96:
                        pExtChar = p;
                        p++;
                        switch (*pExtChar) {
                        case 0xb9: /* Medefaidrin */
                            if ((*p >= 0x80)
                                && (*p <= 0x9f)) {
                                (*p) += 0x20;
                            }
                            break;
                        default:
                            break;
                        }
                        break;
                    case 0x9E:
                        pExtChar = p;
                        p++;
                        switch (*pExtChar) {
                        case 0xA4: /* Adlam */
                            if ((*p >= 0x80)
                                && (*p <= 0x9d))
                                (*p) += 0x22;
                            else if ((*p >= 0x9e)
                                && (*p <= 0xa1)) {
                                *(pExtChar) = 0xa5;
                                (*p) -= 0x1e;
                            }
                            break;
                        default:
                            break;
                        }
                        break;
                    default:
                        break;
                    }
                    break;
                default:
                    break;
                }
                pExtChar = 0;
            }
            p++;
        }
    }
    return pString;
}
int StrnCiCmp(const char* s1, const char* s2, size_t ztCount)
{
    unsigned char* pStr1Low = 0;
    unsigned char* pStr2Low = 0;
    unsigned char* p1 = 0;
    unsigned char* p2 = 0;

    if (s1 && *s1 && s2 && *s2) {
        pStr1Low = (unsigned char*)calloc(strlen(s1) + 1, sizeof(unsigned char));
        if (pStr1Low) {
            pStr2Low = (unsigned char*)calloc(strlen(s2) + 1, sizeof(unsigned char));
            if (pStr2Low) {
                p1 = pStr1Low;
                p2 = pStr2Low;
                strcpy((char*)pStr1Low, s1);
                strcpy((char*)pStr2Low, s2);
                StrToLwrExt(pStr1Low);
                StrToLwrExt(pStr2Low);
                for (; ztCount--; p1++, p2++) {
                    int iDiff = *p1 - *p2;
                    if (iDiff != 0 || !*p1 || !*p2) {
                        free(pStr1Low);
                        free(pStr2Low);
                        return iDiff;
                    }
                }
                free(pStr1Low);
                free(pStr2Low);
                return 0;
            }
            free(pStr1Low);
            return (-1);
        }
        return (-1);
    }
    return (-1);
}
int StrCiCmp(const char* s1, const char* s2)
{
    return StrnCiCmp(s1, s2, (size_t)(-1));
}
char* StrCiStr(const char* s1, const char* s2)
{
    char* p = (char*)s1;
    size_t len = 0;

    if (s1 && *s1 && s2 && *s2) {
        len = strlen(s2);
        while (*p) {
            if (StrnCiCmp(p, s2, len) == 0)
                return (char*)p;
            p++;
        }
    }
    return (0);
}

Solution 3

Case-insensitive features like these are definitely needed in search facilities.

I have the same need as described above. UTF-8 is pretty smooth in most ways, but the upper and lower case situation is not that great. It looks like it fell off the to-do list along the way, even though it used to be one of the major topics on the list in such cases. I patched the IBM keyboard driver in 1984, before IBM shipped it but when copies were already available. I also patched DisplayWrite 1 and 3 (the PC-DOS word processor) before IBM wanted to ship them in Europe, and did an awful lot of conversion between the PC-DOS (CP850) and Windows (CP1252) code pages and the national EBCDIC code pages of IBM 3270 mainframe terminal systems. All of them had this case topic on the to-do list. All the national ASCII versions and the CP1252 Windows tables (but not PC-DOS CP850) kept a fixed offset of 0x20 between 0x40-0x5F and 0x60-0x7F for flipping between upper and lower case.

What to do about it?

tolower() and toupper() will not work on multi-byte UTF-8 strings outside US-ASCII, because they only operate on one byte at a time. A string-level solution would work, though, and there are solutions for just about everything else.
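
As a minimal sketch of the problem (assuming the default "C" locale; the helper name upperBytewise is made up for this illustration), applying std::toupper() byte by byte only touches the ASCII letters and leaves the bytes of a multi-byte character alone:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: apply std::toupper to every byte, the way one
// would for a plain ASCII string. The cast to unsigned char keeps the
// argument inside the range std::toupper accepts.
std::string upperBytewise(std::string s)
{
    for (char& c : s)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return s;
}
```

With the input "aó" (bytes 0x61 0xc3 0xb3) this should give 0x41 0xc3 0xb3 in the "C" locale, that is "Aó": the ó is untouched.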

Western Europeans are lucky

Unicode took ISO 8859-1 (Latin-1, which CP1252 matches for all letters) unchanged as its first additional block, the Latin-1 Supplement. This means the letters in it (UTF-8 0xC3 0xXX) can be shifted by 0x20 just like regular US-ASCII. Code sample below.
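
To illustrate the shift, here is a small sketch (encodeTwoByte is a hypothetical helper, not part of the code sample in this answer) that encodes a code point from U+0080..U+07FF as two UTF-8 bytes. For the Latin-1 letters, upper and lower case share the first byte 0xc3 and the second bytes differ by 0x20, just like the code points:

```cpp
#include <array>

// Hypothetical helper: encode a code point in U+0080..U+07FF as the
// two UTF-8 bytes 110xxxxx 10xxxxxx.
std::array<unsigned char, 2> encodeTwoByte(unsigned cp)
{
    std::array<unsigned char, 2> bytes = {
        static_cast<unsigned char>(0xC0 | (cp >> 6)),   // upper 5 bits
        static_cast<unsigned char>(0x80 | (cp & 0x3F))  // lower 6 bits
    };
    return bytes;
}
```

For example, Ó (U+00D3) encodes as 0xc3 0x93 and ó (U+00F3) as 0xc3 0xb3.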

Greeks, Russians, Icelanders and Eastern Europeans are not that lucky

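The D with stroke, Đ/đ (U+0110/U+0111), is simply missing from CP1252, so the 0x20 shift does not reach it. Note that the Icelandic eth, Ð/ð (U+00D0/U+00F0, the th sound of the word "the"), is a different letter: it is in Latin-1 and does shift by 0x20.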

The Greek, Russian and Eastern European 8-bit character sets (CP1253, CP1251 and CP1257) could have been used in the same way that the Latin CP1252 was used directly, and then plain shifting would have worked there too. Instead, the tables seem to have been filled pretty randomly (as in the PC-DOS 8-bit ASCII).

There is only one working solution, the same as for PC-DOS ASCII: make translation tables. I will do it by next Christmas (when I need it badly), but here is a hint on how to do it if someone else is in a hurry.

How to do solutions for the Greeks, Russians, Icelanders and Eastern Europeans

Make separate tables in the code for each first byte of the UTF-8 sequences used for Eastern European, Greek and Cyrillic letters. Fill each table with the second bytes of the letters at their own second-byte positions, then replace each uppercase entry with the second byte of the matching lowercase letter. Make another set of tables that goes the other way around.

Then work out which first byte belongs to which table. That way the code can select the right table, read the right position and get the upper or lower case character it needs. Finally, modify the letter case functions below (which I made for Latin-1) to use tables instead of shifting by 0x20 for those first bytes where tables are required. It will run smoothly; modern computers have no problem with the memory and processing power.
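
A rough sketch of such a table for one first byte, 0xD0 (Cyrillic U+0400..U+043F), could look like this. The names Utf8Pair, makeLowerTableD0 and lowerCyrillicD0 are invented for the example. Note one assumption that goes beyond the description above: the table stores the first byte as well as the second, because for part of the Cyrillic alphabet the lowercase letter has 0xD1 as its first byte. Uppercase letters under other first bytes would need their own tables.

```cpp
#include <array>
#include <cstddef>
#include <string>

// One table entry: the two UTF-8 bytes of the lowercase letter.
struct Utf8Pair { unsigned char lead; unsigned char trail; };

// Build the lowercase table for first byte 0xD0, indexed by (second byte - 0x80).
std::array<Utf8Pair, 64> makeLowerTableD0()
{
    std::array<Utf8Pair, 64> table{};
    for (int i = 0; i < 64; ++i) {
        const unsigned char trail = static_cast<unsigned char>(0x80 + i);
        Utf8Pair lower;
        lower.lead = 0xD0;   // default: already lowercase (U+0430..U+043F)
        lower.trail = trail;
        if (i < 0x10) {          // U+0400..U+040F -> U+0450..U+045F
            lower.lead = 0xD1;
            lower.trail = static_cast<unsigned char>(trail + 0x10);
        } else if (i < 0x20) {   // U+0410..U+041F -> U+0430..U+043F
            lower.trail = static_cast<unsigned char>(trail + 0x20);
        } else if (i < 0x30) {   // U+0420..U+042F -> U+0440..U+044F
            lower.lead = 0xD1;
            lower.trail = static_cast<unsigned char>(trail - 0x20);
        }
        table[i] = lower;
    }
    return table;
}

// Lowercase every two-byte sequence that starts with 0xD0; leave the rest alone.
std::string lowerCyrillicD0(std::string s)
{
    static const std::array<Utf8Pair, 64> table = makeLowerTableD0();
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        const unsigned char trail = static_cast<unsigned char>(s[i + 1]);
        if (lead == 0xD0 && trail >= 0x80 && trail <= 0xBF) {
            const Utf8Pair lower = table[trail - 0x80];
            s[i] = static_cast<char>(lower.lead);
            s[i + 1] = static_cast<char>(lower.trail);
            ++i; // skip the second byte that was just handled
        }
    }
    return s;
}
```

Calling lowerCyrillicD0 on the UTF-8 bytes of "ПРИВЕТ" should give the bytes of "привет"; note that Р and Т change their first byte from 0xD0 to 0xD1.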

UTF-8 letter case functions: Latin-1 samples

I believe this works, but I have only tried it briefly. It only works on the Latin-1 and US-ASCII parts of UTF-8.

unsigned char *StrToLwrUft8Latin1(unsigned char *pString)
{
    int afterC3 = 0; /* Non-zero when the previous byte was the 0xc3 lead byte */
    if (pString && *pString) {
        unsigned char *p = pString;
        while (*p) {
            if (afterC3) {
                /* 0xc3 0x80..0x9e are the capitals; 0x97 is the multiplication sign */
                if ((*p >= 0x80) && (*p <= 0x9e) && (*p != 0x97))
                    *p += 0x20;
                afterC3 = 0;
            }
            else if ((*p >= 'A') && (*p <= 'Z')) /* Letters only, not '@' or '[' */
                *p += 0x20;
            else if (*p == 0xc3)
                afterC3 = 1;
            p++;
        }
    }
    return pString;
}
unsigned char *StrToUprUft8Latin1(unsigned char *pString)
{
    int afterC3 = 0; /* Non-zero when the previous byte was the 0xc3 lead byte */
    if (pString && *pString) {
        unsigned char *p = pString;
        while (*p) {
            if (afterC3) {
                /* 0xc3 0xa0..0xbe are the small letters; 0xb7 is the division sign */
                if ((*p >= 0xa0) && (*p <= 0xbe) && (*p != 0xb7))
                    *p -= 0x20;
                afterC3 = 0;
            }
            else if ((*p >= 'a') && (*p <= 'z')) /* Letters only, not '`' or '{' */
                *p -= 0x20;
            else if (*p == 0xc3)
                afterC3 = 1;
            p++;
        }
    }
    return pString;
}
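int StrnCiCmpLatin1(const char *s1, const char *s2, size_t ztCount)
{
    int afterC3 = 0; /* Non-zero when the previous byte of both strings was 0xc3 */
    if (s1 && s2) {
        for (; ztCount--; s1++, s2++) {
            unsigned char c1 = (unsigned char)*s1;
            unsigned char c2 = (unsigned char)*s2;
            int iDiff;
            if (afterC3) {
                /* Fold the capitals 0xc3 0x80..0x9e (except 0x97) to lower case */
                if ((c1 >= 0x80) && (c1 <= 0x9e) && (c1 != 0x97)) c1 += 0x20;
                if ((c2 >= 0x80) && (c2 <= 0x9e) && (c2 != 0x97)) c2 += 0x20;
            }
            else {
                if ((c1 >= 'A') && (c1 <= 'Z')) c1 += 0x20;
                if ((c2 >= 'A') && (c2 <= 'Z')) c2 += 0x20;
            }
            iDiff = c1 - c2;
            if (iDiff != 0 || !c1) /* Difference found or both strings ended */
                return iDiff;
            afterC3 = (!afterC3 && (c1 == 0xc3));
        }
    }
    return 0;
}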
int StrCiCmpLatin1(const char *s1, const char *s2)
{
    return StrnCiCmpLatin1(s1, s2, (size_t)(-1));
}
char *StrCiStrLatin1(const char *s1, const char *s2)
{
    if (s1 && *s1 && s2 && *s2) {
        char *p = (char *)s1;
        size_t len = strlen(s2);
        while (*p) {
            if (StrnCiCmpLatin1(p, s2, len) == 0)
                return p;
            p++;
        }
    }
    return (0);
}

Solution 4

There are some examples on StackOverflow but they use wide character strings, and other answers say you shouldn't be using wide character strings for UTF-8.

The article linked within (utf8everywhere) and those answers apply to Windows. The C++ standard requires wchar_t to be wide enough to hold every character of the supported locales (in practice 32 bits on Linux), so it works perfectly fine alongside UTF-8 there. On Windows, wchar_t is 16 bits and holds UTF-16 code units, but if you're on Windows you have more problems than just that, if we're going to be honest (namely its horrifying API).

It also appears that this problem can be very "tricky" in that the output might be dependent upon the user's locale.

Not really. Set the locale inside the code. Some programs, like sort, don't work properly unless the locale is set in the shell, for example, so the onus is on the user.

I was expecting to just use something like std::toupper(), but the usage is really unclear to me because it seems like I'm not just converting one character at a time but an entire string.

The code example uses iterators. If you don't want to convert every character, don't.

Also, this Ideone example I put together seems to show that toupper() of 0xc3b3 is just 0xc3b3, which is an unexpected result. Calling setlocale to either UTF-8 or ISO8859-1 doesn't appear to change the outcome.

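You have undefined behavior. std::toupper() takes an int whose value must be representable as unsigned char (0 to 255 with 8-bit bytes) or equal to EOF, and 0xc3b3 is way outside that range.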

I'd love some guidance if you could shed some light on either what I'm doing wrong or why my question/premise is faulty!

This example works perfectly fine:

#include <clocale>   // std::setlocale
#include <cwctype>   // std::towupper
#include <iostream>
#include <string>

int main()
{
    std::setlocale(LC_CTYPE, "en_US.UTF-8"); // the locale will be the UTF-8 enabled English

    std::wstring str = L"óó";

    std::wcout << str << std::endl;

    for (std::wstring::iterator it = str.begin(); it != str.end(); ++it)
        *it = std::towupper(*it);

    std::wcout << str << std::endl;
}

Outputs: ÓÓ
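
For a UTF-8 encoded std::string, as in the question, the same idea can be sketched with the standard multibyte conversion functions. This is only an illustration under assumptions: the helper name upperUtf8ViaLocale is invented, the named locale must be installed, the call changes the global C locale, and it relies on wchar_t holding a full code point (so not on Windows).

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <cwctype>
#include <string>
#include <vector>

// Hypothetical helper: uppercase a UTF-8 std::string in place by going
// through wchar_t. Returns false if the locale is unavailable or the
// input is not valid in that locale's encoding.
bool upperUtf8ViaLocale(std::string& s, const char* localeName)
{
    if (!std::setlocale(LC_CTYPE, localeName))
        return false;

    // There are never more wide characters than bytes, plus the terminator.
    std::vector<wchar_t> wide(s.size() + 1);
    const std::size_t n = std::mbstowcs(wide.data(), s.c_str(), wide.size());
    if (n == static_cast<std::size_t>(-1))
        return false;

    for (std::size_t i = 0; i < n; ++i)
        wide[i] = static_cast<wchar_t>(std::towupper(wide[i]));

    // Each wide character needs at most MB_CUR_MAX bytes.
    std::vector<char> narrow(n * MB_CUR_MAX + 1);
    const std::size_t m = std::wcstombs(narrow.data(), wide.data(), narrow.size());
    if (m == static_cast<std::size_t>(-1))
        return false;

    s.assign(narrow.data(), m);
    return true;
}
```

A call such as upperUtf8ViaLocale(text, "en_US.UTF-8") is expected to turn the bytes 0xc3 0xb3 (ó) into 0xc3 0x93 (Ó) when that locale exists; when it returns false the string is left unchanged.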

Solution 5

This code is a set of verified UTF (UTF-8, UTF-16 and UTF-32) functions for lower/upper case conversion, case-insensitive comparison and strstr, with the processing done on UTF code points.

Download at: https://www.alphabet.se/download/UtfConv.c

The function set is:

// Utf 8
size_t StrLenUtf8(const Utf8Char* str);
int StrnCmpUtf8(const Utf8Char* Utf8s1, const Utf8Char* Utf8s2, size_t ztCount);
int StrCmpUtf8(const Utf8Char* Utf8s1, const Utf8Char* Utf8s2);
size_t CharLenUtf8(const Utf8Char* pUtf8);
Utf8Char* ForwardUtf8Chars(const Utf8Char* pUtf8, size_t ztForwardUtf8Chars);
size_t StrLenUtf32AsUtf8(const Utf32Char* pUtf32);
Utf8Char* Utf32ToUtf8(const Utf32Char* pUtf32);
Utf32Char* Utf8ToUtf32(const Utf8Char* pUtf8);
Utf16Char* Utf8ToUtf16(const Utf8Char* pUtf8);
Utf8Char* Utf8StrMakeUprUtf8Str(const Utf8Char* pUtf8);
Utf8Char* Utf8StrMakeLwrUtf8Str(const Utf8Char* pUtf8);
int StrnCiCmpUtf8(const Utf8Char* pUtf8s1, const Utf8Char* pUtf8s2, size_t ztCount);
int StrCiCmpUtf8(const Utf8Char* pUtf8s1, const Utf8Char* pUtf8s2);
Utf8Char* StrCiStrUtf8(const Utf8Char* pUtf8s1, const Utf8Char* pUtf8s2);

// Utf 16
size_t StrLenUtf16(const Utf16Char* str);
Utf16Char* StrCpyUtf16(Utf16Char* dest, const Utf16Char* src);
Utf16Char* StrCatUtf16(Utf16Char* dest, const Utf16Char* src);
int StrnCmpUtf16(const Utf16Char* Utf16s1, const Utf16Char* Utf16s2, size_t ztCount);
int StrCmpUtf16(const Utf16Char* Utf16s1, const Utf16Char* Utf16s2);
size_t CharLenUtf16(const Utf16Char* pUtf16);
Utf16Char* ForwardUtf16Chars(const Utf16Char* pUtf16, size_t ztForwardUtf16Chars);
size_t StrLenUtf32AsUtf16(const Utf32Char* pUtf32);
Utf16Char* Utf32ToUtf16(const Utf32Char* pUtf32);
Utf32Char* Utf16ToUtf32(const Utf16Char* pUtf16);
Utf8Char* Utf16ToUtf8(const Utf16Char* pUtf16);
Utf16Char* Utf16StrMakeUprUtf16Str(const Utf16Char* pUtf16);
Utf16Char* Utf16StrMakeLwrUtf16Str(const Utf16Char* pUtf16);
int StrnCiCmpUtf16(const Utf16Char* pUtf16s1, const Utf16Char* pUtf16s2, size_t ztCount);
int StrCiCmpUtf16(const Utf16Char* pUtf16s1, const Utf16Char* pUtf16s2);
Utf16Char* StrCiStrUtf16(const Utf16Char* pUtf16s1, const Utf16Char* pUtf16s2);

// Utf 32
size_t StrLenUtf32(const Utf32Char* str);
Utf32Char* StrCpyUtf32(Utf32Char* dest, const Utf32Char* src);
Utf32Char* StrCatUtf32(Utf32Char* dest, const Utf32Char* src);
int StrnCmpUtf32(const Utf32Char* Utf32s1, const Utf32Char* Utf32s2, size_t ztCount);
int StrCmpUtf32(const Utf32Char* Utf32s1, const Utf32Char* Utf32s2);
Utf32Char* StrToUprUtf32(Utf32Char* pUtf32);
Utf32Char* StrToLwrUtf32(Utf32Char* pUtf32);
int StrnCiCmpUtf32(const Utf32Char* Utf32s1, const Utf32Char* Utf32s2, size_t ztCount);
int StrCiCmpUtf32(const Utf32Char* Utf32s1, const Utf32Char* Utf32s2);
Utf32Char* StrCiStrUtf32(const Utf32Char* Utf32s1, const Utf32Char* Utf32s2);

After reading the comments on the ”This code is a carefully tested UTF8 case conversion/case insensitive cmp.” answer, and other comments on the answers here, I made a solution that works as follows:

  • Converts UTF8 and UTF16 strings to UTF32 (UTF code points)
  • Does the processing in UTF32, converting 1361 characters between lower and upper case in both directions
  • Converts the UTF32 strings back to UTF8 and UTF16 strings
    • With sync control, so that it still points at the right character when the converted string has the same number of characters but a different number of bytes
  • Works with any data type definitions (they differ between operating systems)
    • Set by define statements at the top of the code
    • Utf8Char is a signed/unsigned type of 1 byte, 8 bits (unsigned char is the default)
    • Utf16Char is a signed/unsigned type of at least 2 bytes (wchar_t is the default)
    • Utf32Char is a signed/unsigned type of at least 4 bytes (uint32_t is the default)
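
The decode, process, encode steps listed above can be sketched as follows. This is not the code from UtfConv.c: the function names are invented, the input is assumed to be well-formed UTF-8, and the case map is a tiny stand-in (ASCII, the Latin-1 letters and one pair, U+0250 to U+2C6F, whose UTF-8 encodings have different lengths).

```cpp
#include <cstddef>
#include <string>

// Decode well-formed UTF-8 into code points (no validation in this sketch).
std::u32string decodeUtf8(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        const int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : (b >= 0xC0) ? 1 : 0;
        char32_t cp = (extra == 0) ? b : (b & (0x3F >> extra));
        for (int k = 1; k <= extra && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Encode code points back into UTF-8.
std::string encodeUtf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Stand-in case map; a real one would cover the full Unicode case tables.
char32_t toUpperCodePoint(char32_t cp)
{
    if (cp >= U'a' && cp <= U'z') return cp - 0x20;
    if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7) return cp - 0x20; // Latin-1 letters
    if (cp == 0x0250) return 0x2C6F; // 2 bytes in UTF-8 become 3 bytes
    return cp;
}

// Decode, map each code point, encode again.
std::string upperViaUtf32(const std::string& utf8)
{
    std::u32string cps = decodeUtf8(utf8);
    for (char32_t& cp : cps)
        cp = toUpperCodePoint(cp);
    return encodeUtf8(cps);
}
```

Because the case mapping works on whole code points, a result with a different byte length needs no special handling: upperViaUtf32 maps the two bytes 0xc9 0x90 (U+0250) to the three bytes 0xe2 0xb1 0xaf (U+2C6F).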

The advantages over the ”This code is a carefully tested UTF8 case conversion/case insensitive cmp.” answer are:

  • Proper programming style (string processing is separated from encoding)
  • It is the full package, a complete UTF string programming tool kit
  • It also correctly handles case pairs whose UTF8 or UTF16 encodings have different numbers of bytes
  • Much better readability of the source code
  • Less risk of bugs, since the conversions are written by a program that reads the UTF definition table
  • Applicable to UTF16 and UTF32 encodings as well

The disadvantages are:

  • There are many switch cases, which is worse for performance (not a big deal these days unless there are huge volumes of data to convert)
  • The code is much longer; it does not fit in a Stack Overflow answer, so a web link has to be used

Author: aardvarkk

Updated on June 15, 2022

Comments

  • aardvarkk
    aardvarkk almost 2 years

    Let's imagine I have a UTF-8 encoded std::string containing the following:

    óó

    and I'd like to convert it to the following:

    ÓÓ

    Ideally I want the uppercase/lowercase approach I'm using to be generic across all of UTF-8. If that's even possible.

    The original byte sequence in the string is 0xc3b3c3b3 (two bytes per character, and two instances of ó) and I'd like the output to be 0xc393c393 (two instances of Ó). There are some examples on StackOverflow but they use wide character strings, and other answers say you shouldn't be using wide character strings for UTF-8. It also appears that this problem can be very "tricky" in that the output might be dependent upon the user's locale.

    I was expecting to just use something like std::toupper(), but the usage is really unclear to me because it seems like I'm not just converting one character at a time but an entire string. Also, this Ideone example I put together seems to show that toupper() of 0xc3b3 is just 0xc3b3, which is an unexpected result. Calling setlocale to either UTF-8 or ISO8859-1 doesn't appear to change the outcome.

    I'd love some guidance if you could shed some light on either what I'm doing wrong or why my question/premise is faulty!