How to convert an instance of std::string to lower case

1,119,787

Solution 1

Adapted from Not So Frequently Asked Questions:

#include <algorithm>
#include <cctype>
#include <string>

std::string data = "Abc";
std::transform(data.begin(), data.end(), data.begin(),
    [](unsigned char c){ return std::tolower(c); });

You're really not going to get away without iterating through each character. There's no way to know whether the character is lowercase or uppercase otherwise.
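If the source string is const, or you simply want to keep the original, the same std::transform call can write into a new string instead. The following is a minimal sketch; the helper name lowercase_copy is made up for illustration and is not part of the standard library.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>

// Hypothetical helper (not a standard function): returns a lower-cased
// copy and leaves the source string untouched.
std::string lowercase_copy(const std::string& src) {
    std::string out;
    out.reserve(src.size());  // one output char per input char
    std::transform(src.begin(), src.end(), std::back_inserter(out),
        // Go through unsigned char so std::tolower never sees a negative value.
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}
```

Calling lowercase_copy("Abc") leaves the argument alone and returns a new string. The same single-byte limitations as the in-place version apply.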

If you really hate tolower(), here's a specialized ASCII-only alternative that I don't recommend you use:

char asciitolower(char in) {
    if (in <= 'Z' && in >= 'A')
        return in - ('Z' - 'z');
    return in;
}

std::transform(data.begin(), data.end(), data.begin(), asciitolower);

Be aware that tolower() can only do a per-single-byte-character substitution, which is ill-fitting for many scripts, especially when using a multi-byte encoding like UTF-8.
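To make the UTF-8 limitation concrete, here is a small sketch. It assumes the program runs in the default "C" locale, and bytewise_lower is a made-up helper name. The capital É is stored as the two bytes 0xC3 0x89, and neither byte is an uppercase letter as far as tolower() is concerned.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: applies std::tolower to every byte of the string.
std::string bytewise_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// "ÉCOLE" encoded as UTF-8. The literal is split so that the 'C' is not
// parsed as part of the \x89 escape.
inline std::string ecole_upper_utf8() { return "\xC3\x89" "COLE"; }
```

In the "C" locale, bytewise_lower(ecole_upper_utf8()) lower-cases only the ASCII letters, so the result is "École" (still starting with 0xC3 0x89) rather than "école" (which would start with 0xC3 0xA9).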

Solution 2

Boost provides a string algorithm for this:

#include <boost/algorithm/string.hpp>

std::string str = "HELLO, WORLD!";
boost::algorithm::to_lower(str); // modifies str

Or, for non-in-place:

#include <boost/algorithm/string.hpp>

const std::string str = "HELLO, WORLD!";
const std::string lower_str = boost::algorithm::to_lower_copy(str);

Solution 3

tl;dr

Use the ICU library. If you don't, your conversion routine will break silently on cases you are probably not even aware of existing.


First you have to answer a question: What is the encoding of your std::string? Is it ISO-8859-1? Or perhaps ISO-8859-8? Or Windows Codepage 1252? Does whatever you're using to convert upper-to-lowercase know that? (Or does it fail miserably for characters over 0x7f?)

If you are using UTF-8 (the only sane choice among the 8-bit encodings) with std::string as container, you are already deceiving yourself if you believe you are still in control of things. You are storing a multibyte character sequence in a container that is not aware of the multibyte concept, and neither are most of the operations you can perform on it! Even something as simple as .substr() could result in invalid (sub-) strings because you split in the middle of a multibyte sequence.
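The .substr() problem is easy to reproduce. The sketch below assumes the input was valid UTF-8 before it was cut; ends_mid_sequence is a made-up helper that only checks whether the last character of a string is complete, not whether the whole string is valid.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the string stops in the middle of a UTF-8
// multibyte sequence. Assumes the text was valid UTF-8 before truncation.
bool ends_mid_sequence(const std::string& s) {
    std::size_t i = s.size();
    std::size_t continuation = 0;
    // Walk back over at most three trailing continuation bytes (10xxxxxx).
    while (i > 0 && continuation < 3 &&
           (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++continuation;
    }
    if (i == 0) return continuation > 0;  // continuation bytes with no lead byte

    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    std::size_t expected = 0;  // continuation bytes announced by the lead byte
    if ((lead & 0xE0) == 0xC0) expected = 1;       // 2-byte sequence
    else if ((lead & 0xF0) == 0xE0) expected = 2;  // 3-byte sequence
    else if ((lead & 0xF8) == 0xF0) expected = 3;  // 4-byte sequence
    return continuation < expected;
}
```

With std::string s = "Stra\xC3\x9F" "e" (that is, "Straße"), s.substr(0, 5) ends with the lone lead byte 0xC3 and the helper reports true, while s.substr(0, 6) and s.substr(0, 4) are cut on character boundaries. std::string itself happily produces all three.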

As soon as you try something like std::toupper( 'ß' ), or std::tolower( 'Σ' ) in any encoding, you are in trouble. Because 1), the standard only ever operates on one character at a time, so it simply cannot turn ß into SS as would be correct. And 2), the standard only ever operates on one character at a time, so it cannot decide whether Σ is in the middle of a word (where σ would be correct), or at the end (ς). Another example would be std::tolower( 'I' ), which should yield different results depending on the locale -- virtually everywhere you would expect i, but in Turkey ı (LATIN SMALL LETTER DOTLESS I) is the correct answer (which, again, is more than one byte in UTF-8 encoding).
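The sigma case can be sketched without any library. Everything below is illustrative only: both function names are made up, only the one letter Σ is handled, and "last code point of the string" is used as a deliberately crude stand-in for the real Unicode final-sigma rule. The point is just that the first function, having the shape of std::tolower, cannot see its neighbours.

```cpp
#include <cstddef>
#include <string>

// Per-character mapping: sees one code point at a time, so every
// capital sigma (U+03A3) gets the same answer, small sigma (U+03C3).
char32_t lower_sigma_no_context(char32_t c) {
    return c == U'\u03A3' ? U'\u03C3' : c;
}

// Context-aware mapping for this one letter. Simplification (assumption):
// a sigma counts as "final" only if it is the last code point of the
// string and is preceded by at least one other code point.
std::u32string lower_sigma_with_context(std::u32string s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == U'\u03A3') {
            const bool is_final = (i + 1 == s.size()) && i > 0;
            s[i] = is_final ? U'\u03C2' : U'\u03C3';  // final sigma or sigma
        }
    }
    return s;
}
```

For the input U"\u03A3\u03A3" ("ΣΣ"), mapping each element with the first function yields "σσ", while the second yields "σς". A real implementation has to look at word boundaries and surrounding letters, which is the kind of work ICU does for you.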

So, any case conversion that works on a character at a time, or worse, a byte at a time, is broken by design. This includes all the std:: variants in existence at this time.

Then there is the point that the standard library, for what it is capable of doing, depends on which locales are supported on the machine your software is running on... and what do you do if your target locale is not among those supported on your client's machine?

So what you are really looking for is a string class that is capable of dealing with all this correctly, and that is not any of the std::basic_string<> variants.

(C++11 note: std::u16string and std::u32string are better, but still not perfect. C++20 brought std::u8string, but all these do is specify the encoding. In many other respects they still remain ignorant of Unicode mechanics, like normalization, collation, ...)

While Boost looks nice API-wise, Boost.Locale is basically a wrapper around ICU, and that only holds if Boost is compiled with ICU support. If it isn't, Boost.Locale is limited to the locale support compiled for the standard library.

And believe me, getting Boost to compile with ICU can be a real pain sometimes. (There are no pre-compiled binaries for Windows that include ICU, so you'd have to supply them together with your application, and that opens a whole new can of worms...)

So personally I would recommend getting full Unicode support straight from the horse's mouth and using the ICU library directly:

#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/locid.h>

#include <iostream>

int main()
{
    /*                          "Odysseus" */
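    // Note: since C++20 a u8"" literal has type char8_t const[], so under
    // C++20 and later this initialization needs a cast or a plain "" literal.
    char const * someString = u8"ΟΔΥΣΣΕΥΣ";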
    icu::UnicodeString someUString( someString, "UTF-8" );
    // Setting the locale explicitly here for completeness.
    // Usually you would use the user-specified system locale,
    // which *does* make a difference (see ı vs. i above).
    std::cout << someUString.toLower( "el_GR" ) << "\n";
    std::cout << someUString.toUpper( "el_GR" ) << "\n";
    return 0;
}

Compile (with G++ in this example):

g++ -Wall example.cpp -licuuc -licuio

This gives:

οδυσσευς
ΟΔΥΣΣΕΥΣ

Note the Σ<->σ conversion in the middle of the word, and the Σ<->ς conversion at the end of the word. No <algorithm>-based solution can give you that.

Solution 4

Using C++11's range-based for loop, simpler code would be:

#include <iostream>       // std::cout
#include <string>         // std::string
#include <locale>         // std::locale, std::tolower

int main()
{
    std::locale loc;
    std::string str = "Test String.\n";

    for (auto elem : str)
        std::cout << std::tolower(elem, loc);
}
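Note that the loop above only prints the lower-cased characters; str itself is unchanged, because auto elem is a copy of each character. If you want to modify the string, iterate by reference. A minimal sketch, with lower_in_place being a made-up helper name:

```cpp
#include <locale>
#include <string>

// Hypothetical helper: lower-cases str in place using the given locale
// (by default a copy of the current global locale).
void lower_in_place(std::string& str, const std::locale& loc = std::locale()) {
    for (auto& elem : str)               // reference, so the string is modified
        elem = std::tolower(elem, loc);  // locale-aware overload from <locale>
}
```

After std::string s = "Test String.\n"; lower_in_place(s); the string holds "test string.\n". This is still a one-char-at-a-time conversion, so it carries the same limitations for multi-byte encodings.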

Solution 5

If the string contains UTF-8 characters outside of the ASCII range, then boost::algorithm::to_lower will not convert those. Better use boost::locale::to_lower when UTF-8 is involved. See http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/conversions.html

Author by

uss


Updated on February 04, 2022

Comments

  • uss
    uss about 2 years

    I want to convert a std::string to lowercase. I am aware of the function tolower(). However, in the past I have had issues with this function and it is hardly ideal anyway as using it with a std::string would require iterating over each character.

    Is there an alternative which works 100% of the time?

  • tenstormavi
    tenstormavi over 15 years
    That is amazing, I've always wondered what the best way to do it is. I had no idea you could use std::transform. :)
  • Stefan Mai
    Stefan Mai over 15 years
    uberjumper: There's actually a whole lot of overhead associated with the STL calls, especially for small"ish" strings. Solutions using a for loop and tolower are probably much faster.
  • eq-
    eq- over 12 years
    (Old it may be, the algorithms in question have changed little) @Stefan Mai: What kind of "whole lot of overhead" is there in calling STL algorithms? The functions are rather lean (i.e. simple for loops) and often inlined as you rarely have many calls to the same function with the same template parameters in the same compile unit.
  • Stefan Mai
    Stefan Mai over 12 years
    @eq Fair point, my benchmarks agree with you when compiling with -O3 (though the STL actually outperforms the more hand-tuned code so I'm wondering whether the compiler is pulling some tricks). Debugging STL code is still a bear though ;).
  • incises
    incises over 10 years
    However, on a French machine, this program doesn't convert the non-ASCII characters used in French. For instance, the string 'Test String123. É Ï\n' is converted to 'test string123. É Ï\n', although É and Ï have the lowercase counterparts 'é' and 'ï' in French. It seems that no solution for this was provided by the other messages in this thread.
  • user1095108
    user1095108 over 10 years
    I think you need to set a proper locale for that.
  • Michal W
    Michal W over 10 years
    This non-portable solution could be faster. You can avoid the branch this way: inChar |= 0x20. I think it is the fastest way to convert ASCII upper to lower. If you want to convert lower to upper, then: inChar &= ~0x20.
  • Stefan Mai
    Stefan Mai about 10 years
    @MichalW This works if you have only letters, which isn't always the case. If you're in that realm, you can probably do even better by using bitmasks on longs -- take on 8 characters at a time ;)
  • Rag
    Rag about 10 years
    Every time you assume characters are ASCII, God kills a kitten. :(
  • juanchopanza
    juanchopanza almost 10 years
    Your first example potentially has undefined behaviour (passing char to ::tolower(int).) You need to ensure you don't pass a negative value.
  • Cheers and hth. - Alf
    Cheers and hth. - Alf almost 10 years
    -1 this use of ::tolower may well crash, it's UB for non-ASCII input.
  • pavon
    pavon almost 10 years
    While this would be the canonical way to do this in a sane world, it has too many problems to recommend it. First, tolower from ctype.h doesn't work with Unicode. Second, locale.h, which is included by many of the other std library headers, defines a conflicting tolower that causes headaches; see stackoverflow.com/q/5539249/339595. It is best to use std::locale or boost::locale::to_lower as other answers suggest.
  • Gábor Buella
    Gábor Buella over 9 years
    "There is a way to convert upper case to lower WITHOUT doing if tests" ever heard of lookup tables?
  • DevSolar
    DevSolar about 9 years
    Fails for non-ASCII-7.
  • lmat - Reinstate Monica
    lmat - Reinstate Monica about 9 years
    This is the correct answer in the general case. The standard gives nothing for handling anything except "ASCII" except lies and deception. It makes you think you can maybe deal with maybe UTF-16, but you can't. As this answer says, you cannot get the proper character-length (not byte-length) of a UTF-16 string without doing your own unicode handling. If you have to deal with real text, use ICU. Thanks, @DevSolar
  • Purefan
    Purefan over 8 years
    This did not resize Ä into ä for me
  • NathanTempelman
    NathanTempelman about 8 years
    ::towlower if you're being international/using wide chars
  • BugShotGG
    BugShotGG about 8 years
    @MichalW Hey, can you explain what you wrote there? Also, why do we use :: in ::tolower ?
  • Shital Shah
    Shital Shah almost 8 years
    Is ICU available by default on Ubuntu/Windows, or does it need to be installed separately? Also, how about this answer: stackoverflow.com/a/35075839/207661?
  • Luis Paulo
    Luis Paulo almost 8 years
    @StefanMai Hi. Why is the "::" needed before "tolower"? I don't understand that.
  • Dan
    Dan almost 8 years
    Note that this works for Unicode if you're using a std::u32string and your C locale is compatible with Unicode.
  • Charles Ofria
    Charles Ofria almost 8 years
    The :: is needed before tolower to indicate that it is in the outermost namespace. If you use this code in another namespace, there may be a different (possibly unrelated) definition of tolower which would end up being preferentially selected without the ::.
  • quazar
    quazar over 7 years
    Nice, as long as you can convert the characters in place. What if your source string is const? That seems to make it a bit more messy (e.g. it doesn't look like you can use f.tolower() ), since you need to put the characters in a new string. Would you use transform() and something like std::bind1st( std::mem_fun() ) for the operator?
  • Sameer
    Sameer over 7 years
    For a const string, we can just make a local copy and then convert it in place.
  • Alexis Wilke
    Alexis Wilke over 7 years
    @incises, since then someone posted an answer about ICU, and that's certainly the way to go. It is easier than most other solutions that would attempt to understand the locale.
  • quazar
    quazar over 7 years
    Yeah, though, making a copy adds more overhead.
  • chili
    chili about 7 years
    Could also use a back inserter iterator here instead of manual resize.
  • chili
    chili about 7 years
    You could use std::transform with the version of ctype::tolower that does not take pointers. Use a back inserter iterator adapter and you don't even need to worry about pre-sizing your output string.
  • Arne Vogel
    Arne Vogel almost 7 years
    Great, especially because in libstdc++'s tolower with locale parameter, the implicit call to use_facet appears to be a performance bottleneck. One of my coworkers has achieved a several 100% speed increase by replacing boost::iequals (which has this problem) with a version where use_facet is only called once outside of the loop.
  • masaers
    masaers almost 7 years
    icu::UnicodeString::length() is technically also lying to you (although less frequently), as it reports the number of 16bit code units rather than the number of code points. ;-)
  • DevSolar
    DevSolar almost 7 years
    @masaers: To be completely fair, with things like combining characters, zero-width joiners and right-to-left markers, the number of code points is rather meaningless. I will remove that remark.
  • masaers
    masaers almost 7 years
    @DevSolar Agreed! The concept of length is rather meaningless on text (we could add ligatures to the list of offenders). That said, since people are used to tabs and control chars taking up one length unit, code points would be the more intuitive measure. Oh, and thanks for giving the correct answer, sad to see it so far down :-(
  • kayleeFrye_onDeck
    kayleeFrye_onDeck almost 7 years
    I'd prefer to not use external libraries when possible, personally.
  • 8.8.8.8
    8.8.8.8 over 6 years
    std::transform(data.begin(), data.end(), data.begin(), easytolower); is dangerous, since the behavior of std::tolower is undefined if the input is not representable as unsigned char and is not equal to EOF.
  • Clearer
    Clearer over 6 years
    I'm downvoting this for giving macros when a perfectly good solution exist -- you even give those solutions.
  • Volomike
    Volomike over 6 years
    The macro technique means less typing of code for something that one would commonly use a lot in programming. Why not use that? Otherwise, why have macros at all?
  • Clearer
    Clearer over 6 years
    Macros are a legacy from C that's being worked hard on to get rid of. If you want to reduce the amount of typing, use a function or a lambda. void strtoupper(std::string& x) { std::transform (x.begin(), x.end(), x.begin(), ::toupper); }
  • Volomike
    Volomike over 6 years
    @Clearer As I want to be a better coder, can you provide me any ANSI doc links where any ANSI C++ committees say something to the effect of, "We need to call a meeting to get rid of macros out of C++"? Or some other roadmap?
  • Clearer
    Clearer over 6 years
    No, I can't. Bjarne's stance on the topic has been made pretty clear on several occasions though. Besides, there are plenty of reasons to not use macros in C as well as C++. x could be a valid expression, that just happens to compile correctly but will give completely bogus results because of the macros.
  • T.E.D.
    T.E.D. over 6 years
    @BrianGordon - But its much easier, and there really are way too many cats in the world already.
  • Roland Illig
    Roland Illig over 6 years
    Undefined behavior for negative chars.
  • Cort Ammon
    Cort Ammon over 6 years
    @BrianGordon That is blatantly false, as proven by the fact that there are still kittens in the world! =)
  • TypicalHog
    TypicalHog about 6 years
    What makes the 2nd solution non-portable? Can I just do this? pastebin.com/MPRMpQJS
  • bfontaine
    bfontaine about 6 years
    Please clarify where did you copy your answer from.
  • Alnitak
    Alnitak almost 6 years
    @BrianGordon there are also cases when you know that the input is ASCII (e.g. the wire format of domain names).
  • Rag
    Rag almost 6 years
    @Alnitak I didn't know that. How does DNS handle international domain names which can be in unicode?
  • Alnitak
    Alnitak almost 6 years
    @BrianGordon applications have to convert them into an all-ASCII encoding called "Punycode" (RFC 3492)
  • Aquarius Power
    Aquarius Power almost 6 years
    good macros! @Clearer macros help us so much... I expect they never get rid of it.
  • Clearer
    Clearer almost 6 years
    @AquariusPower I disagree. I have yet to see a macro that could not have been done better as a template or a lambda.
  • DevSolar
    DevSolar over 5 years
    @TypicalHog: Because there is no guarantee that 'A' to 'Z' is a continuous range (EBCDIC); but more importantly because there are letters outside that range ('Ü', 'á', ...). It's very, very sad that the authors prefer to harvest more upvotes for answers with non-portable solutions instead of properly pointing out their shortcomings...
  • Violet Giraffe
    Violet Giraffe over 5 years
    @DevSolar: easytolower seems a perfectly valid solution for latin ASCII symbols to me. Going to use it for normalizing HTML tag names.
  • Pavel P
    Pavel P over 5 years
    @Cheersandhth.-Alf c99 doesn't mention that it's UB: it either returns lower char, or unmodified. std::tolower, however, mentions ub
  • Deduplicator
    Deduplicator almost 5 years
    @L.F. I fixed your fix.
  • L. F.
    L. F. almost 5 years
    @Deduplicator To be honest, I have always had trouble understanding why the char has to be converted to unsigned char first. Isn't the value of a (signed) char supposed to be nonnegative, anyway? What is the point of tolowering a negative char? I guess I am missing the point, so would you mind explaining it to me a little bit please :)
  • Deduplicator
    Deduplicator almost 5 years
    @L.F. No, char can be analogous to signed char, and a signed char can be negative. tolower only accepts unsigned char and -1. Anything outside its domain is UB, and you don't want to conflate with -1 either. While all members of the basic execution character set are non-negative, that does not necessarily hold for the (complete) execution character set. See the current draft.
  • L. F.
    L. F. almost 5 years
    @Deduplicator Thank you! I didn't know a char can validly be negative. But then, doesn't converting to unsigned char just change the value?
  • Deduplicator
    Deduplicator almost 5 years
    @L.F. char -> unsigned char (value-preserving, modulo 2**CHAR_BIT) -> implicit to int (value-preserving). Of course, if sizeof(int) == 1, things pretty much fall apart.
  • L. F.
    L. F. almost 5 years
    @Deduplicator OK ... I think I missed that ... Then the int is converted to char, I think, so the resulting value is implementation-defined before C++20 and guaranteed to be the original value since C++20?
  • Deduplicator
    Deduplicator almost 5 years
    @L.F. Converting the result from tolower() (int) back to char is also an interesting story, yes.
  • JPhi1618
    JPhi1618 over 4 years
    I don't understand why the tolower here is wrapped in a lambda rather than just passing it to transform on its own.
  • L. F.
    L. F. about 4 years
    @JPhi1618 1) to make sure that the character is first converted to unsigned char (see Deduplicator's comments above); 2) to enable overload resolution to select the int tolower( int ch ); overload defined in <cctype> instead of the template< class charT > charT tolower( charT ch, const locale& loc ); overload defined in <locale>.
  • Contango
    Contango about 4 years
    Modern CPUs are bottlenecked in memory not CPU. Benchmarking would be interesting.
  • Juv
    Juv about 4 years
    This is what I needed. I just used the towlower for wide characters which supports the UTF-16.
  • CCJ
    CCJ almost 4 years
    happily coding in Java and the time comes to switch over to a CPP module... comes along a simple string case issue Me: "I'll just look up the std::string toLower() or whatever the standard has for normalizing text case... Hmm, I wonder how they handle all the encoding and localization complexities a 'simple' task like that could entail when std::string is just raw text data?" finds this question... sad requiring that ingest data follows a case convention noises
  • Deduplicator
    Deduplicator over 3 years
    Actually, std::string not being aware that it contains text in a multi-byte character-encoding is a feature, not a bug. It's the only sane way to do it, which is why just about everyone does it. Not having proper standard apis for handling anything but basic text from days gone by which never really were at all is a problem though, yes. It would have to be optional even in a hosted environment though, as it is quite hefty, and there are many cases where it isn't needed.
  • DevSolar
    DevSolar over 3 years
    @Deduplicator: Sorry, but that's just dodging it in all possible ways. There are standards (Unicode), there are quasi-standard APIs for handling it (ICU), and if your intention is to write code that properly converts text to lowercase, unless you can guarantee your code will only ever see ASCII-7 (which would be a rather special case), all the other "solutions" here are 80--20 at best.
  • Deduplicator
    Deduplicator over 3 years
    That is why there should be such standard APIs. Doesn't negate the fact that much string-manipulation is best done ignoring all but it being a sequence of code-units. And that many use-cases never need anything more sophisticated.
  • DevSolar
    DevSolar over 3 years
    @Deduplicator And that standard API is currently the ICU library, which is what this answer is about.
  • DevSolar
    DevSolar about 3 years
    @Deduplicator I heard that std::text is underway, perhaps even in time for C++23. Let's not give up all hope yet.
  • Rodrigo
    Rodrigo about 3 years
    I guess it won't work for UTF-8, will it?
  • prehistoricpenguin
    prehistoricpenguin almost 3 years
    This function is slow, shouldn't be used in real-life projects.
  • prehistoricpenguin
    prehistoricpenguin almost 3 years
    This is pretty slow, see this benchmark: godbolt.org/z/neM5jsva1
  • Velkan
    Velkan over 2 years
    A working example?
  • Dmitry Grigoryev
    Dmitry Grigoryev over 2 years
    This is plain wrong: if you check the documentation, you will see that std::tolower cannot work with char, it only supports unsigned char. So this code is UB if str contains characters outside of 0x00-0x7F.
  • Dmitry Grigoryev
    Dmitry Grigoryev over 2 years
    This won't work in Windows where you'd have to call std::locale("English_Unites States.UTF8").
  • JadeSpy
    JadeSpy about 2 years
    I don't think you need to wrap std::tolower in a lambda.
  • Mayou36
    Mayou36 about 2 years
    @prehistoricpenguin slow? Well, slow is to debug code because your own implementation has a bug because it was more complicated than to just call the boost library ;) If the code is critical, like called a lot and provides a bottleneck, then, well, it can be worth to think about slowness
  • Kiruahxh
    Kiruahxh almost 2 years
    icu::UnicodeString seem to be a good class. QString also can do the job. However it is a pain to use in big programs with many libraries. I hope std::text will be a real thing soon