How to convert an instance of std::string to lower case
Solution 1
Adapted from Not So Frequently Asked Questions:
#include <algorithm>
#include <cctype>
#include <string>
std::string data = "Abc";
std::transform(data.begin(), data.end(), data.begin(),
    [](unsigned char c){ return std::tolower(c); });
You're really not going to get away without iterating through each character. There's no way to know whether the character is lowercase or uppercase otherwise.
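If the source string is const, or you simply want to keep the original, the same call can be wrapped in a function that takes its argument by value. This is a minimal sketch; the name lowercase_copy is made up for illustration and is not part of any library:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not from the original answer): returns a lowered copy
// and leaves the caller's string untouched. Like the snippet above, it only
// converts single-byte characters known to the current C locale.
std::string lowercase_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}
```

Calling lowercase_copy(data) with data = "Abc" should yield "abc" while data itself stays unchanged.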
If you really hate tolower(), here's a specialized ASCII-only alternative that I don't recommend you use:
char asciitolower(char in) {
    if (in <= 'Z' && in >= 'A')
        return in - ('Z' - 'z');
    return in;
}
std::transform(data.begin(), data.end(), data.begin(), asciitolower);
Be aware that tolower() can only do a per-single-byte-character substitution, which is ill-fitting for many scripts, especially if using a multi-byte encoding like UTF-8.
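To make that limitation concrete, here is a small sketch that applies the per-byte approach to UTF-8 text. The helper name bytewise_lower is an assumption for illustration, and the behaviour described assumes the program runs in the default "C" locale:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: the same per-byte std::tolower transform as above.
std::string bytewise_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}
```

In UTF-8, É is the two bytes 0xC3 0x89 and é is 0xC3 0xA9. Feeding "\xC3\x89" "COLE" (ÉCOLE) through bytewise_lower in the "C" locale lowers the ASCII letters but leaves both bytes of É as they were, so the result is "École" rather than "école".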
Solution 2
Boost provides a string algorithm for this:
#include <boost/algorithm/string.hpp>
std::string str = "HELLO, WORLD!";
boost::algorithm::to_lower(str); // modifies str
#include <boost/algorithm/string.hpp>
const std::string str = "HELLO, WORLD!";
const std::string lower_str = boost::algorithm::to_lower_copy(str);
Solution 3
tl;dr
Use the ICU library. If you don't, your conversion routine will break silently on cases you are probably not even aware of existing.
First you have to answer a question: What is the encoding of your std::string? Is it ISO-8859-1? Or perhaps ISO-8859-8? Or Windows Codepage 1252? Does whatever you're using to convert upper-to-lowercase know that? (Or does it fail miserably for characters over 0x7f?)
If you are using UTF-8 (the only sane choice among the 8-bit encodings) with std::string as container, you are already deceiving yourself if you believe you are still in control of things. You are storing a multibyte character sequence in a container that is not aware of the multibyte concept, and neither are most of the operations you can perform on it! Even something as simple as .substr() could result in invalid (sub-) strings because you split in the middle of a multibyte sequence.
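The .substr() hazard can be shown with a deliberately simplified checker. The function below is a hypothetical illustration, not a full UTF-8 validator: it only checks that each lead byte is followed by the right number of continuation bytes, and does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, simplified structural check for UTF-8 (illustration only).
bool is_structurally_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (lead < 0x80)                extra = 0;  // ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1;  // 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) extra = 2;  // 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) extra = 3;  // 4-byte sequence
        else return false;                          // stray continuation or invalid lead byte

        if (extra > s.size() - i - 1) return false; // sequence cut off at the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

With s = "na\xC3\xAF" "ve" (naïve), s.substr(0, 3) ends on the lead byte of ï and s.substr(3) starts on its continuation byte. Both fail the check, although s itself and s.substr(0, 4) pass.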
As soon as you try something like std::toupper( 'ß' ), or std::tolower( 'Σ' ) in any encoding, you are in trouble. Because 1), the standard only ever operates on one character at a time, so it simply cannot turn ß into SS as would be correct. And 2), the standard only ever operates on one character at a time, so it cannot decide whether Σ is in the middle of a word (where σ would be correct), or at the end (ς). Another example would be std::tolower( 'I' ), which should yield different results depending on the locale -- virtually everywhere you would expect i, but in Turkey ı (LATIN SMALL LETTER DOTLESS I) is the correct answer (which, again, is more than one byte in UTF-8 encoding).
So, any case conversion that works on a character at a time, or worse, a byte at a time, is broken by design. This includes all the std:: variants in existence at this time.
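To see why the Σ case needs context, here is a toy sketch on a std::u32string. It is emphatically not the Unicode algorithm: it handles only the basic Greek capitals, uses a crude "is there a Greek letter next to me" test, and all names in it are made up for illustration. What it shows is that the result for Σ depends on its neighbours, which a function taking a single character cannot see.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy illustration only; real text needs a library such as ICU.
const char32_t kCapitalSigma = 0x03A3;  // Σ
const char32_t kSmallSigma   = 0x03C3;  // σ
const char32_t kFinalSigma   = 0x03C2;  // ς

// Basic Greek capitals Α..Ω (U+03A2 is unassigned and therefore skipped).
bool is_greek_capital(char32_t c) {
    return c >= 0x0391 && c <= 0x03A9 && c != 0x03A2;
}

// Basic Greek small letters α..ω (includes ς).
bool is_greek_small(char32_t c) {
    return c >= 0x03B1 && c <= 0x03C9;
}

bool is_greek_letter(char32_t c) {
    return is_greek_capital(c) || is_greek_small(c);
}

// Hypothetical helper: lowercases basic Greek capitals, choosing ς for a
// sigma that follows a letter and is not followed by one.
std::u32string lower_greek(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t c = in[i];
        if (c == kCapitalSigma) {
            const bool letter_before = i > 0 && is_greek_letter(in[i - 1]);
            const bool letter_after  = i + 1 < in.size() && is_greek_letter(in[i + 1]);
            out.push_back(letter_before && !letter_after ? kFinalSigma : kSmallSigma);
        } else if (is_greek_capital(c)) {
            out.push_back(static_cast<char32_t>(c + 0x20));  // small letters sit 0x20 above
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

For the input ΟΔΥΣΣΕΥΣ, the two sigmas in the middle become σ and the last one becomes ς. The same code point maps to two different results in one word.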
Then there is the point that the standard library, for what it is capable of doing, depends on which locales are supported on the machine your software is running on... and what do you do if your target locale is not among those supported on your client's machine?
So what you are really looking for is a string class that is capable of dealing with all this correctly, and that is not any of the std::basic_string<> variants.
(C++11 note: std::u16string and std::u32string are better, but still not perfect. C++20 brought std::u8string, but all these do is specify the encoding. In many other respects they still remain ignorant of Unicode mechanics, like normalization, collation, ...)
While Boost looks nice API-wise, Boost.Locale is basically a wrapper around ICU, provided Boost is compiled with ICU support. If it isn't, Boost.Locale is limited to the locale support compiled for the standard library.
And believe me, getting Boost to compile with ICU can be a real pain sometimes. (There are no pre-compiled binaries for Windows that include ICU, so you'd have to supply them together with your application, and that opens a whole new can of worms...)
So personally I would recommend getting full Unicode support straight from the horse's mouth and using the ICU library directly:
#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/locid.h>
#include <iostream>
int main()
{
    /* "Odysseus" */
    char const * someString = u8"ΟΔΥΣΣΕΥΣ";
    icu::UnicodeString someUString( someString, "UTF-8" );
    // Setting the locale explicitly here for completeness.
    // Usually you would use the user-specified system locale,
    // which *does* make a difference (see ı vs. i above).
    std::cout << someUString.toLower( "el_GR" ) << "\n";
    std::cout << someUString.toUpper( "el_GR" ) << "\n";
    return 0;
}
Compile (with G++ in this example):
g++ -Wall example.cpp -licuuc -licuio
This gives:
ὀδυσσεύς
Note the Σ<->σ conversion in the middle of the word, and the Σ<->ς conversion at the end of the word. No <algorithm>-based solution can give you that.
Solution 4
Using the range-based for loop of C++11, simpler code would be:
#include <iostream> // std::cout
#include <string> // std::string
#include <locale> // std::locale, std::tolower
int main ()
{
    std::locale loc;
    std::string str="Test String.\n";
    for(auto elem : str)
        std::cout << std::tolower(elem,loc);
}
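The loop above looks up the locale's ctype facet on every call to std::tolower(elem, loc). A possible variation, sketched here with a made-up function name, fetches the facet once and lets it convert the whole buffer. It shares the limitations of every per-character approach discussed on this page.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: lowers a copy of s using the ctype<char> facet of loc.
// The facet is looked up once instead of once per character.
std::string lower_with_locale(std::string s, const std::locale& loc = std::locale()) {
    if (!s.empty()) {
        const std::ctype<char>& facet = std::use_facet<std::ctype<char>>(loc);
        facet.tolower(&s[0], &s[0] + s.size());  // range overload converts in place
    }
    return s;
}
```

With the classic "C" locale, lower_with_locale("Test String.\n") should return "test string.\n".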
Solution 5
If the string contains UTF-8 characters outside of the ASCII range, then boost::algorithm::to_lower will not convert those. Better use boost::locale::to_lower when UTF-8 is involved. See http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/conversions.html
Comments
-
uss about 2 years
I want to convert a std::string to lowercase. I am aware of the function tolower(). However, in the past I have had issues with this function and it is hardly ideal anyway, as using it with a std::string would require iterating over each character. Is there an alternative which works 100% of the time?
-
tenstormavi over 15 yearsThat is amazing, ive always wondered what the best way to do it. I had no idea to use std::transform. :)
-
Stefan Mai over 15 yearsuberjumper: There's actually a whole lot of overhead associated with the STL calls, especially for small"ish" strings. Solutions using a for loop and tolower are probably much faster.
-
eq- over 12 years(Old it may be, the algorithms in question have changed little) @Stefan Mai: What kind of "whole lot of overhead" is there in calling STL algorithms? The functions are rather lean (i.e. simple for loops) and often inlined as you rarely have many calls to the same function with the same template parameters in the same compile unit.
-
Stefan Mai over 12 years
@eq Fair point, my benchmarks agree with you when compiling with -O3 (though the STL actually outperforms the more hand-tuned code so I'm wondering whether the compiler is pulling some tricks). Debugging STL code is still a bear though ;).
-
incises over 10 years
However, on a French machine, this program doesn't convert non-ASCII characters allowed in the French language. For instance, a string 'Test String123. É Ï\n' will be converted to 'test string123. É Ï\n', although the characters É and Ï and their lower case counterparts 'é' and 'ï' are allowed in French. It seems that no solution for that was provided by other messages of this thread.
-
user1095108 over 10 yearsI think you need to set a proper locale for that.
-
Michal W over 10 years
This non-portable solution could be faster. You can avoid the branch this way: inChar |= 0x20. I think it is the fastest way to convert ASCII upper to lower. If you want to convert lower to upper, then: inChar &= ~0x20.
-
Stefan Mai about 10 years@MichalW This works if you have only letters, which isn't always the case. If you're in that realm, you can probably do even better by using bitmasks on longs -- take on 8 characters at a time ;)
-
Rag about 10 yearsEvery time you assume characters are ASCII, God kills a kitten. :(
-
juanchopanza almost 10 years
Your first example potentially has undefined behaviour (passing char to ::tolower(int)). You need to ensure you don't pass a negative value.
-
Cheers and hth. - Alf almost 10 years
-1 this use of ::tolower may well crash, it's UB for non-ASCII input.
-
pavon almost 10 years
While this would be the canonical way to do this in a sane world, it has too many problems to recommend it. First, tolower from ctype.h doesn't work with Unicode. Secondly, locale.h, which is included by many of the other std library headers, defines a conflicting tolower that causes headaches, see stackoverflow.com/q/5539249/339595. It is best to use std::locale or boost::locale::to_lower as other answers suggest.
-
Gábor Buella over 9 years"There is a way to convert upper case to lower WITHOUT doing if tests" ever heard of lookup tables?
-
DevSolar about 9 yearsFails for non-ASCII-7.
-
lmat - Reinstate Monica about 9 yearsThis is the correct answer in the general case. The standard gives nothing for handling anything except "ASCII" except lies and deception. It makes you think you can maybe deal with maybe UTF-16, but you can't. As this answer says, you cannot get the proper character-length (not byte-length) of a UTF-16 string without doing your own unicode handling. If you have to deal with real text, use ICU. Thanks, @DevSolar
-
Purefan over 8 yearsThis did not resize Ä into ä for me
-
NathanTempelman about 8 years::towlower if you're being international/using wide chars
-
BugShotGG about 8 years
@MichalW Hey, can you explain what you wrote there? Also, why do we use :: in ::tolower?
-
Shital Shah almost 8 years
Is ICU available by default on Ubuntu/Windows or does it need to be installed separately? Also, how about this answer: stackoverflow.com/a/35075839/207661?
-
Luis Paulo almost 8 years@StefanMai Hi. Why is the "::" needed before "tolower"? I don't understand that.
-
Dan almost 8 years
Note that this works for Unicode if you're using a std::u32string and your C locale is compatible with Unicode.
-
Charles Ofria almost 8 yearsThe :: is needed before tolower to indicate that it is in the outermost namespace. If you use this code in another namespace, there may be a different (possibly unrelated) definition of tolower which would end up being preferentially selected without the ::.
-
quazar over 7 years
Nice, as long as you can convert the characters in place. What if your source string is const? That seems to make it a bit more messy (e.g. it doesn't look like you can use f.tolower()), since you need to put the characters in a new string. Would you use transform() and something like std::bind1st( std::mem_fun() ) for the operator?
-
Sameer over 7 yearsFor a const string, we can just make a local copy and then convert it in place.
-
Alexis Wilke over 7 years@incises, this then someone posted an answer about ICU and that's certainly the way to go. Easier than most other solutions that would attempt to understand the locale.
-
quazar over 7 yearsYeah, though, making a copy adds more overhead.
-
chili about 7 yearsCould also use a back inserter iterator here instead of manual resize.
-
chili about 7 yearsYou could use std::transform with the version of ctype::tolower that does not take pointers. Use a back inserter iterator adapter and you don't even need to worry about pre-sizing your output string.
-
Arne Vogel almost 7 years
Great, especially because in libstdc++'s tolower with locale parameter, the implicit call to use_facet appears to be a performance bottleneck. One of my coworkers has achieved a several 100% speed increase by replacing boost::iequals (which has this problem) with a version where use_facet is only called once outside of the loop.
-
masaers almost 7 yearsicu::UnicodeString::length() is technically also lying to you (although less frequently), as it reports the number of 16bit code units rather than the number of code points. ;-)
-
DevSolar almost 7 years@masaers: To be completely fair, with things like combining characters, zero-width joiners and right-to-left markers, the number of code points is rather meaningless. I will remove that remark.
-
masaers almost 7 years@DevSolar Agreed! The concept of length is rather meaningless on text (we could add ligatures to the list of offenders). That said, since people are used to tabs and control chars taking up one length unit, code points would be the more intuitive measure. Oh, and thanks for giving the correct answer, sad to see it so far down :-(
-
kayleeFrye_onDeck almost 7 yearsI'd prefer to not use external libraries when possible, personally.
-
8.8.8.8 over 6 years
std::transform(data.begin(), data.end(), data.begin(), easytolower); is dangerous, since the behavior of std::tolower is undefined if the input is not representable as unsigned char and is not equal to EOF.
-
Clearer over 6 yearsI'm downvoting this for giving macros when a perfectly good solution exist -- you even give those solutions.
-
Volomike over 6 yearsThe macro technique means less typing of code for something that one would commonly use a lot in programming. Why not use that? Otherwise, why have macros at all?
-
Clearer over 6 yearsMacros are a legacy from C that's being worked hard on to get rid of. If you want to reduce the amount of typing, use a function or a lambda.
void strtoupper(std::string& x) { std::transform (x.begin(), x.end(), x.begin(), ::toupper); }
-
Volomike over 6 years@Clearer As I want to be a better coder, can you provide me any ANSI doc links where any ANSI C++ committees say something to the effect of, "We need to call a meeting to get rid of macros out of C++"? Or some other roadmap?
-
Clearer over 6 years
No, I can't. Bjarne's stance on the topic has been made pretty clear on several occasions though. Besides, there are plenty of reasons to not use macros in C as well as C++. x could be a valid expression that just happens to compile correctly but will give completely bogus results because of the macros.
-
T.E.D. over 6 years@BrianGordon - But its much easier, and there really are way too many cats in the world already.
-
Roland Illig over 6 yearsUndefined behavior for negative chars.
-
Cort Ammon over 6 years@BrianGordon That is blatantly false, as proven by the fact that there are still kittens in the world! =)
-
TypicalHog about 6 yearsWhat makes the 2nd solution non-portable? Can I just do this? pastebin.com/MPRMpQJS
-
bfontaine about 6 yearsPlease clarify where did you copy your answer from.
-
Alnitak almost 6 years@BrianGordon there are also cases when you know that the input is ASCII (e.g. the wire format of domain names).
-
Rag almost 6 years@Alnitak I didn't know that. How does DNS handle international domain names which can be in unicode?
-
Alnitak almost 6 years@BrianGordon applications have to convert them into an all-ASCII encoding called "Punycode" (RFC 3492)
-
Aquarius Power almost 6 yearsgood macros! @Clearer macros help us so much... I expect they never get rid of it.
-
Clearer almost 6 years@AquariusPower I disagree. I have yet to see a macro that could not have been done better as a template or a lambda.
-
DevSolar over 5 years
@TypicalHog: Because there is no guarantee that 'A' to 'Z' is a continuous range (EBCDIC); but more importantly because there are letters outside that range ('Ü', 'á', ...). It's very, very sad that the authors prefer to harvest more upvotes for answers with non-portable solutions instead of properly pointing out their shortcomings...
-
Violet Giraffe over 5 years
@DevSolar: easytolower seems a perfectly valid solution for latin ASCII symbols to me. Going to use it for normalizing HTML tag names.
-
Pavel P over 5 years
@Cheersandhth.-Alf C99 doesn't mention that it's UB: it either returns the lower char, or the char unmodified. std::tolower, however, mentions UB.
-
Deduplicator almost 5 years@L.F. I fixed your fix.
-
L. F. almost 5 years
@Deduplicator To be honest, I have always been having trouble understanding why the char has to be converted to unsigned char first. Isn't the value of a (signed) char supposed to be nonnegative, anyway? What is the point of tolowering a negative char? I guess I am missing the point, so would you mind explaining it to me a little bit please :)
-
Deduplicator almost 5 years
@L.F. No, char can be analogous to signed char, and a signed char can be negative. tolower only accepts unsigned char and -1. Anything outside its domain is UB, and you don't want to conflate with -1 either. While all members of the basic execution character set are non-negative, that does not necessarily hold for the (complete) execution character set. See the current draft.
-
L. F. almost 5 years
@Deduplicator Thank you! I didn't know a char can validly be negative. But then, doesn't converting to unsigned char just change the value?
-
Deduplicator almost 5 years
@L.F. char -> unsigned char (value-preserving, modulo 2**CHAR_BIT) -> implicit to int (value-preserving). Of course, if sizeof(int) == 1, things pretty much fall apart.
-
L. F. almost 5 years
@Deduplicator OK ... I think I missed that ... Then the int is converted to char, I think, so the resulting value is implementation-defined before C++20 and guaranteed to be the original value since C++20?
-
Deduplicator almost 5 years
@L.F. Converting the result from tolower() (int) back to char is also an interesting story, yes.
-
JPhi1618 over 4 yearsI don't understand why the tolower here is wrapped in a lambda rather than just passing it to transform on its own.
-
L. F. about 4 years
@JPhi1618 1) to make sure that the character is first converted to unsigned char (see Deduplicator's comments above); 2) to enable overload resolution to select the int tolower( int ch ); overload defined in <cctype> instead of the template< class charT > charT tolower( charT ch, const locale& loc ); overload defined in <locale>.
-
Contango about 4 yearsModern CPUs are bottlenecked in memory not CPU. Benchmarking would be interesting.
-
Juv about 4 years
This is what I needed. I just used towlower for wide characters, which supports UTF-16.
-
CCJ almost 4 yearshappily coding in Java and the time comes to switch over to a CPP module... comes along a simple string case issue Me: "I'll just look up the std::string toLower() or whatever the standard has for normalizing text case... Hmm, I wonder how they handle all the encoding and localization complexities a 'simple' task like that could entail when std::string is just raw text data?" finds this question... sad requiring that ingest data follows a case convention noises
-
Deduplicator over 3 years
Actually, std::string not being aware that it contains text in a multi-byte character-encoding is a feature, not a bug. It's the only sane way to do it, which is why just about everyone does it. Not having proper standard APIs for handling anything but the basic text of days gone by (which never really existed) is a problem though, yes. It would have to be optional even in a hosted environment though, as it is quite hefty, and there are many cases where it isn't needed.
-
DevSolar over 3 years@Deduplicator: Sorry, but that's just dodging it in all possible ways. There are standards (Unicode), there are quasi-standard APIs for handling it (ICU), and if your intention is to write code that properly converts text to lowercase, unless you can guarantee your code will only ever see ASCII-7 (which would be a rather special case), all the other "solutions" here are 80--20 at best.
-
Deduplicator over 3 yearsThat is why there should be such standard APIs. Doesn't negate the fact that much string-manipulation is best done ignoring all but it being a sequence of code-units. And that many use-cases never need anything more sophisticated.
-
DevSolar over 3 years@Deduplicator And that standard API is currently the ICU library, which is what this answer is about.
-
DevSolar about 3 years
@Deduplicator I heard that std::text is underway, perhaps even in time for C++23. Let's not give up all hope yet.
-
Rodrigo about 3 yearsI guess it won't work for UTF-8, will it?
-
prehistoricpenguin almost 3 yearsThis function is slow, shouldn't be used in real-life projects.
-
prehistoricpenguin almost 3 yearsThis is pretty slow, see this benchmark: godbolt.org/z/neM5jsva1
-
Velkan over 2 yearsA working example?
-
Dmitry Grigoryev over 2 years
This is plain wrong: if you check the documentation, you will see that std::tolower cannot work with char, it only supports unsigned char. So this code is UB if str contains characters outside of 0x00-0x7F.
-
Dmitry Grigoryev over 2 years
This won't work in Windows where you'd have to call std::locale("English_Unites States.UTF8").
-
JadeSpy about 2 yearsI don't think you need to wrap std::tolower in a lambda.
-
Mayou36 about 2 years@prehistoricpenguin slow? Well, slow is to debug code because your own implementation has a bug because it was more complicated than to just call the boost library ;) If the code is critical, like called a lot and provides a bottleneck, then, well, it can be worth to think about slowness
-
Kiruahxh almost 2 years
icu::UnicodeString seems to be a good class. QString can also do the job. However, it is a pain to use in big programs with many libraries. I hope std::text will be a real thing soon.