Ignore case in Python strings

python string case-insensitive

108,174

Solution 1

In response to your clarification...

You could use ctypes to call the C function "strcasecmp". ctypes has been included with Python since version 2.5. It provides the ability to call into DLLs and shared libraries such as libc. Here is a quick example (Python on Linux; see link for Win32 help):

from ctypes import *
libc = CDLL("libc.so.6")  # see link above for Win32 help
# Python 2 str shown; on Python 3 pass bytes instead, e.g. b"THIS"
libc.strcasecmp("THIS", "this")  # returns 0
libc.strcasecmp("THIS", "THAT")  # returns 8

You may also want to consult the strcasecmp documentation.

I'm not sure whether this is any faster or slower (I have not tested it), but it's a way to use a C function to do case-insensitive string comparisons.
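For readers on Python 3, here is a minimal sketch of the same idea. It assumes a POSIX system whose C library exposes strcasecmp, and ASCII-only input; the wrapper name icmp is made up for this illustration. Python 3 strings have to be encoded to bytes before the call, which itself creates temporary copies, so this is unlikely to save memory there.

```python
import ctypes
import ctypes.util
from functools import cmp_to_key

# Load the C library. find_library may return None, in which case
# CDLL(None) exposes the symbols already loaded into the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strcasecmp.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
libc.strcasecmp.restype = ctypes.c_int

def icmp(a, b):
    """Hypothetical helper: three-way, case-insensitive compare of ASCII strings."""
    return libc.strcasecmp(a.encode("ascii"), b.encode("ascii"))

# list.sort() no longer accepts cmp= in Python 3, so wrap the comparator.
words = ["banana", "Apple", "cherry"]
words.sort(key=cmp_to_key(icmp))
print(words)
```

The sign of the return value is what matters; the exact non-zero number may differ between C libraries.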

~~~~~~~~~~~~~~

ActiveState Code - Recipe 194371: Case Insensitive Strings is a recipe for creating a case-insensitive string class. It might be a bit of overkill for something quick, but it could provide you with a common way of handling case-insensitive strings if you plan on using them often.

Solution 2

Are you using this compare in a very-frequently-executed path of a highly-performance-sensitive application? Alternatively, are you running this on strings which are megabytes in size? If not, then you shouldn't worry about the performance and just use the .lower() method.

The following code demonstrates that doing a case-insensitive compare by calling .lower() on two strings which are each almost a megabyte in size takes about 0.009 seconds on my 1.8GHz desktop computer:

from timeit import Timer

s1 = "1234567890" * 100000 + "a"
s2 = "1234567890" * 100000 + "B"

code = "s1.lower() < s2.lower()"
time = Timer(code, "from __main__ import s1, s2").timeit(1000)
print(time / 1000)   # 0.00920499992371 on my machine

If indeed this is an extremely significant, performance-critical section of code, then I recommend writing a function in C and calling it from your Python code, since that will allow you to do a truly efficient case-insensitive search. Details on writing C extension modules can be found here: https://docs.python.org/extending/extending.html

Solution 3

Your question implies that you don't need Unicode. Try the following code snippet; if it works for you, you're done:

Python 2.5.2 (r252:60911, Aug 22 2008, 02:34:17)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, "en_US")
'en_US'
>>> sorted("ABCabc", key=locale.strxfrm)
['a', 'A', 'b', 'B', 'c', 'C']
>>> sorted("ABCabc", cmp=locale.strcoll)
['a', 'A', 'b', 'B', 'c', 'C']

Clarification: in case it is not obvious at first sight, locale.strcoll seems to be the function you need, since it avoids the "duplicate" strings that str.lower or locale.strxfrm create.
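The session above is Python 2. In Python 3, sorted() and list.sort() no longer take a cmp= argument, so a comparison function such as locale.strcoll has to be wrapped with functools.cmp_to_key. A rough sketch, assuming an "en_US.UTF-8" locale is installed (the locale name is an assumption and varies by system):

```python
import locale
from functools import cmp_to_key

# Assumption: "en_US.UTF-8" exists on this machine; if not, keep the current locale.
try:
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
except locale.Error:
    pass

# strcoll is a three-way comparator, so it must be adapted into a key function.
letters = sorted("ABCabc", key=cmp_to_key(locale.strcoll))
print(letters)  # the order depends on the active locale's collation rules
```

Whether upper and lower case end up interleaved depends entirely on the collation rules of the locale that is actually active.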

Solution 4

I can't find any other built-in way of doing case-insensitive comparison: the Python Cookbook recipe uses lower().

However, you have to be careful when using lower() for comparisons because of the Turkish I problem. Unfortunately, Python's handling of the Turkish I's is not good. ı is converted to I, but I is not converted to ı. İ is converted to i, but i is not converted to İ.
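The mappings described above can be checked directly. This is a small sketch run against Python 3, whose str methods apply the default, locale-independent Unicode case mappings; results on the Python 2 versions the answer was written for may differ.

```python
# Default (locale-independent) case mappings as applied by Python 3's str methods.
print("I".lower() == "i")          # True: not mapped to dotless ı
print("i".upper() == "I")          # True: not mapped to dotted İ
print("ı".upper() == "I")          # True
print("İ".lower() == "i\u0307")    # True: "i" followed by COMBINING DOT ABOVE
print(len("İ".lower()))            # 2
```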

Solution 5

There's no built-in equivalent to the function you want.

You can write your own function that lowercases one character at a time to avoid duplicating both strings, but I'm sure it will be very CPU-intensive and extremely inefficient.

Unless you are working with extremely long strings (so long that duplicating them could cause a memory problem), I would keep it simple and use:

str1.lower() == str2.lower()

You'll be OK.
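For illustration only, the character-at-a-time idea mentioned above might look like the sketch below. The function name icmp_lazy is made up here, and as the answer warns, a pure-Python loop like this is expected to be much slower than comparing two lowered copies.

```python
def icmp_lazy(a, b):
    """Hypothetical three-way compare that lowercases one character at a time.

    Returns a negative number, zero, or a positive number, like C's stricmp().
    """
    for ca, cb in zip(a, b):
        la, lb = ca.lower(), cb.lower()
        if la != lb:
            return -1 if la < lb else 1
    # All shared positions match; the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))

print(icmp_lazy("abc", "ABC"))  # 0
print(icmp_lazy("abc", "ABD"))  # -1
print(icmp_lazy("abcd", "ABC"))  # 1
```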


Author by

Paul Oyster

Updated on July 09, 2022

Comments

Paul Oyster almost 2 years

What is the easiest way to compare strings in Python, ignoring case?

Of course one can do (str1.lower() <= str2.lower()), etc., but this creates two additional temporary strings (with the obvious alloc/g-c overheads).

I guess I'm looking for an equivalent to C's stricmp().

[Some more context requested, so I'll demonstrate with a trivial example:]

Suppose you want to sort a looong list of strings. You simply do theList.sort(). This is O(n * log(n)) string comparisons and no memory management (since all strings and list elements are some sort of smart pointers). You are happy.

Now, you want to do the same, but ignore the case (let's simplify and say all strings are ASCII, so locale issues can be ignored). You can do theList.sort(key=lambda s: s.lower()), but then you cause two new allocations per comparison, plus burden the garbage collector with the duplicated (lowered) strings. Each such piece of memory-management noise is orders of magnitude slower than a simple string comparison.

Now, with an in-place stricmp()-like function, you do: theList.sort(cmp=stricmp) and it is as fast and as memory-friendly as theList.sort(). You are happy again.

The problem is that any Python-based case-insensitive comparison involves implicit string duplications, so I was expecting to find a C-based comparison (maybe in module string).

Could not find anything like that, hence the question here. (Hope this clarifies the question).
Paul Oyster over 15 years

the question is more general than the example itself (actually, in real life scenarios you don't want to be bothered by attaching a lowercase version to every string that might need icmp() later), but even in this trivial example, you don't want to double the memory only to be able to sort...
Paul Oyster over 15 years

I know this recipe well, but behind the scenes it simply has a lowercased duplicate for every string, which is no good (as explained in the trivial example I added)
Ishbir over 15 years

Case-insensitive regular expressions can only be used for equality tests (True/False), not comparison (less than/equal/greater than)
Ishbir over 15 years

"Never say never" :) "There is no built in equivalent" is absolute; "I know of no built in equivalent" would be closer to the truth. locale.strcoll, given a case-insensitive LC_COLLATE (as 'en_US' is), is a built-in.
Paul Oyster over 15 years

The global setting of locale.setlocale() is obviously an overkill (way too global).
Paul Oyster over 15 years

tuples are cheap, but the duplication of strings is not...
Paul Oyster over 15 years

The ctypes solution is what I was looking for, thanks. For reference, here is the Win32 code:

from ctypes import *
clib = cdll.LoadLibrary("msvcrt")
theList = ["abc","ABC","def","DEF"] * 1000000
theList.sort(cmp = clib._stricmp)
Admin over 15 years

this is much slower. see my answer!
Admin over 15 years

this is also what python's sort with the key= argument does.
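A quick way to see how key= behaves is to count the calls. In the sketch below (the names counting_lower and calls are made up for the illustration), the key function runs once per list element rather than once per comparison, and the sort then compares the precomputed keys.

```python
calls = [0]  # mutable counter so the key function can update it

def counting_lower(s):
    """Hypothetical key function that counts how often it is invoked."""
    calls[0] += 1
    return s.lower()

the_list = ["abc", "ABC", "def", "DEF"] * 1000
the_list.sort(key=counting_lower)

print(calls[0] == len(the_list))  # True: one key call per element
```

So the lowered copies cost memory proportional to the list size for the duration of the sort, not an allocation on every comparison.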
Darius Bacon over 15 years

I believe this gives the wrong answer for strings with nulls in them.
Ishbir over 15 years

I don't know what the "obvious overkill" is, and the "global" setting can be as localized as you like (except if you work with threads and need some threads localized and some not, for some reason).
John Machin about 14 years

[I wish there was a virtual rubber-stamp for this] Don't use $, use \Z. Read the fantastic manual to find out what $ actually does; don't rely on legend or guesswork or whatever.
Gorm Casper about 14 years

I changed it. I also turned on the community wiki feature for my answer. Thanks.
Neil Mayhew over 13 years

This is the only solution that produces results that can interoperate correctly with case-insensitive utilities such as Unix sort with the -f option. For example, str.lower causes A_ to sort before AA.
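The A_ versus AA effect comes from where "_" sits in ASCII: after the upper-case letters but before the lower-case ones. A tiny sketch showing how the choice of folding direction changes the order:

```python
names = ["AA", "A_"]

# Lowering compares "a_" with "aa"; "_" (0x5F) sorts before "a" (0x61).
print(sorted(names, key=str.lower))  # ['A_', 'AA']

# Uppering compares "A_" with "AA"; "A" (0x41) sorts before "_" (0x5F).
print(sorted(names, key=str.upper))  # ['AA', 'A_']
```

Whether either order matches a particular external tool is something to verify against that tool and its locale settings.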
Manav about 13 years

so this is how you pass stuff to the Timer class. thanks for solving a very different itch of mine :)
tchrist almost 13 years

No, this is all wrong. The only correct solution compares their Unicode casefolds. Otherwise you will screw up.
tchrist almost 13 years

You cannot use POSIX locales and strcoll, because it is unreliable across platforms. You must use Unicode casefolds, which are guaranteed to work the same everywhere.
tchrist almost 13 years

This is completely wrong. It fails to detect that ΣΤΙΓΜΑΣ and στιγμας are the same case-insensitively. You must not use casemapping to compare case in Unicode. You must use casefolding. These are different things. Σ, σ, ς are all the same, just as S, ſ, s (what is it with s’s anyway? :) and Μ, μ, µ are. There are innumerably many other similar circumstances, like how weiß, WEIẞ, weiss, WEISS are all the same too, or eﬃcient, efficient. You must use casefolds, because casemaps don’t work.
tchrist almost 13 years

This is a 7-bit mindset that is wholly inappropriate for Unicode data. You must either use the full Unicode casefold, or else the primary collation strength per the Unicode Collation Algorithm. Yes, that means new copies of the string either way, but at least then you can do a binary comparison instead of having to rummage through the tables for each code point.
tchrist almost 13 years

Python doesn’t handle Unicode very robustly, as you have seen. The casemaps don’t pay attention to these things. Very sad.
tchrist almost 13 years

This answer is wrong. The only correct way is str1.fold() == str2.fold(), but that requires an extension to the default python string class that supports the full Unicode casefold of a string. It’s a missing function.
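Since this comment was written, Python 3.3 added str.casefold(), which applies Unicode case folding. A minimal sketch of comparing by casefold (the helper name same_ignoring_case is made up for the illustration):

```python
def same_ignoring_case(a, b):
    """Hypothetical helper: equality under full Unicode case folding."""
    return a.casefold() == b.casefold()

print(same_ignoring_case("weiß", "WEISS"))         # True
print(same_ignoring_case("WEIẞ", "weiss"))         # True
print(same_ignoring_case("eﬃcient", "efficient"))  # True
print("weiß".lower() == "WEISS".lower())           # False: lower() keeps ß

# casefold also works as a sort key for ordering, not just equality.
print(sorted(["b", "A", "c"], key=str.casefold))   # ['A', 'b', 'c']
```

Case folding does not normalize composed and decomposed forms, so text from mixed sources may additionally need unicodedata.normalize before comparing.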
n611x007 about 10 years

@tchrist unclear: is there such an extension available?
n611x007 about 10 years

@tchrist as a side note, the string 'unicode' is not in the question. Nevertheless, your comments, which seem to point anyone coming here with a hope for Unicode in the right direction, are invaluable. I attempted to edit the question title to more accurately reflect the question.
martineau almost 10 years

Good only for equality testing, which isn't quite the same thing as comparing two strings and determining whether one is less than, equal to, or greater than the other.
Gorm Casper over 9 years

@martineau thanks. I added a note, and also did some searching and found a solution that I think I'd be more comfortable with, and updated my answer with it. It isn't a full answer, though. Hopefully someone (myself if I get around to it) will learn how one of these libraries works and provide a code sample.
martineau over 9 years

Yes, it sounds like the pyuca (Python Unicode Collation Algorithm) extension might work because the report it's based on -- the Unicode Collation Algorithm (UCA) -- says "Case differences (uppercase versus lowercase), are typically ignored".