efficiently checking that string consists of one character in Python
Solution 1
This is by far the fastest, several times faster than even count()
, just time it with that excellent mgilson's timing suite:
s == len(s) * s[0]
Here all the checking is done inside the Python C code which just:
- allocates len(s) characters;
- fills the space with the first character;
- compares two strings.
The longer the string is, the greater is time bonus. However, as mgilson writes, it creates a copy of the string, so if your string length is many millions of symbols, it may become a problem.
As we can see from timing results, generally the fastest ways to solve the task do not execute any Python code for each symbol. However, the set()
solution also does all the job inside C code of the Python library, but it is still slow, probably because of operating string through Python object interface.
UPD: Concerning the empty string case. What to do with it strongly depends on the task. If the task is "check if all the symbols in a string are the same", s == len(s) * s[0]
is a valid answer (no symbols mean an error, and exception is ok). If the task is "check if there is exactly one unique symbol", empty string should give us False, and the answer is s and s == len(s) * s[0]
, or bool(s) and s == len(s) * s[0]
if you prefer receiving boolean values. Finally, if we understand the task as "check if there are no different symbols", the result for empty string is True, and the answer is not s or s == len(s) * s[0]
.
Solution 2
>>> s = 'AAAAAAAAAAAAAAAAAAA'
>>> s.count(s[0]) == len(s)
True
This doesn't short circuit. A version which does short-circuit would be:
>>> all(x == s[0] for x in s)
True
However, I have a feeling that due the the optimized C implementation, the non-short circuiting version will probably perform better on some strings (depending on size, etc)
Here's a simple timeit
script to test some of the other options posted:
import timeit
import re
def test_regex(s,regex=re.compile(r'^(.)\1*$')):
return bool(regex.match(s))
def test_all(s):
return all(x == s[0] for x in s)
def test_count(s):
return s.count(s[0]) == len(s)
def test_set(s):
return len(set(s)) == 1
def test_replace(s):
return not s.replace(s[0],'')
def test_translate(s):
return not s.translate(None,s[0])
def test_strmul(s):
return s == s[0]*len(s)
tests = ('test_all','test_count','test_set','test_replace','test_translate','test_strmul','test_regex')
print "WITH ALL EQUAL"
for test in tests:
print test, timeit.timeit('%s(s)'%test,'from __main__ import %s; s="AAAAAAAAAAAAAAAAA"'%test)
if globals()[test]("AAAAAAAAAAAAAAAAA") != True:
print globals()[test]("AAAAAAAAAAAAAAAAA")
raise AssertionError
print
print "WITH FIRST NON-EQUAL"
for test in tests:
print test, timeit.timeit('%s(s)'%test,'from __main__ import %s; s="FAAAAAAAAAAAAAAAA"'%test)
if globals()[test]("FAAAAAAAAAAAAAAAA") != False:
print globals()[test]("FAAAAAAAAAAAAAAAA")
raise AssertionError
On my machine (OS-X 10.5.8, core2duo, python2.7.3) with these contrived (short) strings, str.count
smokes set
and all
, and beats str.replace
by a little, but is edged out by str.translate
and strmul
is currently in the lead by a good margin:
WITH ALL EQUAL
test_all 5.83863711357
test_count 0.947771072388
test_set 2.01028490067
test_replace 1.24682998657
test_translate 0.941282987595
test_strmul 0.629556179047
test_regex 2.52913498878
WITH FIRST NON-EQUAL
test_all 2.41147494316
test_count 0.942595005035
test_set 2.00480484962
test_replace 0.960338115692
test_translate 0.924381017685
test_strmul 0.622269153595
test_regex 1.36632800102
The timings could be slightly (or even significantly?) different between different systems and with different strings, so that would be worth looking into with an actual string you're planning on passing.
Eventually, if you hit the best case for all
enough, and your strings are long enough, you might want to consider that one. It's a better algorithm ... I would avoid the set
solution though as I don't see any case where it could possibly beat out the count
solution.
If memory could be an issue, you'll need to avoid str.translate
, str.replace
and strmul
as those create a second string, but this isn't usually a concern these days.
Solution 3
You could convert to a set and check there is only one member:
len(set("AAAAAAAA"))
Solution 4
Try using the built-in function all
:
all(c == 'A' for c in s)
Solution 5
Adding another solution to this problem
>>> not "AAAAAA".translate(None,"A")
True
Related videos on Youtube
Mazdak
Updated on July 05, 2022Comments
-
Mazdak almost 2 years
What is an efficient way to check that a string
s
in Python consists of just one character, say'A'
? Something likeall_equal(s, 'A')
which would behave like this:all_equal("AAAAA", "A") = True all_equal("AAAAAAAAAAA", "A") = True all_equal("AAAAAfAAAAA", "A") = False
Two seemingly inefficient ways would be to: first convert the string to a list and check each element, or second to use a regular expression. Are there more efficient ways or are these the best one can do in Python? Thanks.
-
mikołak over 11 yearsI'm a bit surprised no one yet asked the following question: what is the structure of the "non-uniform" strings that you input? If there is one (i.e. they are not completely random), you can use the knowledge about that to optimize your algorithm.
-
Barmar over 11 yearsIs efficiency of this really that important? I wonder what kind of application would have this check in the bottleneck code.
-
Ellioh over 11 yearsTo my mind, the efficiency of that rarely can be important, though who knows? Anyway, the task is nice, and understanding what's happening inside the code and why some solutions are slow, and some are fast, is a useful training for a Python developer.
-
-
mgilson over 11 years@InbarRose -- Those time differences are on the order of 1% -- It's not worth worrying about IMHO. Especially considering that
is
is relying on aCpython
implementation detail. -
Inbar Rose over 11 yearsthat's what
all
does as well. once it finds a value which is false, it stops. -
Inbar Rose over 11 years@mgilson I do not remember exactly where I read it, but I recall that when dealing with char comparison its better to use
is
. -
mgilson over 11 years@InbarRose -- I doubt it. AFAIK, there is no guarantee that characters are singletons (though I suppose I could be wrong about this)
-
Silas Ray over 11 yearsI don't think
is
is guaranteed to work. It's relying on an implementation detail of the interpreter, namely interning of strings, which isn't guaranteed according to the Python spec as far as I recall. Yes it will probably work everywhere at least for short strings, but 'probably' is not really the best thing to rely on. -
Ellioh over 11 yearsnot len("AAAAAAAAA".replace('A', '')) #0 means True, so not operator is required
-
Master_Yoda over 11 yearsInteresting. Makes sense. Still, probably more efficient than the set method, right?
-
Ellioh over 11 yearsMade an addition to your timing suite. My version is still much worse than count() if the string consists of "A"s, but slightly better when the first char is 'f'.
-
Ellioh over 11 yearsWOW! It seems I have found a faster version: s == len(s) * s[0]
-
mgilson over 11 yearsSurprisingly,
translate
is winning in my timings ... (but not by much) (+1). -
Inbar Rose over 11 yearsyou can check yourself with
timeit
-
mgilson over 11 yearsAnd apparently not winning for too long. You've been beat out by
s == s[0]*len(s)
:) -
mgilson over 11 yearsIt probably beats count in the non-equal case because it can short-circuit the string comparison. It still surprises me that it wins so well though. It seems like
count
should be a simple loop which would do approximately the same amount of work that the filling of the space would do. I suppose that filling is probably done via some impressive vectorized/optimized code in the C string library. (As is the comparison which can be short circuited). -
Ellioh over 11 yearsI think that count() looks for a string, not for a single character. Probably its algorithm is much more complex than simple strncmp() in comparison of two strings. If there were a special case of count() searching for a single character, it would work much faster, but apparently there is no such case.
-
mgilson over 11 yearsAhh, right. That must be the case. I had forgotten that
count
could operate on strings with longer length than 1 :). Silly me. -
Hammerite over 11 yearsWhat if s is the empty string?
-
sea-rob over 11 yearsIs == taking advantage of machine-level "word" comparisons for fast string compares? So on a 64-bit system matching 4 or 8 characters at a time. To beat the memory problem, you could iterate and match against a fixed string s[0]*1024 over and over until the end. lol nice answer!
-
sea-rob over 11 yearslol strmul, and to some degree count, hold up as the string scales. Here are the times I get for a source string of 1024 characters: [('test_strmul', 0.9134591826614835), ('test_set', 45.61518321462644), ('test_count', 2.706394388113573)]
-
mgilson over 11 years@oefe -- You never really know anything until you actually measure :)
-
Ellioh over 11 years@Hammerite -- empty string requires checking, here we were just testing the ideas. In addition, I don't know what to say about empty strings, True or False (is the question "check if a string contains EXACTLY ONE unique cheracter", or "check if there is not more than one unique character"). Depending on that the full answer is either "not s or s[0] * len(s) == s" or "s and s[0] * len(s) == s"
-
Hammerite over 11 yearsAs a mathematician, my instinct is that the function should return True if s is the empty string. But the OP should really indicate.
-
Hammerite over 11 yearsYou say 'If the task is "check if all the symbols in a string are the same"... [then throwing an exception is fine]'. I disagree. All symbols in the empty string are the same, in the sense that you cannot show me two different symbols that are in the empty string.
-
Ellioh over 11 years"All symbols in the empty string are the same"? I'd rather say that none of them are different. What is the main point: existence of equal ones, or absence of not equal? :-) So this is like a question "are points a special case of an infinitely short line?" Purely depends on initial set of definitions. So I find all the three options possible.
-
Barmar over 11 yearsHow does the RE
^(.)\1*$
compare? -
mgilson over 11 years@Barmar -- I don't know? Sorry, I'm not a regex guru, so you might need to spell this one out for me so I can add it to the test-suite.
-
Barmar over 11 yearsAnd I'm not a regular Python programmer. But something like
re.match('^(1)\1*$', s)
. -
mgilson over 11 years@Barmar -- Your regex solution is in the middle somewhere. Not bad.
-
Ellioh over 11 years'^' in regex is not required for match(), and removing '^' makes the regex solution a little faster.
-
sotapme about 11 yearsSurely it should be
s == len(s) * c[0]
as it seems to mes == len(s) * s[0]
just tests that a string is made up of the same character. Whereas the original OP was implying that there were 2 variables s(tring) to test against and (c)character to check for. -
umbreonben almost 5 years"no symbols mean an error, and exception is ok" -- according to who exactly?
-
Ellioh over 4 yearsAccording to common Python conventions. Errors should raise exceptions. However, the key word is "if". If the task is "check if all the symbols in a string are the same", that at least assumes the symbols to exist. But the other versions of the task text do not. So, it's correct just to ask "What would you like to get in empty string case?" Everything is possible. :-)
-
Timo over 3 yearsMeans the asterisk "repetition of spaces" or rep. of chars here? The comparison is too short to understand "at heart" for me.
-
Ellioh over 3 yearsIt's repetition of the string. The string to be repeated is s[0] and consists of a single symbol.