efficiently checking that string consists of one character in Python

40,052

Solution 1

This is by far the fastest, several times faster than even count(), just time it with that excellent mgilson's timing suite:

s == len(s) * s[0]

Here all the checking is done inside the Python C code which just:

  • allocates len(s) characters;
  • fills the space with the first character;
  • compares two strings.

The longer the string is, the greater is time bonus. However, as mgilson writes, it creates a copy of the string, so if your string length is many millions of symbols, it may become a problem.

As we can see from timing results, generally the fastest ways to solve the task do not execute any Python code for each symbol. However, the set() solution also does all the job inside C code of the Python library, but it is still slow, probably because of operating string through Python object interface.

UPD: Concerning the empty string case. What to do with it strongly depends on the task. If the task is "check if all the symbols in a string are the same", s == len(s) * s[0] is a valid answer (no symbols mean an error, and exception is ok). If the task is "check if there is exactly one unique symbol", empty string should give us False, and the answer is s and s == len(s) * s[0], or bool(s) and s == len(s) * s[0] if you prefer receiving boolean values. Finally, if we understand the task as "check if there are no different symbols", the result for empty string is True, and the answer is not s or s == len(s) * s[0].

Solution 2

>>> s = 'AAAAAAAAAAAAAAAAAAA'
>>> s.count(s[0]) == len(s)
True

This doesn't short circuit. A version which does short-circuit would be:

>>> all(x == s[0] for x in s)
True

However, I have a feeling that due the the optimized C implementation, the non-short circuiting version will probably perform better on some strings (depending on size, etc)


Here's a simple timeit script to test some of the other options posted:

import timeit
import re

def test_regex(s,regex=re.compile(r'^(.)\1*$')):
    return bool(regex.match(s))

def test_all(s):
    return all(x == s[0] for x in s)

def test_count(s):
    return s.count(s[0]) == len(s)

def test_set(s):
    return len(set(s)) == 1

def test_replace(s):
    return not s.replace(s[0],'')

def test_translate(s):
    return not s.translate(None,s[0])

def test_strmul(s):
    return s == s[0]*len(s)

tests = ('test_all','test_count','test_set','test_replace','test_translate','test_strmul','test_regex')

print "WITH ALL EQUAL"
for test in tests:
    print test, timeit.timeit('%s(s)'%test,'from __main__ import %s; s="AAAAAAAAAAAAAAAAA"'%test)
    if globals()[test]("AAAAAAAAAAAAAAAAA") != True:
        print globals()[test]("AAAAAAAAAAAAAAAAA")
        raise AssertionError

print
print "WITH FIRST NON-EQUAL"
for test in tests:
    print test, timeit.timeit('%s(s)'%test,'from __main__ import %s; s="FAAAAAAAAAAAAAAAA"'%test)
    if globals()[test]("FAAAAAAAAAAAAAAAA") != False:
        print globals()[test]("FAAAAAAAAAAAAAAAA")
        raise AssertionError

On my machine (OS-X 10.5.8, core2duo, python2.7.3) with these contrived (short) strings, str.count smokes set and all, and beats str.replace by a little, but is edged out by str.translate and strmul is currently in the lead by a good margin:

WITH ALL EQUAL
test_all 5.83863711357
test_count 0.947771072388
test_set 2.01028490067
test_replace 1.24682998657
test_translate 0.941282987595
test_strmul 0.629556179047
test_regex 2.52913498878

WITH FIRST NON-EQUAL
test_all 2.41147494316
test_count 0.942595005035
test_set 2.00480484962
test_replace 0.960338115692
test_translate 0.924381017685
test_strmul 0.622269153595
test_regex 1.36632800102

The timings could be slightly (or even significantly?) different between different systems and with different strings, so that would be worth looking into with an actual string you're planning on passing.

Eventually, if you hit the best case for all enough, and your strings are long enough, you might want to consider that one. It's a better algorithm ... I would avoid the set solution though as I don't see any case where it could possibly beat out the count solution.

If memory could be an issue, you'll need to avoid str.translate, str.replace and strmul as those create a second string, but this isn't usually a concern these days.

Solution 3

You could convert to a set and check there is only one member:

len(set("AAAAAAAA"))

Solution 4

Try using the built-in function all:

all(c == 'A' for c in s)

Solution 5

Adding another solution to this problem

>>> not "AAAAAA".translate(None,"A")
True
Share:
40,052

Related videos on Youtube

Mazdak
Author by

Mazdak

Updated on July 05, 2022

Comments

  • Mazdak
    Mazdak almost 2 years

    What is an efficient way to check that a string s in Python consists of just one character, say 'A'? Something like all_equal(s, 'A') which would behave like this:

    all_equal("AAAAA", "A") = True
    
    all_equal("AAAAAAAAAAA", "A") = True
    
    all_equal("AAAAAfAAAAA", "A") = False
    

    Two seemingly inefficient ways would be to: first convert the string to a list and check each element, or second to use a regular expression. Are there more efficient ways or are these the best one can do in Python? Thanks.

    • mikołak
      mikołak over 11 years
      I'm a bit surprised no one yet asked the following question: what is the structure of the "non-uniform" strings that you input? If there is one (i.e. they are not completely random), you can use the knowledge about that to optimize your algorithm.
    • Barmar
      Barmar over 11 years
      Is efficiency of this really that important? I wonder what kind of application would have this check in the bottleneck code.
    • Ellioh
      Ellioh over 11 years
      To my mind, the efficiency of that rarely can be important, though who knows? Anyway, the task is nice, and understanding what's happening inside the code and why some solutions are slow, and some are fast, is a useful training for a Python developer.
  • mgilson
    mgilson over 11 years
    @InbarRose -- Those time differences are on the order of 1% -- It's not worth worrying about IMHO. Especially considering that is is relying on a Cpython implementation detail.
  • Inbar Rose
    Inbar Rose over 11 years
    that's what all does as well. once it finds a value which is false, it stops.
  • Inbar Rose
    Inbar Rose over 11 years
    @mgilson I do not remember exactly where I read it, but I recall that when dealing with char comparison its better to use is.
  • mgilson
    mgilson over 11 years
    @InbarRose -- I doubt it. AFAIK, there is no guarantee that characters are singletons (though I suppose I could be wrong about this)
  • Silas Ray
    Silas Ray over 11 years
    I don't think is is guaranteed to work. It's relying on an implementation detail of the interpreter, namely interning of strings, which isn't guaranteed according to the Python spec as far as I recall. Yes it will probably work everywhere at least for short strings, but 'probably' is not really the best thing to rely on.
  • Ellioh
    Ellioh over 11 years
    not len("AAAAAAAAA".replace('A', '')) #0 means True, so not operator is required
  • Master_Yoda
    Master_Yoda over 11 years
    Interesting. Makes sense. Still, probably more efficient than the set method, right?
  • Ellioh
    Ellioh over 11 years
    Made an addition to your timing suite. My version is still much worse than count() if the string consists of "A"s, but slightly better when the first char is 'f'.
  • Ellioh
    Ellioh over 11 years
    WOW! It seems I have found a faster version: s == len(s) * s[0]
  • mgilson
    mgilson over 11 years
    Surprisingly, translate is winning in my timings ... (but not by much) (+1).
  • Inbar Rose
    Inbar Rose over 11 years
    you can check yourself with timeit
  • mgilson
    mgilson over 11 years
    And apparently not winning for too long. You've been beat out by s == s[0]*len(s) :)
  • mgilson
    mgilson over 11 years
    It probably beats count in the non-equal case because it can short-circuit the string comparison. It still surprises me that it wins so well though. It seems like count should be a simple loop which would do approximately the same amount of work that the filling of the space would do. I suppose that filling is probably done via some impressive vectorized/optimized code in the C string library. (As is the comparison which can be short circuited).
  • Ellioh
    Ellioh over 11 years
    I think that count() looks for a string, not for a single character. Probably its algorithm is much more complex than simple strncmp() in comparison of two strings. If there were a special case of count() searching for a single character, it would work much faster, but apparently there is no such case.
  • mgilson
    mgilson over 11 years
    Ahh, right. That must be the case. I had forgotten that count could operate on strings with longer length than 1 :). Silly me.
  • Hammerite
    Hammerite over 11 years
    What if s is the empty string?
  • sea-rob
    sea-rob over 11 years
    Is == taking advantage of machine-level "word" comparisons for fast string compares? So on a 64-bit system matching 4 or 8 characters at a time. To beat the memory problem, you could iterate and match against a fixed string s[0]*1024 over and over until the end. lol nice answer!
  • sea-rob
    sea-rob over 11 years
    lol strmul, and to some degree count, hold up as the string scales. Here are the times I get for a source string of 1024 characters: [('test_strmul', 0.9134591826614835), ('test_set', 45.61518321462644), ('test_count', 2.706394388113573)]
  • mgilson
    mgilson over 11 years
    @oefe -- You never really know anything until you actually measure :)
  • Ellioh
    Ellioh over 11 years
    @Hammerite -- empty string requires checking, here we were just testing the ideas. In addition, I don't know what to say about empty strings, True or False (is the question "check if a string contains EXACTLY ONE unique cheracter", or "check if there is not more than one unique character"). Depending on that the full answer is either "not s or s[0] * len(s) == s" or "s and s[0] * len(s) == s"
  • Hammerite
    Hammerite over 11 years
    As a mathematician, my instinct is that the function should return True if s is the empty string. But the OP should really indicate.
  • Hammerite
    Hammerite over 11 years
    You say 'If the task is "check if all the symbols in a string are the same"... [then throwing an exception is fine]'. I disagree. All symbols in the empty string are the same, in the sense that you cannot show me two different symbols that are in the empty string.
  • Ellioh
    Ellioh over 11 years
    "All symbols in the empty string are the same"? I'd rather say that none of them are different. What is the main point: existence of equal ones, or absence of not equal? :-) So this is like a question "are points a special case of an infinitely short line?" Purely depends on initial set of definitions. So I find all the three options possible.
  • Barmar
    Barmar over 11 years
    How does the RE ^(.)\1*$ compare?
  • mgilson
    mgilson over 11 years
    @Barmar -- I don't know? Sorry, I'm not a regex guru, so you might need to spell this one out for me so I can add it to the test-suite.
  • Barmar
    Barmar over 11 years
    And I'm not a regular Python programmer. But something like re.match('^(1)\1*$', s).
  • mgilson
    mgilson over 11 years
    @Barmar -- Your regex solution is in the middle somewhere. Not bad.
  • Ellioh
    Ellioh over 11 years
    '^' in regex is not required for match(), and removing '^' makes the regex solution a little faster.
  • sotapme
    sotapme about 11 years
    Surely it should be s == len(s) * c[0] as it seems to me s == len(s) * s[0] just tests that a string is made up of the same character. Whereas the original OP was implying that there were 2 variables s(tring) to test against and (c)character to check for.
  • umbreonben
    umbreonben almost 5 years
    "no symbols mean an error, and exception is ok" -- according to who exactly?
  • Ellioh
    Ellioh over 4 years
    According to common Python conventions. Errors should raise exceptions. However, the key word is "if". If the task is "check if all the symbols in a string are the same", that at least assumes the symbols to exist. But the other versions of the task text do not. So, it's correct just to ask "What would you like to get in empty string case?" Everything is possible. :-)
  • Timo
    Timo over 3 years
    Means the asterisk "repetition of spaces" or rep. of chars here? The comparison is too short to understand "at heart" for me.
  • Ellioh
    Ellioh over 3 years
    It's repetition of the string. The string to be repeated is s[0] and consists of a single symbol.