Check if any (all) character of a string is in a given range

Solution 1

You can speed up the check by using a set, whose membership test is O(1) on average. This pays off especially if you check multiple strings against the same range, since building the set requires one iteration of its own. You can then use all, which stops at the first failing character and fits this problem better than any:

import string

upper = set(string.ascii_uppercase)  # same characters as the question's range(65, 91)
letters = set(string.ascii_letters)  # use this instead to allow lowercase as well

if all(x in upper for x in my_string1):
    # my_string1 contains only uppercase ASCII letters
    pass

Of course, any all construct can be transformed into an any via De Morgan's law:

if not any(x not in upper for x in my_string1):
    # my_string1 contains only uppercase ASCII letters
    pass

Update:

One good pure set-based approach that does not require a complete iteration, as pointed out by Artyer:

if upper.issuperset(my_string1):
    # my_string1 contains only uppercase ASCII letters
    pass
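
The three variants above can be wrapped in small functions to confirm that they agree. This is a minimal sketch: the function names and the sample string "AUSTRIA" are illustrative and not part of the original answer, and the allowed set is assumed to be the uppercase ASCII letters.

```python
import string

# Assumption: the allowed characters are the uppercase ASCII letters.
UPPER = frozenset(string.ascii_uppercase)

def check_with_all(s, allowed=UPPER):
    # Stops at the first character that is not allowed.
    return all(ch in allowed for ch in s)

def check_with_any(s, allowed=UPPER):
    # De Morgan form of the same test.
    return not any(ch not in allowed for ch in s)

def check_with_issuperset(s, allowed=UPPER):
    # issuperset() accepts any iterable, including a plain string.
    return allowed.issuperset(s)

CHECKS = (check_with_all, check_with_any, check_with_issuperset)

for sample in ("AUSTRIA", "AustriЯ", "Австрия", ""):
    results = {check(sample) for check in CHECKS}
    # All three formulations must give the same answer.
    assert len(results) == 1
    print(repr(sample), results.pop())
```

Note that the empty string passes all three checks, because there is no character that violates the condition.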

Solution 2

Another way, just as @schwobaseggl suggests, but using set methods throughout:

import string
letters = string.ascii_letters  # upper- and lowercase ASCII letters
if set(my_string).issubset(letters):
    # my_string contains only ASCII letters
    pass
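
One detail worth spelling out: issubset accepts any iterable, which is why a plain string works as its argument, whereas the <= operator requires a set on both sides. A small sketch, with illustrative function names that are not from the original answer:

```python
import string

LETTERS = string.ascii_letters  # a plain str, not a set

def only_letters_method(s):
    # issubset() accepts any iterable, so the str LETTERS is fine here.
    return set(s).issubset(LETTERS)

def only_letters_operator(s):
    # The <= operator needs sets on both sides, so LETTERS is converted first.
    return set(s) <= set(LETTERS)

for sample in ("tralala", "AustriЯ"):
    assert only_letters_method(sample) == only_letters_operator(sample)
    print(sample, only_letters_method(sample))
```

Either way, set(s) is built from the whole string first, so this variant does not stop early at the first offending character.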

Solution 3

re appears to be quite fast:

import re

# to check whether any outside ranges (->MatchObject) / all in ranges (->None)
nonletter = re.compile('[^a-zA-Z]').search

# to check whether any in ranges (->MatchObject) / all outside ranges (->None)
letter = re.compile('[a-zA-Z]').search

bool(nonletter(myString1))
# True

bool(nonletter(myString2))
# True

bool(nonletter(myString2[:-1]))
# False
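
The same idea can be generalized to an arbitrary code point range by building the character class from the range bounds. This is a sketch under assumptions: make_range_checker is a hypothetical helper that is not part of the original answer, and it treats both bounds as inclusive.

```python
import re

def make_range_checker(lo, hi):
    """Return a function that tells whether every character of a string
    has a code point in the inclusive range lo..hi (hypothetical helper)."""
    # Negated character class: matches any character outside the range.
    pattern = "[^{}-{}]".format(re.escape(chr(lo)), re.escape(chr(hi)))
    outside = re.compile(pattern).search
    # No match means no character lies outside the range.
    return lambda s: outside(s) is None

is_upper_ascii = make_range_checker(65, 90)          # 'A'..'Z'
is_basic_cyrillic = make_range_checker(0x0410, 0x044F)  # 'А'..'я'

print(is_upper_ascii("AUSTRIA"))
print(is_upper_ascii("AustriЯ"))
print(is_basic_cyrillic("Австрия"))
```

Compiling the pattern once and reusing the bound search method mirrors what the answer does with nonletter.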

Benchmarks for OP's two examples and a positive one (set is @schwobaseggl's solution, setset is @DanielSanchez's):

Австрия
re               0.48832818 ± 0.09022105 µs
set              0.58745548 ± 0.01759877 µs
setset           0.81759223 ± 0.03595184 µs
AustriЯ
re               0.51960442 ± 0.01881561 µs
set              1.03043942 ± 0.02453405 µs
setset           0.54060076 ± 0.01505265 µs
tralala
re               0.27832978 ± 0.01462306 µs
set              0.88285526 ± 0.03792728 µs
setset           0.43238688 ± 0.01847240 µs

Benchmark code:

import types
from timeit import timeit
import re
import string
import numpy as np

def mnsd(trials):
    return '{:1.8f} \u00b1 {:10.8f} \u00b5s'.format(np.mean(trials), np.std(trials))

nonletter = re.compile('[^a-zA-Z]').search
letterset = set(string.ascii_letters)

def f_re(stri):
    return not nonletter(stri)

def f_set(stri):
    return all(x in letterset for x in stri)

def f_setset(stri):
    return set(stri).issubset(letterset)

for stri in ('Австрия', 'AustriЯ', 'tralala'):
    ref = f_re(stri)
    print(stri)
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert ref == func(stri)
            print("{:16s}".format(name[2:]), mnsd([timeit(
                'f(stri)', globals={'f':func, 'stri':stri}, number=1000) * 1000 for i in range(1000)]))

        except Exception:
            print("{:16s} apparently failed".format(name[2:]))

Solution 4

There's no way to avoid iterating. However, you can certainly make it more efficient by testing not 65 <= ord(s) <= 90 rather than checking membership in a range. Note that the upper bound is 90 here, because range(65, 91) excludes 91.
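
A minimal sketch of the chained comparison, assuming inclusive bounds that default to the uppercase ASCII letters; the function name is illustrative and not from the original answer:

```python
def all_in_code_point_range(s, lo=65, hi=90):
    # Chained comparison with inclusive bounds; the defaults cover 'A'..'Z',
    # the same characters as range(65, 91) in the question.
    return all(lo <= ord(ch) <= hi for ch in s)

print(all_in_code_point_range("AUSTRIA"))
print(all_in_code_point_range("AustriЯ"))

# A different, broader check: str.isascii() (Python 3.7+) tests whether every
# character is in the whole ASCII range 0..127, not just the letters.
print("AustriЯ".isascii())
```

If the goal is only to detect non-ASCII characters rather than non-letters, str.isascii() avoids the explicit loop altogether.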

Author: Mikhail_Sam

Updated on June 15, 2022

Comments

  • Mikhail_Sam, over 1 year ago

    I have a string containing Unicode symbols (Cyrillic):

    myString1 = 'Австрия'
    myString2 = 'AustriЯ'
    

    I want to check if all the elements in the string are English (ASCII). Now I'm using a loop:

    for char in myString1:
        if ord(char) not in range(65, 91):
            break
    

    So when I find the first non-English character, I break out of the loop. But as the given example shows, a string can contain many English characters followed by Unicode ones at the end, in which case I end up checking the whole string. Furthermore, if the entire string is English, I still check every character.

    Is there any more efficient way to do this? I'm thinking about something like:

    if any(myString[:]) is not in range(65,91)