How to extract all uppercase characters from a string in Python?
Solution 1
Using list comprehension:
>>> s = 'abcdefgABCDEFGHIJKLMNOP'
>>> ''.join([c for c in s if c.isupper()])
'ABCDEFGHIJKLMNOP'
Using generator expression:
>>> ''.join(c for c in s if c.isupper())
'ABCDEFGHIJKLMNOP'
You can also do it using regular expressions:
>>> import re
>>> re.sub('[^A-Z]', '', s)
'ABCDEFGHIJKLMNOP'
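A minimal Python 3 sketch of the three approaches above, wrapped in helper functions whose names are hypothetical (they are not from the answer). It also illustrates one difference that may matter: str.isupper() is Unicode-aware, while the character class [A-Z] only matches ASCII letters.

```python
import re

def upper_listcomp(s):
    # Keep every character for which str.isupper() is true (Unicode-aware).
    return ''.join([c for c in s if c.isupper()])

def upper_genexp(s):
    # Same test, but without building an intermediate list.
    return ''.join(c for c in s if c.isupper())

def upper_regex(s):
    # Delete everything that is not an ASCII letter A-Z.
    return re.sub('[^A-Z]', '', s)

s = 'abcdefgABCDEFGHIJKLMNOP'
print(upper_genexp(s))       # ABCDEFGHIJKLMNOP
print(upper_regex(s))        # ABCDEFGHIJKLMNOP

# Non-ASCII uppercase letters survive isupper() but not [A-Z].
print(upper_genexp('aÉbZ'))  # ÉZ
print(upper_regex('aÉbZ'))   # Z
```

If the input is known to be plain ASCII, the three helpers should agree; otherwise the choice depends on whether letters such as É count as uppercase for your purposes.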
Solution 2
# Python 2 only: str.translate(table, deletechars)
import string
s = 'abcdefgABCDEFGHIJKLMNOP'
s.translate(None, string.ascii_lowercase)
The string.translate(s, table[, deletechars]) function deletes from the string all characters that are in deletechars, then translates the result using table (which we are not using in this case).
To remove only the lowercase letters, pass string.ascii_lowercase as the characters to be deleted.
The table is None because, when the table is None, only the character deletion step is performed.
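On Python 3, str.translate takes a single table argument, so the deletion set has to be built with str.maketrans. A small sketch, assuming Python 3 (the variable names delete_lower and mixed are chosen for illustration only):

```python
import string

s = 'abcdefgABCDEFGHIJKLMNOP'

# Python 3: the third argument of str.maketrans lists characters to delete.
delete_lower = str.maketrans('', '', string.ascii_lowercase)
result = s.translate(delete_lower)
print(result)  # ABCDEFGHIJKLMNOP

# Caveat: only lowercase ASCII letters are removed, so digits, spaces
# and punctuation pass through untouched.
mixed = 'ab1-CD e!F'.translate(delete_lower)
print(mixed)  # 1-CD !F
```

The second call shows why deleting lowercase letters only equals extracting uppercase letters when the input contains nothing but letters.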
Solution 3
Higher order functions to the rescue!
filter(str.isupper, "abcdefgABCDEFGHIJKLMNOP")
EDIT: In case you don't know what filter does: filter takes a function and an iterable, and applies the function to every element of the iterable. It keeps the elements for which the function returns true and discards the rest. Therefore, on Python 2 (where filtering a str gives back a str), this will return "ABCDEFGHIJKLMNOP".
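On Python 3, filter returns a lazy iterator rather than a string, so the result has to be joined. A minimal sketch of that variant (the names uppers and result are illustrative):

```python
s = "abcdefgABCDEFGHIJKLMNOP"

# Python 3: filter() is lazy and yields the matching characters one by one.
uppers = filter(str.isupper, s)

# Join the surviving characters back into a single string.
result = ''.join(uppers)
print(result)  # ABCDEFGHIJKLMNOP
```

Wrapping the call in ''.join(...) is harmless on Python 2 as well, so it is probably the safer spelling if the code may run on either version.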
Solution 4
You could use a more functional approach
>>> s = 'abcdefgABCDEFGHIJKLMNOP'
>>> filter(str.isupper, s)
'ABCDEFGHIJKLMNOP'
Solution 5
or use regex ... this is an easy answer
import re
print(''.join(re.findall('[A-Z]+', my_string)))
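To see why the join is needed, here is a short sketch with a sample input chosen for illustration (it is not the input from the question): findall returns one list entry per run of uppercase letters, not a single string.

```python
import re

my_string = 'abcXYZdefGHI'  # sample input, chosen for illustration

# findall returns each run of consecutive uppercase ASCII letters...
runs = re.findall('[A-Z]+', my_string)
print(runs)           # ['XYZ', 'GHI']

# ...so the runs still have to be joined into one string.
joined = ''.join(runs)
print(joined)         # XYZGHI
```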
just for comparison
In [6]: %timeit filter(str.isupper,my_list)
1000 loops, best of 3: 774 us per loop
In [7]: %timeit ''.join(re.findall('[A-Z]+',my_list))
1000 loops, best of 3: 563 us per loop
In [8]: %timeit re.sub('[^A-Z]', '', my_list)
1000 loops, best of 3: 869 us per loop
In [10]: %timeit ''.join(c for c in my_list if c.isupper())
1000 loops, best of 3: 1.05 ms per loop
so join plus findall is the fastest of these methods (per IPython %timeit on Python 2.6), using the same 10000-character string for every test
edit: Or not
In [12]: %timeit my_list.translate(None,string.ascii_lowercase)
10000 loops, best of 3: 51.6 us per loop
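The comparison can be reproduced outside IPython with the timeit module. This is only a sketch under assumptions: it targets Python 3 (so translate uses str.maketrans and filter is joined), and the 10000-character test string is made up, since the original input is not shown. Absolute numbers will differ by machine and interpreter.

```python
import re
import string
import timeit

# Hypothetical test input: 10000 characters, half lowercase, half uppercase.
my_string = 'abcdeABCDE' * 1000

# Python 3 deletion table for the translate variant.
delete_lower = str.maketrans('', '', string.ascii_lowercase)

candidates = {
    'filter': lambda: ''.join(filter(str.isupper, my_string)),
    'findall': lambda: ''.join(re.findall('[A-Z]+', my_string)),
    'sub': lambda: re.sub('[^A-Z]', '', my_string),
    'genexp': lambda: ''.join(c for c in my_string if c.isupper()),
    'translate': lambda: my_string.translate(delete_lower),
}

# Sanity check: every method should produce the same string on this input.
results = {name: fn() for name, fn in candidates.items()}
assert len(set(results.values())) == 1

runs = 200
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=runs)
    print('%-10s %8.1f us per call' % (name, seconds / runs * 1e6))
```

The equality check only holds because this input is all ASCII letters; with punctuation or non-ASCII text the methods would not be interchangeable, and timing them against each other would be less meaningful.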
O.rka
Updated on July 09, 2022
Comments
-
O.rka almost 2 years
# input
my_string = 'abcdefgABCDEFGHIJKLMNOP'
how would one extract all the UPPER from a string?
# output
my_upper = 'ABCDEFGHIJKLMNOP'
-
mgilson about 11 years: I thought about posting this, but ultimately, it fails in too many cases (what about punctuation, non-printing characters, etc.)
-
abarnert about 11 years: Removing all lowercase is only the same as extracting all uppercase when the data is nothing but letters. The OP's one sample is all letters, so this might be appropriate, but not without explaining the difference.
-
user4815162342 about 11 years: this will return a list, not a string, though
-
abarnert about 11 years: You need to add a join to make this work. (If you're absolutely sure that all of the uppercase characters are in a single run, you could use [0] instead, of course.)
-
Joran Beasley about 11 years: fixed :P ... just joined the output at the end
-
Joran Beasley about 11 years: this is better: filter(str.isupper,"abvABC")
-
Joran Beasley about 11 years:
filter(str.isupper,"abvABC")
lambdas slow down filters ... use builtins when you can :)
-
abarnert about 11 years: You need a join here. Otherwise, you're going to return a list (2.x) or a filter iterator (3.x), not a string.
-
abarnert about 11 years: This one also needs a join, since the OP wants a string, not a list or a filter iterator.
-
hatkirby about 11 years: This does actually return a string.
-
Joran Beasley about 11 years: this is by far the most time efficient method of all the ones given ... assuming it works for OP use case ...
-
abarnert about 11 years: Also, it's a bit misleading to call filter "more functional" than a comprehension/genexp, at least without explanation. The language Python borrowed the latter from is Haskell, after all.
-
abarnert about 11 years: If you need to deal with non-letters, but still only with ASCII, you can define ascii_nonuppercase as, e.g., ''.join(c for c in string.printable if c not in string.ascii_uppercase), or just '0123456789abcdefghijklmnopqrstuvwxyz!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', and then use that. If you need to deal with Unicode that's a non-starter, but otherwise, given Joran Beasley's timing, the slight extra complexity might be worth it.
-
abarnert about 11 years: I wouldn't make any strong proclamations about efficiency based on a 10% difference without testing multiple Python versions, platforms, etc., and different input data. For example, I get similar results with CPython 2.7.2, but on 3.3.0, the genexp beats the regex by 5%, while with PyPy 1.9.0, the filter beats it by 20%. The order-of-magnitude gain of translate is more likely to be trustworthy, but even that drops to a 2:1 gain in a quick test with PyPy.