How to extract all uppercase characters from a string in Python?

Solution 1

Using list comprehension:

>>> s = 'abcdefgABCDEFGHIJKLMNOP'
>>> ''.join([c for c in s if c.isupper()])
'ABCDEFGHIJKLMNOP'

Using generator expression:

>>> ''.join(c for c in s if c.isupper())
'ABCDEFGHIJKLMNOP'

You can also do it using regular expressions:

>>> import re
>>> re.sub('[^A-Z]', '', s)
'ABCDEFGHIJKLMNOP'
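
The approaches above agree on the sample string, but they can diverge once the input contains non-ASCII letters: str.isupper() is Unicode-aware in Python 3, while the character class [A-Z] matches only the 26 ASCII capitals. A minimal sketch, assuming Python 3 and a made-up sample string that is not part of the question:

```python
import re

# Hypothetical sample: 'abc', E-acute, 'def', N-tilde, 'Ghi'.
sample = 'abc\u00c9def\u00d1Ghi'

# str.isupper() is Unicode-aware, so the accented capitals are kept.
by_isupper = ''.join(c for c in sample if c.isupper())

# [^A-Z] deletes everything except ASCII capitals, so only 'G' survives.
by_regex = re.sub('[^A-Z]', '', sample)

print(by_isupper)
print(by_regex)
```

Which behavior is wanted depends on whether "upper" means ASCII A-Z or any uppercase letter.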

Solution 2

import string
s = 'abcdefgABCDEFGHIJKLMNOP'
s.translate(None, string.ascii_lowercase)  # Python 2 only; gives 'ABCDEFGHIJKLMNOP'

In Python 2, string.translate(s, table[, deletechars]) first deletes every character of s that appears in deletechars (a string of characters), then translates the remaining characters using table (which we are not using in this case).

To remove only the lowercase letters, pass string.ascii_lowercase as the characters to delete.

The table is None because, when the table is None, only the character deletion step is performed.
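
The two-argument form above is specific to Python 2. As a sketch of the same idea in Python 3, the deletion set can be passed as the third argument of str.maketrans; the name delete_lowercase and the second test string are illustrative only:

```python
import string

s = 'abcdefgABCDEFGHIJKLMNOP'

# In Python 3, str.translate takes a single table; the third argument of
# str.maketrans lists the characters to delete.
delete_lowercase = str.maketrans('', '', string.ascii_lowercase)

print(s.translate(delete_lowercase))  # ABCDEFGHIJKLMNOP

# Only lowercase letters are removed; digits and punctuation pass through.
mixed = 'a1-B2_c3'.translate(delete_lowercase)
print(mixed)  # 1-B2_3
```

The second call shows why deleting lowercase letters matches "extracting uppercase" only when the input consists of letters alone.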

Solution 3

Higher-order functions to the rescue!

filter(str.isupper, "abcdefgABCDEFGHIJKLMNOP")

EDIT: In case you don't know what filter does: filter takes a function and an iterable, and applies the function to every element of the iterable. It keeps the elements for which the function returns a true value and throws out the rest. Therefore, in Python 2, where filter returns a string when given a string, this will return "ABCDEFGHIJKLMNOP".
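
In Python 3, filter returns a lazy iterator rather than a string, so a join is needed to get a string back. A minimal sketch, assuming Python 3:

```python
s = 'abcdefgABCDEFGHIJKLMNOP'

# In Python 3, filter() returns a filter object (an iterator), not a str.
uppercase_iter = filter(str.isupper, s)
print(isinstance(uppercase_iter, str))  # False

# Joining the iterator produces the string.
result = ''.join(uppercase_iter)
print(result)  # ABCDEFGHIJKLMNOP
```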

Solution 4

You could use a more functional approach:

>>> s = 'abcdefgABCDEFGHIJKLMNOP'
>>> filter(str.isupper, s)
'ABCDEFGHIJKLMNOP'

Solution 5

Or use a regex; this is an easy answer:

import re
print(''.join(re.findall('[A-Z]+', my_string)))

Just for comparison:

In [6]: %timeit filter(str.isupper,my_list)
1000 loops, best of 3: 774 us per loop

In [7]: %timeit ''.join(re.findall('[A-Z]+',my_list))
1000 loops, best of 3: 563 us per loop

In [8]: %timeit re.sub('[^A-Z]', '', my_list)
1000 loops, best of 3: 869 us per loop

In [10]: %timeit ''.join(c for c in my_list if c.isupper())
1000 loops, best of 3: 1.05 ms per loop

So join plus findall is the fastest of these methods (per IPython %timeit on Python 2.6), using the same 10000-character string for every test.

edit: Or not

In [12]: %timeit  my_list.translate(None,string.ascii_lowercase)
10000 loops, best of 3: 51.6 us per loop
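
The measurements above were taken with IPython on Python 2.6. To repeat the comparison on another interpreter, something like the following sketch with the standard timeit module could be used. It assumes Python 3; the input string and the names candidates and delete_lowercase are made up for illustration, and no particular ranking is implied:

```python
import re
import string
import timeit

# Hypothetical test input: 10,000 characters, half lowercase, half uppercase.
my_string = 'abcdeABCDE' * 1000

delete_lowercase = str.maketrans('', '', string.ascii_lowercase)

candidates = {
    'filter + join': lambda: ''.join(filter(str.isupper, my_string)),
    'findall + join': lambda: ''.join(re.findall('[A-Z]+', my_string)),
    're.sub': lambda: re.sub('[^A-Z]', '', my_string),
    'genexp + join': lambda: ''.join(c for c in my_string if c.isupper()),
    'translate': lambda: my_string.translate(delete_lowercase),
}

# All candidates should agree on this letters-only input before timing them.
results = {name: func() for name, func in candidates.items()}
assert len(set(results.values())) == 1

for name, func in candidates.items():
    # Best of 3 repeats of 100 calls each, reported per call in microseconds.
    best = min(timeit.repeat(func, number=100, repeat=3)) / 100
    print(f'{name:15s} {best * 1e6:8.1f} us per call')
```

Results will vary with the interpreter, platform, and input data, so it is worth running this on the actual target setup.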
Author by

O.rka

I am an academic researcher studying machine-learning and microorganisms

Updated on July 09, 2022

Comments

  • O.rka
    O.rka almost 2 years
    #input
    my_string = 'abcdefgABCDEFGHIJKLMNOP'
    

    how would one extract all the UPPER from a string?

    #output
    my_upper = 'ABCDEFGHIJKLMNOP'
    
  • mgilson
    mgilson about 11 years
    I thought about posting this, but ultimately, it fails in too many cases (what about punctuation, non-printing characters, etc.)
  • abarnert
    abarnert about 11 years
    Removing all lowercase is only the same as subtracting all uppercase when the data is nothing but letters. The OP's one sample is all letters, so this might be appropriate—but not without explaining the difference.
  • user4815162342
    user4815162342 about 11 years
    this will return a list, not a string, though
  • abarnert
    abarnert about 11 years
    You need to add a join to make this work. (If you're absolutely sure that all of the uppercase characters are in a single run, you could use [0] instead, of course.)
  • Joran Beasley
    Joran Beasley about 11 years
    fixed :P ... just joined the output at the end
  • Joran Beasley
    Joran Beasley about 11 years
    this is better : filter(str.isupper,"abvABC")
  • Joran Beasley
    Joran Beasley about 11 years
    filter(str.isupper,"abvABC") lambdas slow down filters ... use builtins when you can :)
  • abarnert
    abarnert about 11 years
    You need a join here. Otherwise, you're going to return a list (2.x) or a filter iterator (3.x), not a string.
  • abarnert
    abarnert about 11 years
    This one also needs a join, since the OP wants a string, not a list or a filter iterator.
  • hatkirby
    hatkirby about 11 years
    This does actually return a string.
  • Joran Beasley
    Joran Beasley about 11 years
    this is by far the most time efficient method of all the ones given ... assuming it works for OP use case ...
  • abarnert
    abarnert about 11 years
    Also, it's a bit misleading to call filter "more functional" than a comprehension/genexp, at least without explanation. The language Python borrowed the latter from is Haskell, after all.
  • abarnert
    abarnert about 11 years
    If you need to deal with non-letters, but still only with ASCII, you can define ascii_nonuppercase as, e.g., ''.join(c for c in string.printable if c not in string.ascii_uppercase), or just '0123456789abcdefghijklmnopqrstuvwxyz!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', and then use that. If you need to deal with Unicode that's a non-starter, but otherwise, given Joran Beasley's timing, the slight extra complexity might be worth it.
  • abarnert
    abarnert about 11 years
    I wouldn't make any strong proclamations about efficiency based on a 10% difference without testing multiple Python versions, platforms, etc., and different input data. For example, I get similar results with CPython 2.7.2, but on 3.3.0, the genexp beats the regex by 5%, while with PyPy 1.9.0, the filter beats it by 20%. The order-of-magnitude gain of translate is more likely to be trustworthy, but even that drops to a 2:1 gain in a quick test with PyPy.
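
The ascii_nonuppercase idea from the comments can be sketched for Python 3 as follows. This is only an illustration under the commenter's ASCII-only assumption; the names delete_nonuppercase and text are hypothetical:

```python
import string

# Every printable ASCII character that is not an uppercase letter.
ascii_nonuppercase = ''.join(
    c for c in string.printable if c not in string.ascii_uppercase
)

# Python 3 form of translate: the third maketrans argument is the delete set.
delete_nonuppercase = str.maketrans('', '', ascii_nonuppercase)

# Hypothetical input containing digits, punctuation, and whitespace.
text = 'Hello, World! 123 ABC'
extracted = text.translate(delete_nonuppercase)
print(extracted)  # HWABC
```

Non-ASCII characters are not in the delete set, so they would pass through unchanged, which is the Unicode limitation the comment mentions.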