Using regex to findall lowercase letters in string append to list. Python

10,725

Sounds like (I may be misunderstanding your question) you just need to capture runs of lowercase letters, rather than each individual lowercase letter. This is easy: just add the + quantifier to your regular expression.

for seq in sequences:
    lower_output.append(re.findall("[a-z]+", seq)) # add substrings

The + quantifier specifies that you want "at least one, and as many as you can find in a row" of the preceding expression (in this case '[a-z]'). So this will capture each full run of lowercase letters as a single match, which should cause them to appear as you want them to in your output lists.
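To make the difference concrete, here is a minimal sketch comparing the pattern with and without the quantifier on one of the sequences from the question. The variable names `single_letters` and `runs` are only illustrative.

```python
import re

seq = "FEGFEGFEGwowhelloFEGFEGonemoreFEG"

# Without a quantifier, every lowercase letter is its own match.
single_letters = re.findall("[a-z]", seq)

# With +, each maximal run of lowercase letters is one match.
runs = re.findall("[a-z]+", seq)

print(single_letters[:3])  # ['w', 'o', 'w']
print(runs)                # ['wowhello', 'onemore']
```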

It gets a little bit uglier if you want to preserve your list-of-list structure and get the indices as well, but it's still very simple:

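lower_output, lower_indx = [], []  # start from empty result lists

for seq in sequences:
    # finditer returns a one-shot iterator; wrap it in list() so it can be looped over twice.
    matches = list(re.finditer("[a-z]+", seq))
    lower_output.append([match.group(0) for match in matches]) # add substrings
    lower_indx.append([match.start(0) for match in matches]) # add indices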

print(lower_output)
>>> [['defgdefgdefg'], ['wowhello', 'onemore'], []]

print(lower_indx)
>>> [[9], [9, 23], []]
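The question also asks for an upper_output list with the uppercase letters joined together. One possible way to get all three lists in a single loop is sketched below, assuming that "uppercase" simply means everything left once the lowercase runs are removed. The compiled pattern name `LOWER_RUN` is an illustrative choice, not something from the question.

```python
import re

sequences = [
    "CABCABCABdefgdefgdefgCABCAB",
    "FEGFEGFEGwowhelloFEGFEGonemoreFEG",
    "NONEARELOWERCASE",
]

# Illustrative name: one compiled pattern reused for matching and substitution.
LOWER_RUN = re.compile("[a-z]+")

upper_output, lower_output, lower_indx = [], [], []
for seq in sequences:
    # Materialize the matches so they can be iterated more than once.
    matches = list(LOWER_RUN.finditer(seq))
    lower_output.append([m.group(0) for m in matches])  # lowercase runs
    lower_indx.append([m.start() for m in matches])     # where each run starts
    # Assumption: dropping the lowercase runs leaves only the uppercase letters.
    upper_output.append(LOWER_RUN.sub("", seq))

print(upper_output)
print(lower_output)
print(lower_indx)
```

If the sequences could contain characters other than ASCII letters, the substitution step would keep them in upper_output, so the pattern would need adjusting for that case.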
Author by

O.rka

I am an academic researcher studying machine-learning and microorganisms

Updated on June 05, 2022

Comments

  • O.rka
    O.rka almost 2 years

    I'm looking for a way to get the lowercase values out of a string that has both uppercase and potentially lowercase letters

    here's an example

    sequences = ['CABCABCABdefgdefgdefgCABCAB','FEGFEGFEGwowhelloFEGFEGonemoreFEG','NONEARELOWERCASE'] #sequences with uppercase and potentially lowercase letters
    

    this is what i want to output

    upper_output = ['CABCABCABCABCAB','FEGFEGFEGFEGFEGFEG','NONEARELOWERCASE'] #the upper case letters joined together
    lower_output = [['defgdefgdefg'],['wowhello','onemore'],[]] #the lower case letters in lists within lists
    lower_indx = [[9],[9,23],[]] #where the lower case values occur in the original sequence
    

    so i want the lower_output list to be a LIST of SUBLISTS. the SUBLISTS would have all the strings of lowercase letters.

    i was thinking of using regex . . .

    import re
    
    lower_indx = []
    
    for seq in sequences:
        lower_indx.append(re.findall("[a-z]", seq).start())
    
    print lower_indx
    

    for the lowercase lists i was trying:

    lower_output = []
    
    for seq in sequences:
        temp = ''
        temp = re.findall("[a-z]", seq)
        lower_output.append(temp)
    
    print lower_output
    

    but the values are not in separate lists (i still need to join them)

    [['d', 'e', 'f', 'g', 'd', 'e', 'f', 'g', 'd', 'e', 'f', 'g'], ['w', 'o', 'w', 'h', 'e', 'l', 'l', 'o', 'o', 'n', 'e', 'm', 'o', 'r', 'e'], []]
    
  • O.rka
    O.rka about 11 years
    i didn't even know that existed. any idea on how to get the index value ?
  • Henry Keiter
    Henry Keiter about 11 years
    @draconisthe0ry Certainly; I've updated my answer to include those.