In Python, how do I remove from a list any element containing certain kinds of characters?

Solution 1

I think your regex is incorrect. To match all entries that contain an all-caps word of three or more characters, use something like this with re.search:

regex = re.compile(r'\b[A-Z]{3,}\b')

With that you can filter using a list comprehension or the filter built-in function:

full = ['Organization name} ', '> (777) 777-7777} ', ' class="lsn-mB6 adr">1 Address, MA 02114 } ', ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4603114\'); ', 'Other organization} ', '> (555) 555-5555} ', ' class="lsn-mB6 adr">301 Address, MA 02121 } ', ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO CLAIM YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4715945\'); ', 'Organization} ']
import re
regex = re.compile(r'\b[A-Z]{3,}\b')
# use only one of the following lines, whichever you prefer
# (on Python 3, filter returns an iterator, hence the list() call)
filtered = list(filter(lambda i: not regex.search(i), full))
filtered = [i for i in full if not regex.search(i)]

This results in the following list (which I think is what you are looking for):

>>> pprint.pprint(filtered)
['Organization name} ',
 '> (777) 777-7777} ',
 ' class="lsn-mB6 adr">1 Address, MA 02114 } ',
 'Other organization} ',
 '> (555) 555-5555} ',
 ' class="lsn-mB6 adr">301 Address, MA 02121 } ',
 'Organization} ']
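
If the entries need to be removed from the existing list object rather than collected into a new one, slice assignment is one option. This is only a minimal sketch: the items list below is a shortened, made-up stand-in for the list in the question, and alias exists purely to show that the same object is modified.

```python
import re

# Shortened, hypothetical stand-in for the scraped list in the question.
items = [
    'Organization name} ',
    '> (777) 777-7777} ',
    ' class="lsn-mB6 adr">1 Address, MA 02114 } ',
    ' class="lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP ',
]

regex = re.compile(r'\b[A-Z]{3,}\b')

alias = items  # a second reference to the same list object

# Slice assignment replaces the contents of the existing list,
# so every reference to it sees the filtered result.
items[:] = [s for s in items if not regex.search(s)]

print(items)
```

Plain reassignment (items = [...]) would instead bind the name to a new list and leave any other references pointing at the unfiltered one.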

Solution 2

First, store your regex, then use a list comprehension:

import re

regex = re.compile(r'[^a-z]*[A-Z][^a-z]*\w{3,}')
okay_items = [x for x in all_items if not regex.match(x)]
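
One thing worth checking before relying on this: match only tries the pattern at the start of each string, while search scans the whole string. The sketch below applies the pattern above both ways to a shortened, made-up sample list (the names sample, kept_with_match and kept_with_search are just for illustration).

```python
import re

pattern = re.compile(r'[^a-z]*[A-Z][^a-z]*\w{3,}')

# Shortened, hypothetical stand-in for the list in the question.
sample = [
    'Organization name} ',
    '> (777) 777-7777} ',
    ' class="lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP ',
]

# match() anchors at position 0; search() may start anywhere in the string.
kept_with_match = [s for s in sample if not pattern.match(s)]
kept_with_search = [s for s in sample if not pattern.search(s)]

print(kept_with_match)
print(kept_with_search)
```

On this sample, the match version drops the organization name (a capital letter followed by three or more word characters at the start) and keeps the long element, which seems consistent with what the asker reports in the comments. The search version drops both, so neither gives the desired result with this particular pattern.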

Solution 3

Or the very same, but without compiling the regex:

from re import match

ll = ['Organization name} ', '> (777) 777-7777} ', ' class="lsn-mB6 adr">1 Address, MA 02114 } ', ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4603114\'); ', 'Other organization} ', '> (555) 555-5555} ', ' class="lsn-mB6 adr">301 Address, MA 02121 } ', ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO CLAIM YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4715945\'); ', 'Organization} ']

filteredData = [x for x in ll if not match(r'[^a-z]*[A-Z][^a-z]*\w{3,}', x)]

Edited:

from re import compile

rex = compile(r'[^a-z]*[A-Z][^a-z]*\w{3,}')
filteredData = [x for x in ll if not rex.match(x)]

Solution 4

Without regex:

def isNotMonster(x):
    return not any((len(word) > 2) and (word == word.upper()) for word in x.split())

okay_items = list(filter(isNotMonster, all_items))  # list() is needed on Python 3
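
Note that word == word.upper() is also true for tokens with no letters at all, such as '(777)' or '02114', so the function above flags phone numbers and addresses as well. A possible variant, sketched here with a hypothetical function name and a shortened, made-up sample list, uses str.isupper(), which requires at least one cased character:

```python
def has_all_caps_word(text, min_len=3):
    """Return True if any whitespace-separated token is all caps and long enough."""
    # str.isupper() is False for tokens without cased characters, e.g. '(777)'.
    return any(len(word) >= min_len and word.isupper() for word in text.split())


# Shortened, hypothetical stand-in for the list in the question.
sample = [
    'Organization name} ',
    '> (777) 777-7777} ',
    ' class="lsn-mB6 adr">1 Address, MA 02114 } ',
    ' class="lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP ',
]

okay_items = [s for s in sample if not has_all_caps_word(s)]
print(okay_items)
```

The min_len default of 3 mirrors the len(word) > 2 check in the original function, so a two-letter state code such as 'MA' is still not treated as a match.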
Author by

RSid

Updated on April 29, 2020

Comments

  • RSid
    RSid about 4 years

    Apologies if this is a simple question, I'm still pretty new to this, but I've spent a while looking for an answer and haven't found anything. I have a list that looks something like this horrifying mess:

    ['Organization name} ', '> (777) 777-7777} ', ' class="lsn-mB6 adr">1 Address, MA 02114 } ', ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4603114\'); ', 'Other organization} ', '> (555) 555-5555} ', ' class="lsn-mB6 adr">301 Address, MA 02121 } ', ' class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO CLAIM YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4715945\'); ', 'Organization} ']
    

    And I need to process it so that HTML.py can turn the information in it into a table. For some reason, HTML.py simply can't handle the monster elements (e.g. 'class="lsn-serpListRadius lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP if (typeof(serps) !== \'undefined\') serps.arrArticleIds.push(\'4603114\'); ', etc.). Fortunately for me, I don't actually care about the information in the monster elements and want to get rid of them.

    I tried writing a regex that would match all more-than-two-letter all-caps words, to identify the monster elements, and got this:

    re.compile('[^a-z]*[A-Z][^a-z]*\w{3,}')
    

    But I don't know how to apply that to deleting the elements containing matches to that regex from the list. How would I do that/is that the right way to go about it?

  • Amber
    Amber almost 13 years
    If you're going to be running the same regex against many items in a list, you should compile it. Granted, Python is usually smart enough to compile it for you and cache it, but it's good to be explicit.
  • RSid
    RSid almost 13 years
    This seemed like it should work, but for some reason it returns a list without the org names when using my original regex and when using F.J's it just spits out the same list I put in. Not sure why.
  • RSid
    RSid almost 13 years
    This returns only the names of the organizations--which actually is also helpful to me right now, so separately thanks, but it isn't what I was looking for.
  • NumenorForLife
    NumenorForLife almost 9 years
    Is there any difference in speed between the two lines?
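
On the speed question, a rough way to check is the timeit module. The sketch below assumes the regex from Solution 1 and uses a made-up sample list and repetition count; the absolute numbers depend on the machine and Python version, so it only prints whatever it measures.

```python
import re
import timeit

regex = re.compile(r'\b[A-Z]{3,}\b')

# Hypothetical sample data, repeated to give the timer something to measure.
sample = [
    'Organization name} ',
    '> (777) 777-7777} ',
    ' class="lsn-mB6 adr">1 Address, MA 02114 } ',
    ' class="lsn-fr">.2 Miles} MORE INFO YOUR LISTING MAP ',
] * 1000


def with_filter():
    return list(filter(lambda s: not regex.search(s), sample))


def with_comprehension():
    return [s for s in sample if not regex.search(s)]


# Run each variant the same number of times and report total elapsed seconds.
filter_time = timeit.timeit(with_filter, number=20)
comprehension_time = timeit.timeit(with_comprehension, number=20)

print(f"filter: {filter_time:.4f}s  comprehension: {comprehension_time:.4f}s")
```

Both variants call regex.search once per element, so any difference comes from the overhead of the lambda call versus the comprehension itself.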