Split Strings into words with multiple word boundary delimiters

830,024

Solution 1

A case where regular expressions are justified:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
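
For reference, here is a small sketch of the same idea wrapped in a function and run on a few edge cases. The helper name `words` is made up for illustration, and the example assumes Python 3, where `\w` matches Unicode letters in `str` patterns by default:

```python
import re

def words(text):
    """Return runs of word characters and apostrophes (hypothetical helper)."""
    return re.findall(r"[\w']+", text)

# Non-ASCII letters are kept as part of the word in Python 3.
print(words("Hey, you - what are you doing here María!?"))

# The apostrophe in the character class keeps contractions together.
print(words("don't stop"))

# \w also matches the underscore, so this stays a single word.
print(words("this_thing"))
```

Note that this collects words rather than splitting on a chosen set of delimiters, so anything outside `[\w']` acts as a separator.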

Solution 2

re.split()

re.split(pattern, string[, maxsplit=0])

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', maxsplit=1)
['Words', 'words, words.']
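
As the first example shows, re.split() leaves an empty string wherever the text starts or ends with a delimiter. One possible way to drop those, sketched with a hypothetical helper named `split_nonword`:

```python
import re

def split_nonword(text):
    """Split on runs of non-word characters and drop empty pieces (hypothetical helper)."""
    # re.split returns '' at either end when the text starts or ends with a delimiter.
    return [piece for piece in re.split(r"\W+", text) if piece]

print(split_nonword(" a b c "))
print(split_nonword("Words, words, words."))
```

Calling `.strip()` first only helps when the leading and trailing delimiters are whitespace; the filter covers any delimiter.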

Solution 3

Another quick way to do this without a regexp is to replace the characters first, as below:

>>> 'a;bcd,ef g'.replace(';',' ').replace(',',' ').split()
['a', 'bcd', 'ef', 'g']
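
If the set of separators is longer, the chained replace() calls could be folded into a loop. This is only a sketch; the function name `split_on` and its parameters are assumptions for illustration:

```python
def split_on(text, separators):
    """Replace each separator with a space, then split on whitespace (hypothetical helper)."""
    for sep in separators:
        text = text.replace(sep, " ")
    # split() with no argument also collapses runs of whitespace,
    # so whitespace is always treated as a separator here.
    return text.split()

print(split_on("a;bcd,ef g", ";,"))
```

Because the final split() always splits on whitespace, this approach does not fit cases where spaces must be preserved inside the pieces.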

Solution 4

So many answers, yet I can't find any solution that efficiently does what the title of the question literally asks for: splitting on multiple possible separators. Many answers instead split on anything that is not a word character, which is different. So here is an answer to the question in the title, relying on Python's standard and efficient re module:

>>> import re  # Will be splitting on: , <space> - ! ? :
>>> list(filter(None, re.split(r"[, \-!?:]+", "Hey, you - what are you doing here!?")))
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

where:

  • the […] matches one of the separators listed inside,
  • the \- in the regular expression is here to prevent the special interpretation of - as a character range indicator (as in A-Z),
  • the + skips one or more delimiters (it could be omitted thanks to the filter(), but this would unnecessarily produce empty strings between matched single-character separators), and
  • filter(None, …) removes the empty strings possibly created by leading and trailing separators (since empty strings have a false boolean value).

This re.split() precisely "splits with multiple separators", as asked for in the question title.

This solution is furthermore immune to the problems with non-ASCII characters in words found in some other solutions (see the first comment to ghostdog74's answer).

The re module is much more efficient (in speed and concision) than doing Python loops and tests "by hand"!
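
When the separators are not known in advance, or some are longer than one character (such as "--" or ":="), the pattern could be built from a list instead of being written by hand. A minimal sketch, assuming a hypothetical helper named `split_on_separators`; re.escape() takes care of characters such as - and ? that are special in a regular expression:

```python
import re

def split_on_separators(text, separators):
    """Split text on any of the given separator strings (hypothetical helper)."""
    # Longest first, so that "--" is tried before "-" when both are listed.
    ordered = sorted(separators, key=len, reverse=True)
    # Alternation of escaped separators; the trailing + merges adjacent separators.
    pattern = "(?:" + "|".join(re.escape(sep) for sep in ordered) + ")+"
    # Drop the empty strings produced by leading or trailing separators.
    return [piece for piece in re.split(pattern, text) if piece]

print(split_on_separators("Hey, you - what are you doing here!?",
                          [",", " ", "-", "!", "?", ":"]))
print(split_on_separators("a--b:=c,d", ["--", ":=", ","]))
print(split_on_separators("First Name,Last Name", [","]))
```

Since only the listed separators are used, spaces are preserved unless a space is itself in the list.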

Solution 5

Another way, without regex

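import string
punc = string.punctuation
thestring = "Hey, you - what are you doing here!?"
s = list(thestring)
# Drop every punctuation character, then split on whitespace.
# Words joined only by punctuation (e.g. "a,b") are merged, not split.
words = ''.join([o for o in s if o not in punc]).split()
print(words)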
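
The list comprehension above deletes punctuation, so words separated only by punctuation get glued together. A variant that replaces each punctuation character with a space instead, along the lines suggested in the comments, could look like this in Python 3. The helper name `words_without_punctuation` is made up for illustration; str.maketrans() is the Python 3 static method, not the old string.maketrans() function:

```python
import string

# Translation table mapping every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def words_without_punctuation(text):
    """Replace ASCII punctuation with spaces, then split on whitespace (hypothetical helper)."""
    return text.translate(PUNCT_TO_SPACE).split()

print(words_without_punctuation("Hey, you - what are you doing here!?"))
print(words_without_punctuation("There was a big cave-in"))
```

This still splits contractions such as "don't" at the apostrophe, and it only knows about the ASCII punctuation listed in string.punctuation.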
Author by

ooboo

Updated on January 19, 2021

Comments

  • ooboo
    ooboo over 3 years

    I think what I want to do is a fairly common task but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

    "Hey, you - what are you doing here!?"
    

    should be

    ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
    

    But Python's str.split() only works with one argument, so I have all words with the punctuation after I split with whitespace. Any ideas?

  • ooboo
    ooboo almost 15 years
    Thanks. Still interested, though - how can I implement the algorithm used in this module? And why does it not appear in the string module?
  • RichieHindle
    RichieHindle almost 15 years
    I don't know why the string module doesn't have a multi-character split. Maybe it's considered complex enough to be in the realm of regular expressions. As for "how can I implement the algorithm", I'm not sure what you mean... it's there in the re module - just use it.
  • RichieHindle
    RichieHindle almost 15 years
    Regular expressions can be daunting at first, but are very powerful. The regular expression '\w+' means "a word character (a-z etc.) repeated one or more times". There's a HOWTO on Python regular expressions here: amk.ca/python/howto/regex
  • ooboo
    ooboo almost 15 years
    I got that - I don't mean how to use the re module (it's pretty complicated in itself) but how is it implemented? split() is rather straightforward to program manually, this is much more difficult...
  • RichieHindle
    RichieHindle almost 15 years
    You want to know how the re module itself works? I can't help you with that I'm afraid - I've never looked at its innards, and my Computer Science degree was a very long time ago. 8-)
  • ooboo
    ooboo almost 15 years
    I'm doing my CS1 so I've got a long way to go... It seems very difficult, at first glance, actually, harder than TSP etc. :)
  • crizCraig
    crizCraig almost 13 years
    I like this. Just a note, the order of separators matters. Sorry if that's obvious.
  • Vlad the Impala
    Vlad the Impala over 12 years
    Neat Haskell solution, but IMO this can be written more clearly without mappend in Python.
  • ninjagecko
    ninjagecko over 12 years
    @Goose: the point was that the 2-line function map_then_append can be used to make a problem a 2-liner, as well as many other problems much easier to write. Most of the other solutions use the regular expression re module, which isn't python. But I have been unhappy with how I make my answer seem inelegant and bloaty when it's really concise... I'm going to edit it...
  • heltonbiker
    heltonbiker over 12 years
    @ooboo : If you are into CS, so you should want to master regex as much as a samurai would want to master a sharp sword.
  • Emil Stenström
    Emil Stenström over 12 years
    This solution has the advantage of being easily adapted to split on underscores too, something the findall solution does not allow: print re.split("\W+|_", "Testing this_thing") yields: ['Testing', 'this', 'thing']
  • Christopher Ramírez
    Christopher Ramírez almost 12 years
    This solution is actually better than the accepted one. It works with non-ASCII chars; try "Hey, you - what are you doing here María!?". The accepted solution will not work with the previous example.
  • Andy Baker
    Andy Baker over 11 years
    Quick and dirty but perfect for my case (my separators were a small, known set)
  • Rafael S. Calsaverini
    Rafael S. Calsaverini over 11 years
    I made a test here, and if you need to use unicode, using patt = re.compile(ur'\w+', re.UNICODE); patt.findall(S) is faster than translate, because you must encode the string before applying transform, and decode each item in the list after the split to go back to unicode.
  • cedbeu
    cedbeu about 11 years
    I think there is a small issue here ... Your code will append characters that are separated with punctuation and thus won't split them ... If I'm not wrong, your last line should be: ''.join([o if not o in string.punctuation else ' ' for o in s]).split()
  • Daniel H
    Daniel H about 11 years
    The regular expression library can be made to accept Unicode conventions for characters if necessary. Additionally, this has the same problem the accepted solution used to have: as it is now, it splits on apostrophes. You may want o for o in s if (o not in string.punctuation or o == "'"), but then it's getting too complicated for a one-liner if we add in cedbeu's patch also.
  • rav
    rav almost 11 years
    The new approach will allow words which contains only ' char.
  • ninjagecko
    ninjagecko almost 11 years
    Clever, should work on all English grammatical constructs I can think of except an em-dash with no spaces—this, for example. (Workaroundable.)
  • Lyndsy Simon
    Lyndsy Simon almost 11 years
    This also doesn't handle unicode very well - the apostrophe used above is U+0027, which is the one on en_US keyboards. There is also U+2019, which Unicode says is the preferred apostrophe representation. I often see this character in text pasted from other sources. A regex could be written that looks for punctuation adjacent to whitespace or the beginning or end of a line. I may do that when I get a moment :)
  • Jesse Dhillon
    Jesse Dhillon over 10 years
    This isn't the answer to the question. This is an answer to a different question, that happens to work for this particular situation. It's as if someone asked "how do I make a left turn" and the top-voted answer was "take the next three right turns." It works for certain intersections, but it doesn't give the needed answer. Ironically, the answer is in re, just not findall. The answer below giving re.split() is superior.
  • Stefan van den Akker
    Stefan van den Akker about 10 years
    There is another issue here. Even when we take into account the changes of @cedbeu, this code doesn't work if the string is something like "First Name,Last Name,Street Address,City,State,Zip Code" and we want to split only on a comma ,. Desired output would be: ['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code'] What we get instead:['First', 'Name', 'Last', 'Name', 'Street', 'Address', 'City', 'State', 'Zip', 'Code']
  • hobs
    hobs about 10 years
    You can one-liner the translate implementation and ensure that S isn't among the splitters with: s.translate(''.join([(chr(i) if chr(i) not in seps else seps[0]) for i in range(256)])).split(seps[0])
  • tudor -Reinstate Monica-
    tudor -Reinstate Monica- almost 10 years
    Perfect for the case where you don't have access to the RE library, such as certain small microcontrollers. :-)
  • Bruno Feroleto
    Bruno Feroleto almost 10 years
    I agree, the \w and \W solutions are not an answer to (the title of) the question. Note that in your answer, | should be removed (you're thinking of expr0|expr1 instead of [char0 char1…]). Furthermore, there is no need to compile() the regular expression.
  • Bruno Feroleto
    Bruno Feroleto over 9 years
    Why not use the re module, which is both way faster and clearer (not that regular expressions are especially clear, but because it is way shorter and direct)?
  • BartoszKP
    BartoszKP over 9 years
    "I can't find any solution that does efficiently what the title of the questions literally asks" - second answer does that, posted 5 years ago: stackoverflow.com/a/1059601/2642204.
  • Bruno Feroleto
    Bruno Feroleto over 9 years
    This answer does not split at delimiters (from a set of multiple delimiters): it instead splits at anything that's not alphanumeric. That said, I agree that the intent of the original poster is probably to keep only the words, instead of removing some punctuation marks.
  • GravityWell
    GravityWell over 9 years
    EOL: I think this answer does split on a set of multiple delimiters. If you add non-alphanumerics to the string that are not specified, like underscore, they are not split, as expected.
  • Bruno Feroleto
    Bruno Feroleto over 9 years
    @GravityWell: I am not sure I understand: can you give a concrete example?
  • GravityWell
    GravityWell over 9 years
    @EOL: I just realized I was confused by your comment "This answer does not split..." I thought "this" referred to your re.split answer, but I now realize you meant gimel's answer. I think THIS answer (the answer to which I'm commenting) is the best answer :)
  • Adam Hughes
    Adam Hughes over 9 years
    I think this is more explicit than RE as well, so it's kind of noob friendly. Sometimes don't need general solution to everything
  • Bruno Feroleto
    Bruno Feroleto almost 9 years
    This solution is terribly inefficient: first the string is deconstructed into individual characters, then the whole set of punctuation characters is gone through for each single character in the original string, then the characters are assembled back, and then split again. All this "movement" is very complicated, too, compared to a regular expression-based solution: even if speed does not matter in a given application, there is no need for a complicated solution. Since the re module is standard and gives both legibility and speed, I don't see why it should be eschewed.
  • JayJay123
    JayJay123 almost 9 years
    Awesome. I had a .split() in a multiple input situation, and needed to catch when the user, me, separated the inputs with a space and not a comma. I was about to give up and recast with re, but your .replace() solution hit the nail on the head. Thanks.
  • Mark Amery
    Mark Amery almost 9 years
    @JesseDhillon "take all substrings consisting of a sequence of word characters" and "split on all substrings consisting of a sequence of non-word characters" are literally just different ways of expressing the same operation; I'm not sure why you'd call either answer superior.
  • szeta
    szeta over 8 years
    + for showing how to treat multiple subsequent delimiters as one. Thanks!
  • TMWP
    TMWP about 7 years
    This is an old post now but it is helping me today. Why the ' edit? I tried it with and without and saw no effect on my Windows 7 machine with Python 2.7. I also do not see that character mentioned in the regex cheat sheets I am working off. What does it do?
  • RichieHindle
    RichieHindle about 7 years
    @TMWP: The apostrophe means that a word like don't is treated as a single word, rather than being split into don and t.
  • TMWP
    TMWP about 7 years
    That explains it. My test sample did not include any contractions so I had nothing inherently in what I was trying to highlight why the ' was there. Thanks for clarifying. Going to change my code now. :-)
  • TMWP
    TMWP about 7 years
    The irony here is the reason this answer is not getting the most votes ... there are technically correct answers & then there is what the original requester is looking for (what they mean rather than what they say). This is a great answer and I've copied it for when I need it. And yet, for me, the top rated answer solves a problem that is very like what the poster was working on, quickly, cleanly and w/ minimal code. If a single answer had posted both solutions, I would have voted 4 that. Which 1 is better depends on what u r actually trying to do (not the "how-to" quest being asked). :-)
  • Ahmed Amr
    Ahmed Amr about 7 years
    it will get you wrong answer when you don't want to split on spaces and you want to split on other characters.
  • Rick
    Rick almost 7 years
    is this supposed to be working in Python as-written? my fragments result is just a list of the characters in the string (including the tokens).
  • ninjagecko
    ninjagecko almost 7 years
    @RickTeachey: it works for me in both python2 and python3.
  • Rick
    Rick almost 7 years
    hmmmm. Maybe the example is a bit ambiguous. I have tried the code in the answer all sorts of different ways- including having fragments = ['the,string'], fragments = 'the,string', or fragments = list('the,string') and none of them are producing the right output.
  • Bruno Feroleto
    Bruno Feroleto almost 7 years
    This is indeed tedious, but also quite slow, as the whole string is gone through for each new character to be removed. Regular expressions are really not hard for this case (see, e.g., my answer) and are arguably meant to handle this situation (they are both fast and concise and, I would say, legible in this case).
  • MarSoft
    MarSoft over 6 years
    Hm, another method is to use str.translate - it is not unicode-capable but is most likely faster than other methods and as such might be good in some cases: replacements=',-!?'; import string; my_str = my_str.translate(string.maketrans(replacements, ' ' * len(replacements))) Also, here it is mandatory to have replacements as a string of characters, not a tuple or list.
  • Taylor D. Edmiston
    Taylor D. Edmiston over 6 years
    @MarSoft Thanks! I mentioned that one at the top of the answer but decided not to add it since existing answers already discussed it well.
  • pprzemek
    pprzemek over 6 years
    None taken. You're comparing apples and oranges. ;) my solution in python 3 still works ;P and has support for multi-char separators. :) try doing that in simple manner without allocating a new string. :) but true, mine is limited to parsing command line params and not a book for example.
  • Scott Morken
    Scott Morken over 6 years
    A common use case of string splitting is removing empty string entries from the final result. Is it possible to do that with this method? re.split('\W+', ' a b c ') results in ['', 'a', 'b', 'c', '']
  • Edheldil
    Edheldil about 6 years
    @ScottMorken I suggest st. like [ e for e in re.split(r'\W+', ...) if e ] ... or possibly first do ' a b c '.strip()
  • Harsha Biyani
    Harsha Biyani about 6 years
    what if I have to split using word?
  • Admin
    Admin about 6 years
    @EOL I am trying to split on either > or < or =, whichever comes first in the passed string. using filter(None, re.split(">|<", feature_name)) but my output is <filter at 0x1ec49493f98> any advise on how to actually have the string
  • Bruno Feroleto
    Bruno Feroleto about 6 years
    You must be using Python 3, where filter() constructs an iterator and not a list. You can reproduce Python 2's behavior by wrapping the expression with list().
  • Przemek D
    Przemek D almost 6 years
    Much clearer than a regex. Plus, I don't really feel like importing a whole module just to perform a single, seemingly simple operation.
  • Bruno Feroleto
    Bruno Feroleto almost 6 years
    The translate() and maketrans() methods of strings are interesting, but this method fails to "split at delimiters" (or whitespace): for example, "There was a big cave-in" will incorrectly produce the word "cavein" instead of the expected "cave" and "in"… Thus, this does not do what the question asks for.
  • Jeremy Anifacc
    Jeremy Anifacc almost 6 years
    Just like what @EricLebigot commented. The method above does not do what the question asks for very well.
  • Little Bobby Tables
    Little Bobby Tables over 5 years
    Plays nicely with Pandas split string method - cute
  • Frank Vel
    Frank Vel over 5 years
    @ArtOfWarfare It is common to use the shift key to do the opposite of something. ctrl+z undo vs. ctrl+shift+z for redo. So shift w, or W, would be the opposite of w.
  • Kranach
    Kranach over 5 years
    This solution doesn't work if you want to split by a non-whitespace character.
  • Kranach
    Kranach over 5 years
    This answer should be at top - it is the only one that precisely answers the question title.
  • Saurav Mukherjee
    Saurav Mukherjee over 5 years
    print re.findall(r"[\w\-\_']+", DATA) is more appropriate as it will include the words with hyphen and underscores within them.
  • kushy
    kushy over 5 years
    Pretty clever and nice solution. Might not be the most 'elegant' way to do it, but it requires no additional imports and will work with most similar cases, so in a way, it is actually pretty elegant and beautiful too.
  • nyanpasu64
    nyanpasu64 over 5 years
    Is this supposed to be r'\W+' (raw strings)?
  • alancalvitti
    alancalvitti over 5 years
    @EricLebigot, what if the delimiter consists of a sequence of characters eg "--" (2 dashes), or ":=" ?
  • Bruno Feroleto
    Bruno Feroleto over 5 years
    … then you can simply list the strings that match by separating them with "pipes": "--|:=|[…]+".
  • zar3bski
    zar3bski about 5 years
    It's not a matter of verbose but, rather the fact of importing an entire library (which I love, BTW) to perform a simple task after converting a string to a panda series. Not very "Occam friendly".
  • Artemis
    Artemis about 5 years
    @ArtOfWarfare Ah, but not always: \a means system bell character, \A means start of line. IKR
  • uhoh
    uhoh almost 5 years
    @JesseDhillon agreed, I'll use this answer. However, apparently three right turns is the best answer to the other question! ;-) theconversation.com/… and youtu.be/gMRp4RqEsHk
  • Prokop Hapala
    Prokop Hapala over 4 years
    You say "does not produce a new string", meaning it works in place on the given string? I tested it now with Python 2.7 and it does not modify the original string and returns a new one.
  • revliscano
    revliscano about 4 years
    Great answer as for Python >= 3.6
  • pprzemek
    pprzemek about 3 years
    There are many versions of Python, not just the one on python.org. Not all of them have the re module, especially if you go embedded; then you cut whatever you can.
  • sanchaz
    sanchaz almost 3 years
    @SauravMukherjee not really. re.split(r'[^a-zA-Z0-9-\'_]+', DATA) would be more appropriate.
  • naught101
    naught101 almost 3 years
    @TWMP: agreed, but the inappropriateness of the question to the OP's actual problem at hand is a fault of the question, not the answers. The technically correct answer should be upvoted, on the basis that it best answers the question that people who come here via searching are looking for.
  • Futal
    Futal over 2 years
    string.translate and string.maketrans are not available in Python 3 but only in Python 2.