Splitting a string into words and punctuation

Solution 1

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
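
A quick illustration of both points (the sample string here is just an example):

import re

text = "Hello, I'm a string!"
print(re.findall(r"\w+|[^\w\s]", text, re.UNICODE))
# ['Hello', ',', 'I', "'", 'm', 'a', 'string', '!']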

Solution 2

If you are going to work in English (or some other common language), you can use NLTK (there are many other tools that do this, such as FreeLing).

import nltk
nltk.download('punkt')
sentence = "help, me"
nltk.word_tokenize(sentence)
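
For the sample sentence above, this should give exactly the split the question asks for:

>>> nltk.word_tokenize(sentence)
['help', ',', 'me']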

Solution 3

This worked for me:

import re

i = 'Sandra went to the hallway.!!'
l = re.split(r'(\W+?)', i)
print(l)

empty = ['', ' ']
l = [el for el in l if el not in empty]
print(l)

Output:
['Sandra', ' ', 'went', ' ', 'to', ' ', 'the', ' ', 'hallway', '.', '', '!', '', '!', '']
['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']
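
Roughly the same thing can be done in one pass by filtering inside a comprehension; this is just a compact variant of the same idea (str.strip() drops both the empty strings and the pure-whitespace tokens):

import re

i = 'Sandra went to the hallway.!!'
l = [tok for tok in re.split(r'(\W+?)', i) if tok.strip()]
print(l)
# ['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']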

Solution 4

Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.

This might only be a little faster, since ''.join() is used in place of +=, which is generally faster for building strings.

import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # punctuation: flush the current word, then emit the punctuation itself
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            # a letter (or apostrophe): keep building the current word
            word = ''.join([word, char])
    else:
        # whitespace ends the current word
        if word:
            result.append(word)
            word = ''
print(result)

Output:
['Hello', ',', "I'm", 'a', 'string', '!']

Comments

  • Admin
    Admin almost 2 years

    I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.

    For instance:

    >>> c = "help, me"
    >>> print c.split()
    ['help,', 'me']
    

    What I really want the list to look like is:

    ['help', ',', 'me']
    

    So, I want the string split at whitespace with the punctuation split from the words.

    I've tried to parse the string first and then run the split:

    >>> separatedPunctuation = ""
    >>> for character in c:
    ...     if character in ".,;!?":
    ...             outputCharacter = " %s" % character
    ...     else:
    ...             outputCharacter = character
    ...     separatedPunctuation += outputCharacter
    >>> print separatedPunctuation
    help , me
    >>> print separatedPunctuation.split()
    ['help', ',', 'me']
    

    This produces the result I want, but is painfully slow on large files.

    Is there a way to do this more efficiently?

  • Admin
    Admin over 15 years
    I have not profiled this, but I guess the main problem is the char-by-char concatenation of word. I'd instead use an index and slices.
  • Admin
    Admin over 15 years
    With tricks I can shave 50% off the execution time of your solution. My solution with re.findall() is still twice as fast.
  • rloth
    rloth over 9 years
    Upvoted because the \w+|[^\w\s] construct is more generic than the accepted answer, but AFAIK in Python 3 the re.UNICODE flag shouldn't be necessary.
  • Roland Pihlakas
    Roland Pihlakas almost 7 years
    You need to call if word: result.append(word) after the loop ends, else the last word is not in result.
  • sh37211
    sh37211 over 2 years
    The (original) above code fails for me with an error about needing the PUNKT resource. I'll suggest an edit to include nltk.download('punkt') after import nltk.