Splitting a string into words and punctuation
Solution 1
Here is a Unicode-aware version:
import re
re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.
Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
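A minimal, runnable sketch of that call (the sample text here is chosen for illustration):

```python
import re

text = "Hello, I'm a string!"
# \w+ grabs runs of word characters; [^\w\s] grabs single
# punctuation characters, skipping whitespace entirely.
tokens = re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
print(tokens)  # ['Hello', ',', 'I', "'", 'm', 'a', 'string', '!']
```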
Solution 2
If you are going to work in English (or some other common languages), you can use NLTK (there are many other tools to do this such as FreeLing).
import nltk
nltk.download('punkt')
sentence = "help, me"
nltk.word_tokenize(sentence)
Solution 3
This worked for me:
import re
i = 'Sandra went to the hallway.!!'
l = re.split(r'(\W+?)', i)
print(l)
empty = ['', ' ']
l = [el for el in l if el not in empty]
print(l)
Output:
['Sandra', ' ', 'went', ' ', 'to', ' ', 'the', ' ', 'hallway', '.', '', '!', '', '!', '']
['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']
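The filtering step can also be folded into a single comprehension, since str.strip() returns an empty (falsy) string for both '' and whitespace-only tokens (a small variant, not from the original answer):

```python
import re

i = 'Sandra went to the hallway.!!'
# t.strip() is '' (falsy) for empty strings and bare whitespace,
# so only words and punctuation survive the filter.
l = [t for t in re.split(r'(\W+?)', i) if t.strip()]
print(l)  # ['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']
```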
Solution 4
Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into NLTK, which le dorfier suggested.
This might be only a little faster, since ''.join() is used in place of +=, which is known to be faster.
import string

d = "Hello, I'm a string!"
result = []
word = ''
for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        if word:
            result.append(word)
            word = ''
if word:  # flush a trailing word, otherwise the last word is dropped
    result.append(word)
print(result)
['Hello', ',', "I'm", 'a', 'string', '!']
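Wrapped in a function (a hypothetical tokenize helper, not part of the original answer), the post-loop flush that the comments below mention becomes explicit; without it, any input ending in a letter would lose its last word:

```python
import string

def tokenize(text):
    result = []
    word = ''
    for char in text:
        if char in string.whitespace:
            if word:              # whitespace ends the current word
                result.append(word)
                word = ''
        elif char in string.ascii_letters + "'":
            word += char          # still inside a word
        else:
            if word:              # punctuation ends the word...
                result.append(word)
                word = ''
            result.append(char)   # ...and is a token itself
    if word:                      # flush a trailing word
        result.append(word)
    return result

print(tokenize("help, me"))  # ['help', ',', 'me']
```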
Admin
Updated on July 05, 2022

Comments
-
Admin almost 2 years
I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.
For instance:
>>> c = "help, me"
>>> print c.split()
['help,', 'me']
What I really want the list to look like is:
['help', ',', 'me']
So, I want the string split at whitespace with the punctuation split from the words.
I've tried to parse the string first and then run the split:
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
>>> print separatedPunctuation
help , me
>>> print separatedPunctuation.split()
['help', ',', 'me']
This produces the result I want, but is painfully slow on large files.
Is there a way to do this more efficiently?
-
Admin over 15 years: I have not profiled this, but I guess the main problem is the char-by-char concatenation of word. I'd instead use an index and slices.
-
Admin over 15 years: With tricks I can shave 50% off the execution time of your solution. My solution with re.findall() is still twice as fast.
-
rloth over 9 years: Upvoted because the \w+|[^\w\s] construct is more generic than the accepted answer, but AFAIK in Python 3 the re.UNICODE flag shouldn't be necessary.
-
Roland Pihlakas almost 7 years: You need to call if word: result.append(word) after the loop ends, else the last word is not in the result.
-
sh37211 over 2 years: The (original) above code fails for me with an error about needing the PUNKT resource. I'll suggest an edit to include nltk.download('punkt') after import nltk.