Splitting a string into words and punctuation


Solution 1

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative matches runs of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second matches individual non-word characters, skipping whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
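
As a minimal sketch of how the pattern behaves, the snippet below wraps it in a helper and runs it on the strings discussed above. The names tokenize and TOKEN_RE are hypothetical, not part of the answer. It assumes Python 3, where str patterns are already Unicode-aware, so re.UNICODE is kept only to mirror the original call.

```python
import re

# Hypothetical names; the pattern itself is the one from the answer above.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def tokenize(text):
    """Return runs of word characters and single punctuation marks."""
    return TOKEN_RE.findall(text)

print(tokenize("help, me"))   # words and the comma as separate items
print(tokenize("I'm"))        # the apostrophe becomes its own token
print(tokenize("résumé!"))    # accented letters stay inside the word
```

Keep in mind that \w also matches digits and the underscore, so a string like "snake_case" stays a single token.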

Solution 2

If you are going to work in English (or some other common languages), you can use NLTK (there are many other tools for this, such as FreeLing).

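import nltk
# The tokenizer data may need a one-time download (see the comments):
# nltk.download('punkt')
sentence = "help, me"
print(nltk.word_tokenize(sentence))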

Solution 3

This worked for me:

import re

i = 'Sandra went to the hallway.!!'
# The capturing group keeps each non-word separator in the result
l = re.split(r'(\W+?)', i)
print(l)
# ['Sandra', ' ', 'went', ' ', 'to', ' ', 'the', ' ', 'hallway', '.', '', '!', '', '!', '']

# Drop the empty strings and single spaces
empty = ['', ' ']
l = [el for el in l if el not in empty]
print(l)
# ['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']
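
The filter above only removes '' and a single space, so tabs or newlines would stay in the list. A possible variation, sketched here under the assumption that all whitespace should be dropped, filters with str.strip() instead. The function name is hypothetical.

```python
import re

def split_words_and_punctuation(text):
    # Hypothetical helper: split on every single non-word character and keep it
    parts = re.split(r"(\W)", text)
    # el.strip() is empty for '' and for any whitespace-only piece
    return [el for el in parts if el.strip()]

print(split_words_and_punctuation("Sandra went\tto the hallway.!!"))
```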

Solution 4

Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into NLTK, as le dorfier suggested.

This might be only a little faster: it uses ''.join() in place of +=, and join is generally known to be the faster of the two.

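import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # Punctuation: flush the pending word, then emit the character
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        # Whitespace ends the current word
        if word:
            result.append(word)
            word = ''

# Flush the last word if the string ends in a letter
if word:
    result.append(word)

print(result)
# ['Hello', ',', "I'm", 'a', 'string', '!']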

Updated on July 05, 2022


  • Admin
    Admin almost 2 years

    I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.

    For instance:

    >>> c = "help, me"
    >>> print c.split()
    ['help,', 'me']

    What I really want the list to look like is:

    ['help', ',', 'me']

    So, I want the string split at whitespace with the punctuation split from the words.

    I've tried to parse the string first and then run the split:

    >>> separatedPunctuation = ""
    >>> for character in c:
    ...     if character in ".,;!?":
    ...             outputCharacter = " %s" % character
    ...     else:
    ...             outputCharacter = character
    ...     separatedPunctuation += outputCharacter
    >>> print separatedPunctuation
    help , me
    >>> print separatedPunctuation.split()
    ['help', ',', 'me']

    This produces the result I want, but is painfully slow on large files.

    Is there a way to do this more efficiently?

  • Admin
    Admin over 15 years
    I have not profiled this, but I guess the main problem is the char-by-char concatenation of word. I'd use an index and slices instead.
  • Admin
    Admin over 15 years
    With tricks I can shave 50% off the execution time of your solution. My solution with re.findall() is still twice as fast.
  • rloth
    rloth over 9 years
    Upvoted because the \w+|[^\w\s] construct is more generic than the accepted answer, but AFAIK re.UNICODE shouldn't be necessary in Python 3.
  • Roland Pihlakas
    Roland Pihlakas almost 7 years
    You need to call if word: result.append(word) after the loop ends, else the last word is not in result.
  • sh37211
    sh37211 over 2 years
    The (original) above code fails for me with an error about needing the PUNKT resource. I'll suggest an edit to include nltk.download('punkt') after import nltk.
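
The speed claims in the comments above were not profiled here. A rough way to check them on your own data is a small timeit harness like the sketch below. The function names and the sample input are assumptions made for illustration, and the numbers will vary by machine and Python version.

```python
import re
import timeit

def split_by_concatenation(text):
    # The question's approach: pad punctuation with a space, then split
    separated = ""
    for character in text:
        if character in ".,;!?":
            separated += " %s" % character
        else:
            separated += character
    return separated.split()

def split_by_regex(text):
    # Pattern from Solution 1
    return re.findall(r"\w+|[^\w\s]", text)

sample = "help, me! " * 1000  # arbitrary test input

for func in (split_by_concatenation, split_by_regex):
    seconds = timeit.timeit(lambda: func(sample), number=10)
    print("%s: %.4f s" % (func.__name__, seconds))
```

Both functions return the same list for this sample, but they differ on other input: the regex splits on any punctuation, while the loop only recognizes the five characters in ".,;!?".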