Splitting a string into words and punctuation

Solution 1

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
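
A quick illustration of both points (the sample string here is just an example):

import re

text = "Hello, I'm a string!"
print(re.findall(r"\w+|[^\w\s]", text, re.UNICODE))
# ['Hello', ',', 'I', "'", 'm', 'a', 'string', '!']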

Solution 2

If you are going to work in English (or some other common language), you can use NLTK (there are many other tools that do this, such as FreeLing).

import nltk
nltk.download('punkt')
sentence = "help, me"
nltk.word_tokenize(sentence)
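
For the sample sentence above, this should give exactly the split the question asks for:

>>> nltk.word_tokenize(sentence)
['help', ',', 'me']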

Solution 3

This worked for me:

import re

i = 'Sandra went to the hallway.!!'
l = re.split(r'(\W+?)', i)
print(l)

empty = ['', ' ']
l = [el for el in l if el not in empty]
print(l)

Output:
['Sandra', ' ', 'went', ' ', 'to', ' ', 'the', ' ', 'hallway', '.', '', '!', '', '!', '']
['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']
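
Roughly the same thing can be done in one pass by filtering inside a comprehension; this is just a compact variant of the same idea (str.strip() drops both the empty strings and the pure-whitespace tokens):

import re

i = 'Sandra went to the hallway.!!'
l = [tok for tok in re.split(r'(\W+?)', i) if tok.strip()]
print(l)
# ['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']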

Solution 4

Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.

This might only be a little faster, since ''.join() is used in place of +=, which is generally faster for building strings.

import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # punctuation: flush the current word, then emit the punctuation itself
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            # a letter (or apostrophe): keep building the current word
            word = ''.join([word, char])
    else:
        # whitespace ends the current word
        if word:
            result.append(word)
            word = ''
print(result)

Output:
['Hello', ',', "I'm", 'a', 'string', '!']

Comments

  • Admin
    Admin almost 2 years

    I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.

    For instance:

    >>> c = "help, me"
    >>> print c.split()
    ['help,', 'me']
    

    What I really want the list to look like is:

    ['help', ',', 'me']
    

    So, I want the string split at whitespace with the punctuation split from the words.

    I've tried to parse the string first and then run the split:

    >>> separatedPunctuation = ""
    >>> for character in c:
    ...     if character in ".,;!?":
    ...             outputCharacter = " %s" % character
    ...     else:
    ...             outputCharacter = character
    ...     separatedPunctuation += outputCharacter
    >>> print separatedPunctuation
    help , me
    >>> print separatedPunctuation.split()
    ['help', ',', 'me']
    

    This produces the result I want, but is painfully slow on large files.

    Is there a way to do this more efficiently?

  • Admin
    Admin over 15 years
    I have not profiled this, but I guess the main problem is the char-by-char concatenation of word. I'd instead use an index and slices.
  • Admin
    Admin over 15 years
    With tricks I can shave 50% off the execution time of your solution. My solution with re.findall() is still twice as fast.
  • rloth
    rloth over 9 years
    Upvoted because the \w+|[^\w\s] construct is more generic than the accepted answer, but AFAIK in Python 3 the re.UNICODE flag shouldn't be necessary.
  • Roland Pihlakas
    Roland Pihlakas almost 7 years
    You need to call if word: result.append(word) after the loop ends, else the last word is not in result.
  • sh37211
    sh37211 over 2 years
    The (original) above code fails for me with an error about needing the PUNKT resource. I'll suggest an edit to include nltk.download('punkt') after import nltk.