Converting a String to a List of Words?

235,025

Solution 1

Try this:

import re

mystr = 'This is a string, with words!'
wordList = re.sub(r"[^\w]", " ", mystr).split()

How it works:

From the docs:

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.

So in our case:

pattern is any non-word character.

\w matches any word character. With the re.ASCII flag (and in Python 2 without re.UNICODE) it is equivalent to the character set [a-zA-Z0-9_]: a to z, A to Z, 0 to 9 and underscore. For str patterns in Python 3 it also matches letters and digits from other scripts.

So we match any non-word character and replace it with a space.

Then we call split(), which splits the string on whitespace and returns a list.

So 'hello-world'

becomes 'hello world'

with re.sub

and then ['hello', 'world']

after split()
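As a rough sketch of the steps above, the substitution and split can be wrapped in a small helper. The name to_words is made up for this illustration and is not part of the original answer; the last line shows how re.ASCII narrows \w to [a-zA-Z0-9_].

```python
import re

def to_words(text):
    # Replace every non-word character with a space, then split on whitespace.
    return re.sub(r"[^\w]", " ", text).split()

print(to_words("hello-world"))       # ['hello', 'world']

# In Python 3, \w also matches non-ASCII letters such as U+00E9 (e with acute).
print(to_words("caf\u00e9 au lait"))

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the accented letter is replaced.
print(re.sub(r"[^\w]", " ", "caf\u00e9", flags=re.ASCII).split())  # ['caf']
```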

let me know if any doubts come up.

Solution 2

I think this is the simplest way for anyone else stumbling on this post given the late response:

>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']
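Note that the output above still has punctuation attached ('string,' and 'words!'). If that matters, one possible follow-up is to trim punctuation from the ends of each token after splitting. This is only a sketch; split_and_strip is a hypothetical helper name, and punctuation inside a word (such as an apostrophe) is left alone.

```python
import string

def split_and_strip(text):
    # Split on whitespace, then trim punctuation from both ends of each token.
    words = (token.strip(string.punctuation) for token in text.split())
    # Drop tokens that consisted only of punctuation and are now empty.
    return [word for word in words if word]

print(split_and_strip('This is a string, with words!'))
# ['This', 'is', 'a', 'string', 'with', 'words']
```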

Solution 3

To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:

>>> import nltk
>>> paragraph = "Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
...     nltk.word_tokenize(sentence)
...
['Hi', ',', 'this', 'is', 'my', 'first', 'sentence', '.']
['And', 'this', 'is', 'my', 'second', '.']
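The tokenizer returns punctuation marks as separate tokens, so a list of words only needs a filter on top. A minimal sketch, using a token list copied by hand from the first line of the output above so that it runs without NLTK installed:

```python
# Tokens copied from the first sentence in the output above.
tokens = ['Hi', ',', 'this', 'is', 'my', 'first', 'sentence', '.']

# Keep only tokens that contain at least one letter or digit.
words = [t for t in tokens if any(ch.isalnum() for ch in t)]
print(words)  # ['Hi', 'this', 'is', 'my', 'first', 'sentence']
```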

Solution 4

The simplest way:

>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']
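Because \w does not match apostrophes or hyphens, the pattern above splits "don't" into 'don' and 't'. If such words should stay whole, one possible variation is to allow those characters between runs of word characters. The names WORD_RE and find_words are invented for this sketch, and the set of joining characters (ASCII apostrophe and hyphen, U+2019, U+2011) is an assumption you may need to adjust.

```python
import re

# A word is a run of word characters, optionally joined by an apostrophe
# (' or U+2019) or a hyphen (- or the non-breaking hyphen U+2011).
WORD_RE = re.compile(r"\w+(?:['\u2019\u2011-]\w+)*")

def find_words(text):
    return WORD_RE.findall(text)

print(find_words("Don't split well-known words!"))
# ["Don't", 'split', 'well-known', 'words']
```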

Solution 5

Using string.punctuation for completeness:

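import re
import string

s = 'This is a string, with words!'
# re.escape keeps characters such as ], ^ and \ literal inside the character class.
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()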

This handles newlines as well.
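The same idea can be sketched without a regular expression by using str.translate with a table that deletes every character in string.punctuation. The names _DROP_PUNCT and words_without_punctuation are made up for this example. As with the re.sub version, punctuation is deleted rather than replaced by a space, so 'hello-world' becomes a single word.

```python
import string

# Translation table that deletes every character in string.punctuation.
_DROP_PUNCT = str.maketrans('', '', string.punctuation)

def words_without_punctuation(text):
    # Remove punctuation, then split on any whitespace (including newlines).
    return text.translate(_DROP_PUNCT).split()

print(words_without_punctuation('This is a string,\nwith words!'))
# ['This', 'is', 'a', 'string', 'with', 'words']
```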

Author by

rectangletangle

Updated on February 05, 2022

Comments

  • rectangletangle
    rectangletangle about 2 years

    I'm trying to convert a string to a list of words using Python. I want to take something like the following:

    string = 'This is a string, with words!'
    

    Then convert to something like this :

    list = ['This', 'is', 'a', 'string', 'with', 'words']
    

    Notice the omission of punctuation and spaces. What would be the fastest way of going about this?

  • Levon
    Levon over 11 years
    You need to separate and eliminate the punctuation from the words (e.g., "string," and "words!"). As is, this does not meet OP's requirements.
  • Brōtsyorfuzthrāx
    Brōtsyorfuzthrāx over 9 years
    Remember to handle apostrophes and hyphens, too, since they're not included in \w.
  • Brōtsyorfuzthrāx
    Brōtsyorfuzthrāx over 9 years
    You may want to handle formatted apostrophes and non-breaking hyphens, too.
  • Ege
    Ege about 3 years
    string.split() is much easier
  • Coddy
    Coddy about 2 years
    What's the point of this solution if there exists a more optimal solution?