How to get rid of punctuation using NLTK tokenizer?


Solution 1

Take a look at the other tokenizing options that nltk provides. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
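
If you want to keep contractions such as can't together instead of splitting them at the apostrophe, you can widen the pattern. A minimal sketch (the exact regex is my own assumption, not part of the original answer):

from nltk.tokenize import RegexpTokenizer

# match a contraction (word'word) before falling back to a plain word
tokenizer = RegexpTokenizer(r"\w+'\w+|\w+")
print(tokenizer.tokenize("Eighty-seven miles to go, yet. We can't stop now!"))
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'We', "can't", 'stop', 'now']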

Solution 2

You do not really need NLTK to remove punctuation; you can remove it with plain Python. For Python 2 str objects:

import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)

Or for unicode strings (this dict-based translate table also works with Python 3 str):

import string
translate_table = {ord(char): None for char in string.punctuation}
s.translate(translate_table)

and then use this string in your tokenizer.

P.S. The string module has other sets of characters that can be removed in the same way (such as string.digits).
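
In Python 3, str.translate no longer accepts a second deletechars argument, so the first snippet above is Python 2 only. A minimal Python 3 sketch of the same idea using str.maketrans:

import string

s = '... some string with punctuation ...'
# build a table that maps every punctuation character to None and apply it
s = s.translate(str.maketrans('', '', string.punctuation))
print(s)  # ' some string with punctuation '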

Solution 3

The code below removes all punctuation marks as well as non-alphabetic tokens. Adapted from the NLTK book:

http://www.nltk.org/book/ch01.html

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)

words = [word.lower() for word in words if word.isalpha()]

print(words)

Output:

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']
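
Note that word.isalpha() also throws away the n't in can't (only ca survives above). If you would rather keep any token that contains at least one letter, a hedged variation is to relax the filter:

import nltk

s = "I can't do this now, because I'm so tired."
words = nltk.word_tokenize(s)
# keep tokens containing at least one alphabetic character,
# so contraction pieces like n't and 'm are not discarded
words = [w.lower() for w in words if any(c.isalpha() for c in w)]
print(words)
# ['i', 'ca', "n't", 'do', 'this', 'now', 'because', 'i', "'m", 'so', 'tired']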

Solution 4

As noted in the comments, start with sent_tokenize(), because word_tokenize() works on a single sentence only. You can filter out punctuation with filter(). And if you have unicode strings, make sure they are unicode objects (not 'str' objects encoded in some encoding like 'utf-8').

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print(list(filter(lambda word: word not in ',-', tokens)))
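
If you want to drop the full ASCII punctuation set rather than just ',' and '-', one variation (my own sketch, not part of the original answer) is to filter against string.punctuation:

import string
from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
# drop tokens that are single punctuation marks
print([word for word in tokens if word not in string.punctuation])
# ['It', 'is', 'a', 'blue', 'small', 'and', 'extraordinary', 'ball', 'Like', 'no', 'other']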

Solution 5

Seriously asking: what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong, since words such as can't will be split into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.

Hence the solution is to tokenise and then remove punctuation tokens.

import string

from nltk.tokenize import word_tokenize

tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']

tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']

...and then if you wish, you can replace certain tokens such as 'm with am.
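
A minimal sketch of that last step, using a small hand-written mapping (the dictionary below is my own illustration, not part of the original answer):

import string
from nltk.tokenize import word_tokenize

# hypothetical mapping for a few common contraction tokens
CONTRACTION_MAP = {"'m": 'am', "n't": 'not', "'re": 'are', "'ll": 'will'}

tokens = word_tokenize("I'm a southern salesman.")
tokens = [t for t in tokens if t not in string.punctuation]
tokens = [CONTRACTION_MAP.get(t, t) for t in tokens]
print(tokens)
# ['I', 'am', 'a', 'southern', 'salesman']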


Comments

  • lizarisk
    lizarisk over 4 years

    I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also word_tokenize doesn't work with multiple sentences: dots are added to the last word.

  • rmalouf
    rmalouf about 11 years
    Most of the complexity involved in the Penn Treebank tokenizer has to do with the proper handling of punctuation. Why use an expensive tokenizer that handles punctuation well if you're only going to strip out the punctuation?
  • Kurt Bourbaki
    Kurt Bourbaki almost 9 years
    word_tokenize is a function that returns [token for sent in sent_tokenize(text, language) for token in _treebank_word_tokenize(sent)]. So I think that your answer is doing what nltk already does: using sent_tokenize() before using word_tokenize(). At least this is for nltk3.
  • sffc
    sffc almost 9 years
    Note that if you use this option, you lose natural language features special to word_tokenize like splitting apart contractions. You can naively split on the regex \w+ without any need for the NLTK.
  • Sadık
    Sadık over 8 years
Why convert the tokens back to text?
  • Ciprian Tomoiagă
    Ciprian Tomoiagă over 7 years
@rmalouf Because you don't need punctuation-only tokens? So you want did and n't, but not .
  • MikeL
    MikeL about 7 years
Just be aware that using this method you will lose the word "not" in cases like "can't" or "don't", which may be very important for understanding and classifying the sentence. It is better to use sentence.translate(str.maketrans("", "", chars_to_remove)), where chars_to_remove can be ".,':;!?".
  • Diego Ferri
    Diego Ferri over 6 years
Beware that this solution kills contractions. That is because word_tokenize uses the standard tokenizer, TreebankWordTokenizer, which splits contractions (e.g. can't into ca and n't). However, n't is not alphanumeric and gets lost in the process.
  • Admin
    Admin over 5 years
    Thank you very much
  • C.J. Jackson
    C.J. Jackson over 5 years
    This will also remove things like ... and -- while preserving contractions, which s.translate(None, string.punctuation) won't
  • finiteautomata
    finiteautomata over 5 years
    To illustrate @sffc comment, you might lose words such as "Mr."
  • Johnny
    Johnny over 5 years
Removing all punctuation with a list comprehension also works: a = "*fa,fd.1lk#$"; print("".join([w for w in a if w not in string.punctuation]))
  • zipline86
    zipline86 about 5 years
@MikeL You can get around words like "can't" and "don't" by importing contractions and calling contractions.fix(sentence_here) before tokenizing. It will turn "can't" into "cannot" and "don't" into "do not".
  • Md. Ashikur Rahman
    Md. Ashikur Rahman over 4 years
It's replacing "n't" with "t". How do I get rid of this?
  • Rishabh Gupta
    Rishabh Gupta about 4 years
    This one creates one token for each letter.
  • RandomWalker
    RandomWalker over 2 years
This approach no longer works in Python >= 3.1, as the translate method only takes exactly one argument. Please refer to this question if you still want to work with the translate method.