How to get rid of punctuation using NLTK tokenizer?
Solution 1
Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
Output:
['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
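If contractions should survive as single tokens, the pattern can be loosened to allow internal apostrophes. RegexpTokenizer simply applies a regular expression, so a sketch with the standard re module behaves the same way (the pattern here is an assumption about what counts as a word, not an NLTK default):

```python
import re

text = "Eighty-seven miles to go, yet. I can't stop. Onward!"
# \w+ alone would split "can't" into "can" and "t"; allowing
# internal apostrophes keeps contractions together
tokens = re.findall(r"\w+(?:'\w+)*", text)
print(tokens)
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'I', "can't", 'stop', 'Onward']
```

The same pattern can be passed to RegexpTokenizer(r"\w+(?:'\w+)*") if you prefer to stay within NLTK.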
Solution 2
You do not really need NLTK to remove punctuation. You can remove it with plain Python. For strings:
import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)  # Python 2 only
Or for Unicode strings (and for Python 3, where str.translate takes a single mapping table):
import string
translate_table = dict((ord(char), None) for char in string.punctuation)
s.translate(translate_table)
and then use this string in your tokenizer.
P.S. The string module has some other sets of characters that can be removed in the same way (such as string.digits).
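In Python 3 the two-argument form of str.translate is gone; the same removal is done with a single table built by str.maketrans. A minimal sketch:

```python
import string

s = "Hello, world! It's a test..."
# the third argument of str.maketrans lists characters to delete
table = str.maketrans("", "", string.punctuation)
print(s.translate(table))
# Hello world Its a test
```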
Solution 3
The code below removes all punctuation marks as well as non-alphabetic tokens. It is adapted from the NLTK book:
http://www.nltk.org/book/ch01.html
import nltk
s = "I can't do this now, because I'm so tired. Please give me some time. @ sd 4 232"
words = nltk.word_tokenize(s)
words = [word.lower() for word in words if word.isalpha()]
print(words)
Output:
['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']
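Note that isalpha() also drops tokens containing digits ('4', '232') and contraction fragments like "n't". If digits should survive, isalnum() is a small variation on the same recipe; it is sketched here with a plain split() so it runs without NLTK data:

```python
s = "I need 5 minutes. @ sd 4 232"
# isalnum() keeps purely alphanumeric tokens, so digits survive;
# 'minutes.' and '@' still fail the test and are dropped
words = [w.lower() for w in s.split() if w.isalnum()]
print(words)
# ['i', 'need', '5', 'sd', '4', '232']
```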
Solution 4
As noted in the comments, start with sent_tokenize(), because word_tokenize() works on a single sentence only. You can filter out punctuation with filter(). And if you have Unicode strings, make sure they are unicode objects (not 'str' objects encoded with some encoding like 'utf-8').
from nltk.tokenize import word_tokenize, sent_tokenize
text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print(list(filter(lambda word: word not in ',-', tokens)))
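Checking tokens against string.punctuation catches more than the ',-' literal. A Python 3 sketch on an already-tokenized list (so it runs without the NLTK sentence models):

```python
import string

tokens = ['It', 'is', 'a', 'blue', ',', 'small', ',', 'and',
          'extraordinary', 'ball', '.', 'Like', 'no', 'other']
# single-character punctuation tokens are members of string.punctuation
filtered = list(filter(lambda tok: tok not in string.punctuation, tokens))
print(filtered)
# ['It', 'is', 'a', 'blue', 'small', 'and', 'extraordinary', 'ball', 'Like', 'no', 'other']
```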
Solution 5
Seriously asking: what is a word? If you assume that a word consists of alphabetic characters only, you are wrong, since words such as can't will be destroyed into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.
Hence the solution is to tokenise first and then remove punctuation tokens.
import string
from nltk.tokenize import word_tokenize
tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']
tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']
...and then, if you wish, you can replace certain tokens, such as 'm with am.
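That replacement step can be sketched with a small lookup table (the EXPANSIONS map below is a hypothetical, incomplete example, not an NLTK feature):

```python
import string

tokens = ['I', "'m", 'a', 'southern', 'salesman', '.']
# hypothetical map from contraction fragments to full words;
# note that fragments like 's can be ambiguous (is vs. possessive)
EXPANSIONS = {"'m": "am", "n't": "not"}
tokens = [EXPANSIONS.get(t, t) for t in tokens if t not in string.punctuation]
print(tokens)
# ['I', 'am', 'a', 'southern', 'salesman']
```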
lizarisk
Updated on January 26, 2020
Comments
-
lizarisk over 4 years
I'm just starting to use NLTK and I don't quite understand how to get a list of words from a text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also, word_tokenize doesn't work with multiple sentences: dots are added to the last word.
-
rmalouf about 11 yearsMost of the complexity involved in the Penn Treebank tokenizer has to do with the proper handling of punctuation. Why use an expensive tokenizer that handles punctuation well if you're only going to strip out the punctuation?
-
Kurt Bourbaki almost 9 years
word_tokenize is a function that returns [token for sent in sent_tokenize(text, language) for token in _treebank_word_tokenize(sent)]. So I think that your answer is doing what nltk already does: using sent_tokenize() before word_tokenize(). At least this is the case for nltk3.
-
sffc almost 9 yearsNote that if you use this option, you lose natural-language features special to word_tokenize, like splitting apart contractions. You can naively split on the regex \w+ without any need for NLTK.
-
Sadık over 8 yearswhy converting tokens to text?
-
Ciprian Tomoiagă over 7 years@rmalouf because you don't need punctuation-only tokens? So you want did and n't, but not .
-
MikeL about 7 yearsJust be aware that using this method you will lose the word "not" in cases like "can't" or "don't", which may be very important for understanding and classifying the sentence. It is better to use sentence.translate(str.maketrans("", "", chars_to_remove)), where chars_to_remove can be ".,':;!?".
-
Diego Ferri over 6 yearsBeware that this solution kills contractions. That is because word_tokenize uses the standard tokenizer, TreebankWordTokenizer, which splits contractions (e.g. can't into ca and n't). However, n't is not alphanumeric and gets lost in the process.
-
Admin over 5 yearsThank you very much
-
C.J. Jackson over 5 yearsThis will also remove things like ... and -- while preserving contractions, which s.translate(None, string.punctuation) won't.
-
finiteautomata over 5 yearsTo illustrate @sffc comment, you might lose words such as "Mr."
-
Johnny over 5 yearsRemoving all punctuation with a list comprehension also works:
a = "*fa,fd.1lk#$"
print("".join([w for w in a if w not in string.punctuation]))
-
zipline86 about 5 years@MikeL You can get around words like "can't" and "don't" by importing contractions and calling contractions.fix(sentence_here) before tokenizing. It will turn "can't" into "cannot" and "don't" into "do not".
-
Md. Ashikur Rahman over 4 yearsIt's replacing "n't" with "t". How do I get rid of this?
-
Rishabh Gupta about 4 yearsThis one creates one token for each letter.
-
RandomWalker over 2 yearsThis approach no longer works in Python 3, as the translate method there takes exactly one argument. Please refer to this question if you still want to work with the translate method.