How to get rid of punctuation using NLTK tokenizer?


Solution 1

Take a look at the other tokenizing options that nltk provides. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
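
If you want to keep contractions such as can't together instead of splitting them at the apostrophe, you can widen the pattern. A minimal sketch (the exact regex is my own assumption, not part of the original answer):

from nltk.tokenize import RegexpTokenizer

# match a contraction (word'word) before falling back to a plain word
tokenizer = RegexpTokenizer(r"\w+'\w+|\w+")
print(tokenizer.tokenize("Eighty-seven miles to go, yet. We can't stop now!"))
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'We', "can't", 'stop', 'now']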

Solution 2

You do not really need NLTK to remove punctuation; you can remove it with plain Python. For Python 2 str objects:

import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)

Or for unicode strings (this dict-based translate table also works with Python 3 str):

import string
translate_table = {ord(char): None for char in string.punctuation}
s.translate(translate_table)

and then use this string in your tokenizer.

P.S. The string module has other sets of characters that can be removed in the same way (such as string.digits).
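
In Python 3, str.translate no longer accepts a second deletechars argument, so the first snippet above is Python 2 only. A minimal Python 3 sketch of the same idea using str.maketrans:

import string

s = '... some string with punctuation ...'
# build a table that maps every punctuation character to None and apply it
s = s.translate(str.maketrans('', '', string.punctuation))
print(s)  # ' some string with punctuation '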

Solution 3

The code below removes all punctuation marks as well as non-alphabetic tokens. Adapted from the NLTK book:

http://www.nltk.org/book/ch01.html

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)

words = [word.lower() for word in words if word.isalpha()]

print(words)

Output:

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']
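
Note that word.isalpha() also throws away the n't in can't (only ca survives above). If you would rather keep any token that contains at least one letter, a hedged variation is to relax the filter:

import nltk

s = "I can't do this now, because I'm so tired."
words = nltk.word_tokenize(s)
# keep tokens containing at least one alphabetic character,
# so contraction pieces like n't and 'm are not discarded
words = [w.lower() for w in words if any(c.isalpha() for c in w)]
print(words)
# ['i', 'ca', "n't", 'do', 'this', 'now', 'because', 'i', "'m", 'so', 'tired']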

Solution 4

As noted in the comments, start with sent_tokenize(), because word_tokenize() works on a single sentence only. You can filter out punctuation with filter(). And if you have unicode strings, make sure they are unicode objects (not 'str' objects encoded in some encoding like 'utf-8').

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print(list(filter(lambda word: word not in ',-', tokens)))
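
If you want to drop the full ASCII punctuation set rather than just ',' and '-', one variation (my own sketch, not part of the original answer) is to filter against string.punctuation:

import string
from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
# drop tokens that are single punctuation marks
print([word for word in tokens if word not in string.punctuation])
# ['It', 'is', 'a', 'blue', 'small', 'and', 'extraordinary', 'ball', 'Like', 'no', 'other']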

Solution 5

Seriously asking: what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong, since words such as can't will be split into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.

Hence the solution is to tokenise and then remove punctuation tokens.

import string

from nltk.tokenize import word_tokenize

tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']

tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']

...and then if you wish, you can replace certain tokens such as 'm with am.
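
A minimal sketch of that last step, using a small hand-written mapping (the dictionary below is my own illustration, not part of the original answer):

import string
from nltk.tokenize import word_tokenize

# hypothetical mapping for a few common contraction tokens
CONTRACTION_MAP = {"'m": 'am', "n't": 'not', "'re": 'are', "'ll": 'will'}

tokens = word_tokenize("I'm a southern salesman.")
tokens = [t for t in tokens if t not in string.punctuation]
tokens = [CONTRACTION_MAP.get(t, t) for t in tokens]
print(tokens)
# ['I', 'am', 'a', 'southern', 'salesman']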


Comments

  • lizarisk
    lizarisk over 4 years

    I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also word_tokenize doesn't work with multiple sentences: dots are added to the last word.

  • rmalouf
    rmalouf about 11 years
    Most of the complexity involved in the Penn Treebank tokenizer has to do with the proper handling of punctuation. Why use an expensive tokenizer that handles punctuation well if you're only going to strip out the punctuation?
  • Kurt Bourbaki
    Kurt Bourbaki almost 9 years
    word_tokenize is a function that returns [token for sent in sent_tokenize(text, language) for token in _treebank_word_tokenize(sent)]. So I think that your answer is doing what nltk already does: using sent_tokenize() before using word_tokenize(). At least this is for nltk3.
  • sffc
    sffc almost 9 years
    Note that if you use this option, you lose natural language features special to word_tokenize like splitting apart contractions. You can naively split on the regex \w+ without any need for the NLTK.
  • Sadık
    Sadık over 8 years
Why convert the tokens back to text?
  • Ciprian Tomoiagă
    Ciprian Tomoiagă over 7 years
@rmalouf Because you don't need punctuation-only tokens? So you want did and n't, but not .
  • MikeL
    MikeL about 7 years
Just be aware that using this method you will lose the word "not" in cases like "can't" or "don't", which may be very important for understanding and classifying the sentence. It is better to use sentence.translate(str.maketrans("", "", chars_to_remove)), where chars_to_remove can be ".,':;!?".
  • Diego Ferri
    Diego Ferri over 6 years
Beware that this solution kills contractions. That is because word_tokenize uses the standard tokenizer, TreebankWordTokenizer, which splits contractions (e.g. can't into ca and n't). However, n't is not alphanumeric and gets lost in the process.
  • Admin
    Admin over 5 years
    Thank you very much
  • C.J. Jackson
    C.J. Jackson over 5 years
    This will also remove things like ... and -- while preserving contractions, which s.translate(None, string.punctuation) won't
  • finiteautomata
    finiteautomata over 5 years
    To illustrate @sffc comment, you might lose words such as "Mr."
  • Johnny
    Johnny over 5 years
Removing all punctuation with a list comprehension also works: a = "*fa,fd.1lk#$"; print("".join([w for w in a if w not in string.punctuation]))
  • zipline86
    zipline86 about 5 years
@MikeL You can get around words like "can't" and "don't" by importing contractions and calling contractions.fix(sentence_here) before tokenizing. It will turn "can't" into "cannot" and "don't" into "do not".
  • Md. Ashikur Rahman
    Md. Ashikur Rahman over 4 years
It's replacing "n't" with "t". How do I get rid of this?
  • Rishabh Gupta
    Rishabh Gupta about 4 years
    This one creates one token for each letter.
  • RandomWalker
    RandomWalker over 2 years
This approach no longer works in Python >= 3.1, as the translate method only takes exactly one argument. Please refer to this question if you still want to work with the translate method.