NLTK python tokenizing a CSV file


As the Python csv documentation explains, csv.reader "returns a reader object which will iterate over lines in the given csvfile". In other words, word_tokenize cannot be applied to the reader object itself; if you want to tokenize the text in your CSV file, you have to iterate over the lines and over the fields in those lines:

from nltk import word_tokenize

for line in reader:
    for field in line:
        tokens = word_tokenize(field)

Also, since you import word_tokenize at the beginning of your script, you should call it as word_tokenize, not as nltk.word_tokenize. This also means you can drop the import nltk statement.
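For reference, here is the same loop run end-to-end against an in-memory CSV. The sample data is invented for illustration, and `str.split` stands in for `word_tokenize` so the sketch has no NLTK dependency:

```python
import csv
from io import StringIO

# In-memory stand-in for the CSV file; '|' is the quote character,
# matching the csv.reader call in the question.
data = StringIO('|hello world|,|good morning|\n|one more row|,|final field|\n')
reader = csv.reader(data, delimiter=',', quotechar='|')

all_tokens = []
for line in reader:              # each line is a list of fields
    for field in line:           # each field is a plain string
        tokens = field.split()   # stand-in for word_tokenize(field)
        all_tokens.extend(tokens)

print(all_tokens)
```

Because the tokenizer only accepts a string, the call belongs inside the inner loop, where `field` is guaranteed to be one.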

Author: OAK

Updated on June 04, 2022

Comments

  • OAK about 2 years

    I have begun experimenting with Python and NLTK. I am running into a lengthy error message that I cannot find a solution to, and would appreciate any insights you may have.

    import nltk,csv,numpy 
    from nltk import sent_tokenize, word_tokenize, pos_tag
    reader = csv.reader(open('Medium_Edited.csv', 'rU'), delimiter= ",",quotechar='|')
    tokenData = nltk.word_tokenize(reader)
    

    I'm running Python 2.7 and the latest nltk package on OSX Yosemite. These are two more lines of code I attempted, with no difference in the results:

    with open("Medium_Edited.csv", "rU") as csvfile:
    tokenData = nltk.word_tokenize(reader)
    

    These are the error messages I see:

    Traceback (most recent call last):
      File "nltk_text.py", line 11, in <module>
        tokenData = nltk.word_tokenize(reader)
      File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 101, in word_tokenize
        return [token for sent in sent_tokenize(text, language)
      File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 86, in sent_tokenize
        return tokenizer.tokenize(text)
      File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
        return list(self.sentences_from_text(text, realign_boundaries))
      File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
        return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
      File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
        return [(sl.start, sl.stop) for sl in slices]
      File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
        for sl1, sl2 in _pair_iter(slices):
      File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
        prev = next(it)
      File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1278, in _slices_from_text
        for match in self._lang_vars.period_context_re().finditer(text):
    TypeError: expected string or buffer
    

    Thanks in advance
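The last frame of the traceback points at the root cause: Punkt ultimately applies a regular expression to its input, and a csv.reader object is not a string. The same failure can be reproduced without NLTK (the regex and data below are illustrative only; on Python 3 the message reads "bytes-like object" rather than "buffer"):

```python
import csv
import re
from io import StringIO

reader = csv.reader(StringIO('a,b\n'))

try:
    # Mimics the failing call: a regex applied to a reader object
    # instead of a string, as in word_tokenize(reader).
    re.compile(r'\w+').findall(reader)
except TypeError as exc:
    message = str(exc)

print(message)  # begins with "expected string"
```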

  • OAK about 9 years
    Thanks for the response, this is my edited code: `import csv import numpy as np from nltk import sent_tokenize, word_tokenize as word_tokenize, pos_tag reader = csv.reader(open('Milling_Final_Edited.csv', 'rU'), delimiter=',', quotechar='"') for line in reader: for field in line: tokens = word_tokenize(field)`. I am a newbie to Python and NLTK, so I have some catching up to do. Now I am receiving an encoding error: `UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 1: ordinal not in range(128)`. The file is UTF-8, though.
  • yvespeirsman about 9 years
    Try to import codecs and open the file as `codecs.open('Milling_Final_Edited.csv', 'rU', encoding="utf-8")`.
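Sketched out, that suggestion looks like this; a throwaway temp file stands in for Milling_Final_Edited.csv, since the real file isn't available here:

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file to stand in for the real CSV.
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(path, 'wb') as f:
    f.write(u'caf\u00e9,3\n'.encode('utf-8'))

# codecs.open decodes each line as it is read,
# so downstream code never sees raw bytes.
with codecs.open(path, 'r', encoding='utf-8') as f:
    lines = [line.strip() for line in f]

print(lines)  # ['café,3']
```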
  • OAK about 9 years
    New code: `import csv, codecs import numpy as np from nltk import sent_tokenize, word_tokenize as word_tokenize, pos_tag as pos_tag reader = codecs.open('Milling_Final_Edited.csv', 'r', encoding="utf-8", errors="ignore") for line in reader: for field in line: tokens = word_tokenize(field) posData = pos_tag(tokens) print(posData)`. I had to add errors="ignore" to codecs.open to solve the encoding error, but now I have another issue. The output is [(u'3', 'LS')]. What does the u character in front of the actual file character ('3') mean? Also, the output is only 1 line out of 25 lines.
  • yvespeirsman about 9 years
    The u indicates this is a unicode string. I assume the script only outputs one line because of a problem with your indentation. Make sure the print statement is at the same indentation level as the word_tokenize call.
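To illustrate the indentation point with plain lists (no NLTK needed; the rows are invented, and `str.split` stands in for the tokenize and tag calls):

```python
# Invented sample rows; each inner list plays the role of a CSV line.
rows = [['good morning', 'hello'], ['3', 'final field']]

outputs = []
for line in rows:
    for field in line:
        tokens = field.split()   # stand-in for word_tokenize(field)
        outputs.append(tokens)   # same level as the call above,
                                 # so it runs once per field

print(outputs)  # four entries, one per field
```

If the append (or print) were dedented outside the loops, it would run only once, which matches the single line of output described above.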
  • OAK about 9 years
    You are correct! Thanks for your assistance; you helped me tremendously today! There are still bumps in the road, but that's OK.
  • yvespeirsman about 9 years
    Glad to hear I could be of assistance! By the way, if you accept this answer by clicking the tick mark to the left, you'll help other users find their way to this solution.