NLTK python tokenizing a CSV file
As you can read in the Python csv documentation, csv.reader "returns a reader object which will iterate over lines in the given csvfile". In other words, if you want to tokenize the text in your CSV file, you will have to go through the lines and the fields in those lines:
for line in reader:
    for field in line:
        tokens = word_tokenize(field)
Also, since you import word_tokenize at the beginning of your script, you should call it as word_tokenize, and not as nltk.word_tokenize. This also means you can drop the import nltk statement.
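Putting both points together, here is a minimal, self-contained sketch of the loop above. It uses io.StringIO with made-up sample data as a stand-in for the real file, and a plain whitespace split as a stand-in for word_tokenize, so it runs without NLTK's tokenizer data; in real code you would call word_tokenize(field) instead:

```python
import csv
import io

# Stand-in for a CSV file on disk (hypothetical sample data)
csvfile = io.StringIO('hello world,foo bar\nbaz qux,quux corge\n')

reader = csv.reader(csvfile, delimiter=',')

all_tokens = []
for line in reader:             # each line is a list of fields
    for field in line:          # each field is a single string
        tokens = field.split()  # stand-in for word_tokenize(field)
        all_tokens.append(tokens)

print(all_tokens)
```

The key point is that the reader yields lists of strings, not one big string, which is why word_tokenize must be applied per field rather than to the reader object itself.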
OAK
Updated on June 04, 2022

Comments
-
OAK about 2 years
I have begun to experiment with Python and NLTK. I am getting a lengthy error message that I cannot find a solution to, and would appreciate any insights you may have.
import nltk, csv, numpy
from nltk import sent_tokenize, word_tokenize, pos_tag

reader = csv.reader(open('Medium_Edited.csv', 'rU'), delimiter=",", quotechar='|')
tokenData = nltk.word_tokenize(reader)
I'm running Python 2.7 and the latest nltk package on OSX Yosemite. These are also two lines of code I attempted with no difference in results:
with open("Medium_Edited.csv", "rU") as csvfile:
    tokenData = nltk.word_tokenize(reader)
These are the error messages I see:
Traceback (most recent call last):
  File "nltk_text.py", line 11, in <module>
    tokenData = nltk.word_tokenize(reader)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 101, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 86, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1278, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
Thanks in advance
-
OAK about 9 years Thanks for the response, this is my edited code:

import csv
import numpy as np
from nltk import sent_tokenize, word_tokenize, pos_tag

reader = csv.reader(open('Milling_Final_Edited.csv', 'rU'), delimiter=',', quotechar='"')
for line in reader:
    for field in line:
        tokens = word_tokenize(field)
I am a newbie to Python and NLTK, so I have some catching up to do. Now I am receiving an encoding error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 1: ordinal not in range(128). The file is UTF-8, though.
-
yvespeirsman about 9 years Try to import codecs and open the file as codecs.open('Milling_Final_Edited.csv', 'rU', encoding="utf-8")
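For reference, a small runnable sketch of the codecs approach. The file path and contents here are made up for illustration; the point is that giving codecs.open an explicit encoding avoids the ascii-codec UnicodeDecodeError on non-ASCII bytes:

```python
import codecs
import os
import tempfile

# Create a small UTF-8 file to read back (stand-in for Milling_Final_Edited.csv)
path = os.path.join(tempfile.gettempdir(), 'sample_utf8.csv')
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(u'caf\u00e9,3\n')

# Reading with an explicit encoding decodes the bytes as UTF-8
# instead of falling back to the ascii codec
with codecs.open(path, 'r', encoding='utf-8') as f:
    content = f.read()

print(content)
```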
-
OAK about 9 years New code:

import csv, codecs
import numpy as np
from nltk import sent_tokenize, word_tokenize, pos_tag

reader = codecs.open('Milling_Final_Edited.csv', 'r', encoding="utf-8", errors="ignore")
for line in reader:
    for field in line:
        tokens = word_tokenize(field)
        posData = pos_tag(tokens)
print(posData)
I had to add errors="ignore" to codecs.open to solve the encoding error, but now I have another issue. The output is [(u'3', 'LS')]. What does the u character in front of the actual file character ('3') mean? Also, the output is only 1 line out of 25 lines.
-
yvespeirsman about 9 years The u indicates this is a Unicode string. I assume the script only outputs one line because of a problem with your indentation. Make sure the print statement is at the same indentation level as the word_tokenize call.
-
OAK about 9 years You are correct! Thanks for your assistance. You helped me tremendously today! There are still bumps in the road, but that's ok.
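To illustrate the indentation point from the exchange above: output only appears once per field when the collecting/printing statement sits inside the inner loop. This sketch uses plain lists as a stand-in for the csv reader rows and str.split as a stand-in for word_tokenize, so it runs without NLTK:

```python
rows = [['a b', 'c d'], ['e f', 'g h']]  # stand-in for csv.reader output

outputs = []
for line in rows:
    for field in line:
        tokens = field.split()   # stand-in for word_tokenize(field)
        outputs.append(tokens)   # same indentation level as the call above
        print(tokens)

# Had the append/print been dedented outside both loops, only the last
# field's tokens would be seen -- matching the "only 1 line" symptom.
print(len(outputs))
```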
-
yvespeirsman about 9 yearsGlad to hear I could be of assistance! By the way, if you accept this answer by clicking the tick mark to the left, you'll help other users find their way to this solution.