NLTK python tokenizing a CSV file


As the Python csv documentation explains, csv.reader "returns a reader object which will iterate over lines in the given csvfile". In other words, word_tokenize cannot be applied to the reader object itself; if you want to tokenize the text in your CSV file, you have to iterate over the lines and over the fields in those lines:

from nltk import word_tokenize

for line in reader:
    for field in line:
        tokens = word_tokenize(field)

Also, since you import word_tokenize at the beginning of your script, you should call it as word_tokenize, not as nltk.word_tokenize. This also means you can drop the import nltk statement.
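For reference, here is the same loop run end-to-end against an in-memory CSV. The sample data is invented for illustration, and `str.split` stands in for `word_tokenize` so the sketch has no NLTK dependency:

```python
import csv
from io import StringIO

# In-memory stand-in for the CSV file; '|' is the quote character,
# matching the csv.reader call in the question.
data = StringIO('|hello world|,|good morning|\n|one more row|,|final field|\n')
reader = csv.reader(data, delimiter=',', quotechar='|')

all_tokens = []
for line in reader:              # each line is a list of fields
    for field in line:           # each field is a plain string
        tokens = field.split()   # stand-in for word_tokenize(field)
        all_tokens.extend(tokens)

print(all_tokens)
```

Because the tokenizer only accepts a string, the call belongs inside the inner loop, where `field` is guaranteed to be one.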

Author: OAK

Updated on June 04, 2022

Comments

  • OAK about 2 years

    I have begun experimenting with Python and NLTK. I am running into a lengthy error message that I cannot find a solution to, and would appreciate any insights you may have.

    import nltk,csv,numpy 
    from nltk import sent_tokenize, word_tokenize, pos_tag
    reader = csv.reader(open('Medium_Edited.csv', 'rU'), delimiter= ",",quotechar='|')
    tokenData = nltk.word_tokenize(reader)
    

    I'm running Python 2.7 and the latest nltk package on OSX Yosemite. These are two more lines of code I attempted, with no difference in the results:

    with open("Medium_Edited.csv", "rU") as csvfile:
    tokenData = nltk.word_tokenize(reader)
    

    These are the error messages I see:

    Traceback (most recent call last):
      File "nltk_text.py", line 11, in <module>
        tokenData = nltk.word_tokenize(reader)
      File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 101, in word_tokenize
        return [token for sent in sent_tokenize(text, language)
      File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 86, in sent_tokenize
        return tokenizer.tokenize(text)
      File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
        return list(self.sentences_from_text(text, realign_boundaries))
      File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
        return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
      File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
        return [(sl.start, sl.stop) for sl in slices]
      File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
        for sl1, sl2 in _pair_iter(slices):
      File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
        prev = next(it)
      File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1278, in _slices_from_text
        for match in self._lang_vars.period_context_re().finditer(text):
    TypeError: expected string or buffer
    

    Thanks in advance
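The last frame of the traceback points at the root cause: Punkt ultimately applies a regular expression to its input, and a csv.reader object is not a string. The same failure can be reproduced without NLTK (the regex and data below are illustrative only; on Python 3 the message reads "bytes-like object" rather than "buffer"):

```python
import csv
import re
from io import StringIO

reader = csv.reader(StringIO('a,b\n'))

try:
    # Mimics the failing call: a regex applied to a reader object
    # instead of a string, as in word_tokenize(reader).
    re.compile(r'\w+').findall(reader)
except TypeError as exc:
    message = str(exc)

print(message)  # begins with "expected string"
```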

  • OAK about 9 years
    Thanks for the response, this is my edited code: `import csv import numpy as np from nltk import sent_tokenize, word_tokenize as word_tokenize, pos_tag reader = csv.reader(open('Milling_Final_Edited.csv', 'rU'), delimiter=',', quotechar='"') for line in reader: for field in line: tokens = word_tokenize(field)`. I am a newbie to Python and NLTK, so I have some catching up to do. Now I am receiving an encoding error: `UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 1: ordinal not in range(128)`. The file is UTF-8, though.
  • yvespeirsman about 9 years
    Try to import codecs and open the file as `codecs.open('Milling_Final_Edited.csv', 'rU', encoding="utf-8")`.
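Sketched out, that suggestion looks like this; a throwaway temp file stands in for Milling_Final_Edited.csv, since the real file isn't available here:

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file to stand in for the real CSV.
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(path, 'wb') as f:
    f.write(u'caf\u00e9,3\n'.encode('utf-8'))

# codecs.open decodes each line as it is read,
# so downstream code never sees raw bytes.
with codecs.open(path, 'r', encoding='utf-8') as f:
    lines = [line.strip() for line in f]

print(lines)  # ['café,3']
```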
  • OAK about 9 years
    New code: `import csv, codecs import numpy as np from nltk import sent_tokenize, word_tokenize as word_tokenize, pos_tag as pos_tag reader = codecs.open('Milling_Final_Edited.csv', 'r', encoding="utf-8", errors="ignore") for line in reader: for field in line: tokens = word_tokenize(field) posData = pos_tag(tokens) print(posData)`. I had to add errors="ignore" to codecs.open to solve the encoding error, but now I have another issue. The output is [(u'3', 'LS')]. What does the u character in front of the actual file character ('3') mean? Also, the output is only 1 line out of 25 lines.
  • yvespeirsman about 9 years
    The u indicates this is a unicode string. I assume the script only outputs one line because of a problem with your indentation. Make sure the print statement is at the same indentation level as the word_tokenize call.
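To illustrate the indentation point with plain lists (no NLTK needed; the rows are invented, and `str.split` stands in for the tokenize and tag calls):

```python
# Invented sample rows; each inner list plays the role of a CSV line.
rows = [['good morning', 'hello'], ['3', 'final field']]

outputs = []
for line in rows:
    for field in line:
        tokens = field.split()   # stand-in for word_tokenize(field)
        outputs.append(tokens)   # same level as the call above,
                                 # so it runs once per field

print(outputs)  # four entries, one per field
```

If the append (or print) were dedented outside the loops, it would run only once, which matches the single line of output described above.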
  • OAK about 9 years
    You are correct! Thanks for your assistance; you helped me tremendously today! There are still bumps in the road, but that's OK.
  • yvespeirsman about 9 years
    Glad to hear I could be of assistance! By the way, if you accept this answer by clicking the tick mark to the left, you'll help other users find their way to this solution.