How to remove non-UTF-8 characters and save as a CSV file in Python

If the input file is not utf-8 encoded, it is probably not a good idea to try to read it as utf-8...

You have basically 2 ways to deal with decode errors:

  • use a charset that will accept any byte, such as iso-8859-15, also known as latin9
  • if the input should be utf-8 but contains errors, use errors='ignore' -> silently removes the bytes that are not valid utf-8, or errors='replace' -> replaces them with a replacement marker (U+FFFD when decoding, often displayed as ?)

For example:

f = open(INPUT_FILE_NAME, encoding="latin9")

or

f = open(INPUT_FILE_NAME, encoding="utf-8", errors="replace")
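
To make the difference between these options concrete, here is a minimal sketch that decodes the same bytes three ways and then re-saves a file as clean utf-8. The sample bytes and the helper name `resave_as_utf8` are assumptions made for illustration; they are not part of the original question.

```python
# Sample bytes are an assumption for illustration: "caf" followed by 0xf8,
# the byte reported in the question's error message. It is not valid UTF-8.
raw = b"caf\xf8"

# latin9 (iso-8859-15) maps every byte to a character, so decoding never fails.
as_latin9 = raw.decode("latin9")

# errors="ignore" silently drops the bytes that are not valid UTF-8.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode("utf-8", errors="replace")


def resave_as_utf8(src_path, dst_path, errors="replace"):
    """Hypothetical helper: read a file as UTF-8, tolerating bad bytes,
    and write the result back out as valid UTF-8."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, encoding="utf-8", errors=errors, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

With latin9 the 0xf8 byte becomes a real character (ø), which is only correct if the file really was written in that charset. With `ignore` or `replace` the original byte is lost, so these are best kept for cases where the damaged characters do not matter.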
Author by

Jasmine

Updated on June 05, 2022

Comments

  • Jasmine
    Jasmine almost 2 years

    I have some Amazon review data that I successfully converted from text format to CSV. The problem is that when I try to read it into a DataFrame using pandas, I get this error message: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte

    I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove them and save the result to another CSV file?

    Thank you!

    EDIT1: Here is the code I use to convert the text to CSV:

    import csv
    import string
    INPUT_FILE_NAME = "small-movies.txt"
    OUTPUT_FILE_NAME = "small-movies1.csv"
    header = [
        "product/productId",
        "review/userId",
        "review/profileName",
        "review/helpfulness",
        "review/score",
        "review/time",
        "review/summary",
        "review/text"]
    f = open(INPUT_FILE_NAME,encoding="utf-8")
    
    outfile = open(OUTPUT_FILE_NAME,"w")
    
    outfile.write(",".join(header) + "\n")
    currentLine = []
    for line in f:
    
       line = line.strip()  
       # need to remove the commas so that the review text won't be split across many columns
       line = line.replace(',','')
    
       if line == "":
          outfile.write(",".join(currentLine))
          outfile.write("\n")
          currentLine = []
          continue
       parts = line.split(":",1)
       currentLine.append(parts[1])
    
    if currentLine != []:
        outfile.write(",".join(currentLine))
    f.close()
    outfile.close()
    

    EDIT2:

    Thanks to all of you for trying to help me out. I have solved it by specifying the output encoding in my code:

     outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
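
A sketch of the same conversion with the encoding stated explicitly on both files, assuming the input is a series of `key: value` lines with records separated by blank lines, as in the code above. The function name `convert_reviews`, the use of `csv.writer` (which quotes commas instead of removing them), and `errors="replace"` on the input are choices made here for illustration, not part of the original fix.

```python
import csv


def convert_reviews(in_path, out_path, header, in_encoding="utf-8"):
    """Hypothetical helper: turn blank-line-separated 'key: value'
    records into CSV rows, writing the output as UTF-8."""
    with open(in_path, encoding=in_encoding, errors="replace") as src, \
         open(out_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        row = []
        for line in src:
            line = line.strip()
            if not line:
                # A blank line ends the current record. Unlike the code
                # above, repeated blank lines do not produce empty rows.
                if row:
                    writer.writerow(row)
                    row = []
                continue
            # Keep only the text after the first colon; lines without a
            # colon are skipped instead of raising IndexError.
            parts = line.split(":", 1)
            if len(parts) == 2:
                row.append(parts[1].strip())
        # Write the last record if the file does not end with a blank line.
        if row:
            writer.writerow(row)
```

Because the output encoding no longer depends on the operating system default, the resulting file can be read back with `encoding="utf-8"` on any platform.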
    
    • Martijn Pieters
      Martijn Pieters over 8 years
      More likely that all data is not UTF-8 encoded. Show us how you converted the text format to CSV.
    • Jasmine
      Jasmine over 8 years
      I pasted it into my question. Thanks.
    • Dunes
      Dunes over 8 years
      Why haven't you specified the encoding of the file you are writing to? It is likely that the default encoding of your OS is something other than utf-8.
    • Jasmine
      Jasmine over 8 years
      @Dunes you are absolutely right!
  • Jasmine
    Jasmine over 8 years
    Hi, thanks a lot for the quick update. I ran the code, but I got this error message: AttributeError: 'str' object has no attribute 'decode'. I guess the reason is that when I convert the text file to CSV, I open it with f = open(INPUT_FILE_NAME,encoding="utf-8"). I have to add the utf-8 encoding there; otherwise my text-to-CSV conversion throws this error: "return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 5206: character maps to <undefined>"
  • hspandher
    hspandher over 8 years
    Are you using Python 2.x or 3.x?
  • Jasmine
    Jasmine over 8 years
    It is 3.4. Here is my code to convert the text to CSV:
  • hspandher
    hspandher over 8 years
    OK, I hadn't read your comment carefully. So you are using 'utf-8' encoding to open the file, as I mentioned in my edited answer. What's the issue now?
  • Jasmine
    Jasmine over 8 years
    I just pasted the text-to-CSV conversion code into my question.
  • hspandher
    hspandher over 8 years
    It works when you add the encoding, doesn't it?
  • Serge Ballesta
    Serge Ballesta over 8 years
    This answer is for Python 3.x as OP stated he was using Python 3.4
  • hspandher
    hspandher over 8 years
    I have edited my answer to remove the Unicode content. Please give this one a try; it should work.
  • Jasmine
    Jasmine over 8 years
    I got this by using each of them:
  • Jasmine
    Jasmine over 8 years
    Thank you very much. Sorry I wasn't clear at first. I don't know why there is a -1, but I cannot change it to 0 or 1; Stack Overflow says I don't have enough reputation to do so...
  • hspandher
    hspandher over 8 years
    @Jasmine you can upvote an answer only after you gain 15 reputation, but you can mark it as the correct answer if it helped.