UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

Solution 1

site[i:i+35].decode('utf-8')

You cannot slice the received bytes at arbitrary offsets and then ask UTF-8 to decode each slice. UTF-8 is a variable-width encoding: a single character takes anywhere from 1 to 4 bytes. If a slice ends partway through a multibyte character, Python cannot decode it and raises the "unexpected end of data" error. Here, byte 0xc3 is the first byte of a two-byte sequence, and the 35-byte slice ends (at position 34) before the second byte is included.
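To make the failure concrete, here is a minimal sketch in Python 3 syntax (the question uses Python 2.7, but the behaviour is the same). The sample markup and the variable names are made up for illustration; only the marker string comes from the question.

```python
# Made-up page content containing one two-byte character ('é' = 0xC3 0xA9).
page = '<p>café</p><img id="imgSized" class="slideImg" src="a.jpg">'.encode("utf-8")

# Cut the bytes right after the lead byte 0xC3, i.e. in the middle of 'é'.
cut = page.index(b"\xc3") + 1
try:
    page[:cut].decode("utf-8")
    slice_error = None
except UnicodeDecodeError as exc:
    slice_error = exc.reason  # the message text from the traceback

marker = '<img id="imgSized" class="slideImg"'

# Option 1: decode the whole response once, then search the text.
text = page.decode("utf-8")
text_pos = text.find(marker)

# Option 2: skip decoding and search the raw bytes with a bytes pattern.
bytes_pos = page.find(marker.encode("utf-8"))

print(slice_error, text_pos, bytes_pos)
```

Both options avoid decoding partial slices. Note that the two positions differ, because one counts characters and the other counts bytes.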

Rather than searching the markup by hand, use a tool that handles this for you. BeautifulSoup and lxml are two options.
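As a rough illustration of what a parser does for you, here is a sketch that uses only the standard library's `html.parser` as a stand-in, not BeautifulSoup or lxml themselves. The class name `ImgFinder`, the sample markup, and the assumption that the image sits inside the link to the next page are all hypothetical; the real page may be structured differently.

```python
from html.parser import HTMLParser


class ImgFinder(HTMLParser):
    """Records the src of <img id=...> and the href of the <a> wrapping it."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.open_links = []  # hrefs of the <a> tags that are currently open
        self.src = None
        self.link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.open_links.append(attrs.get("href"))
        elif tag == "img" and attrs.get("id") == self.wanted_id:
            self.src = attrs.get("src")
            # Assumption: the enclosing link, if any, is the "next" link.
            self.link = self.open_links[-1] if self.open_links else None

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links:
            self.open_links.pop()


# Made-up response bytes; in the scraper this would be response.read().
raw = ('<a href="page2.html"><img id="imgSized" class="slideImg" '
       'src="photos/é.jpg"></a>').encode("utf-8")

finder = ImgFinder("imgSized")
finder.feed(raw.decode("utf-8"))  # decode the whole page once, then parse
print(finder.src, finder.link)
```

The parser works on the decoded text as a whole, so no multibyte character is ever split.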

Solution 2

Open the CSV file in Sublime Text and choose File > "Save with Encoding" > UTF-8.
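The same re-save can be scripted. This is a sketch in Python 3 under one loud assumption: you already know the file's current encoding (`cp1252` below is only an example). The helper name `resave_as_utf8` and the demo file are hypothetical.

```python
import os
import tempfile


def resave_as_utf8(path, source_encoding):
    """Rewrite the file at `path` as UTF-8, reading it as `source_encoding`."""
    # newline="" keeps the file's original line endings untouched.
    with open(path, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Demo on a throwaway file written in cp1252 (assumed source encoding).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "wb") as f:
    f.write("name,city\nRené,Zürich\n".encode("cp1252"))

resave_as_utf8(demo_path, "cp1252")

with open(demo_path, "rb") as f:
    resaved = f.read()
```

If the assumed source encoding is wrong, the result will be readable as UTF-8 but contain the wrong characters, so check a few non-ASCII values afterwards.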

Author by

user3701032

Updated on June 09, 2021

Comments

  • user3701032
    user3701032 almost 3 years

    I'm trying to write a scraper, but I'm having issues with encoding. When I tried to copy the string I was looking for into my text file, Python 2.7 told me it didn't recognize the encoding, even though the string contains no special characters. I don't know if that's useful info.

    My code looks like this:

    from urllib import FancyURLopener
    import os
    
    class MyOpener(FancyURLopener): #spoofs a real browser on Window
       version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
    
    print "What is the webaddress?"
    webaddress = raw_input("8::>")
    
    print "Folder Name?"
    foldername = raw_input("8::>")
    
    if not os.path.exists(foldername):
        os.makedirs(foldername)
    
    def urlpuller(start, page):
       while page[start]!= '"':
          start += 1
       close = start
       while page[close]!='"':
          close += 1
       return page[start:close]
    
    myopener = MyOpener()
    
    response = myopener.open(webaddress)
    site = response.read()
    
    nexturl = ''
    counter = 0
    
    while(nexturl!=webaddress):
       counter += 1
       start = 0
       
       for i in range(len(site)-35):
           if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
             start = i + 40
             break
       else:
          print "Something's broken, chief. Error = 1"
       
       next = 0
       
       for i in range(start, 8, -1):
          if site[i:i+8] == u'<a href=':
             next = i
             break
       else:
          print "Something's broken, chief. Error = 2"
       
       nexturl = urlpuller(next, site)
       
       myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')
    
    print("Retrieval of "+foldername+" completed.")
    

    When I run it against the site I'm scraping, it raises this error:

    Traceback (most recent call last):
      File "yada/yadayada/Python/scraper.py", line 37, in <module>
        if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
      File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data
    

    When pointed at http://google.com, it worked just fine.

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    

    The site I'm scraping declares UTF-8 in its header (the meta tag above), but when I try to decode using utf-8, as you can see, it does not work.

    Any suggestions?

  • user3701032
    user3701032 almost 10 years
    Does this fix the encoding problem?
  • user3701032
    user3701032 almost 10 years
    Is there a way to do the decoding myself or is that much more complicated?
  • Martin Konecny
    Martin Konecny almost 10 years
    You would need some kind of streaming UTF-8 decoder so that you know where you can safely break off your string. Alternatively, you can decode the whole page at once (don't split up your string).
  • Martin Konecny
    Martin Konecny almost 10 years
    Take a look here for a stream decoder: mikehadlow.blogspot.ca/2012/07/…
  • user3701032
    user3701032 almost 10 years
    I'm trying to use BeautifulSoup now. What would I do to find the img with the ID imgSized?
  • user3701032
    user3701032 almost 10 years
    I'm able to search img, but I'm not sure why it's having problems with the tags. I was able to isolate the image I need, but ideally I'd like to be able to search for the link associated with the mouse over text as well.
  • Martin Konecny
    Martin Konecny almost 10 years