UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data
Solution 1
site[i:i+35].decode('utf-8')
You cannot arbitrarily slice the bytes you've received and then ask UTF-8 to decode each piece. UTF-8 is a variable-length, multibyte encoding: a single character takes anywhere from 1 to 4 bytes. If a slice cuts a character in half and you ask Python to decode it, it throws the "unexpected end of data" error.
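A minimal sketch of the failure and one way around it, written for Python 3 (the question uses Python 2.7). The page content here is made up for illustration; only the `<img>` tag text comes from the question, and `slice_decodes` is a hypothetical helper.

```python
# Hypothetical page: the tag from the question followed by an attribute
# containing "é", which UTF-8 encodes as the two bytes 0xC3 0xA9.
page_bytes = '<img id="imgSized" class="slideImg" alt="caf\u00e9">'.encode("utf-8")

def slice_decodes(data, start, length):
    """Return True if data[start:start+length] is valid UTF-8 on its own."""
    try:
        data[start:start + length].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A slice that ends right after 0xC3 stops in the middle of "é".
cut = page_bytes.index(b"\xc3") + 1
broken = slice_decodes(page_bytes, 0, cut)       # slice ends mid-character
intact = slice_decodes(page_bytes, 0, cut + 1)   # slice includes both bytes

# Decode the whole page once, then search the resulting text.
text = page_bytes.decode("utf-8")
tag_index = text.find('<img id="imgSized" class="slideImg"')
```

Another option, if only ASCII markers are being searched for, is to skip decoding and search the raw bytes with a bytes pattern such as `page_bytes.find(b'<img id="imgSized"')`.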
Look into a tool that handles this for you. BeautifulSoup and lxml are two options.
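As a dependency-free illustration of letting a parser do the work, here is a sketch that uses Python 3's standard-library `html.parser` as a stand-in for the third-party tools named above. The class name `ImgByIdFinder`, the sample markup, and the `src` value are assumptions made for the example; only the `imgSized` id and `slideImg` class come from the question.

```python
from html.parser import HTMLParser

class ImgByIdFinder(HTMLParser):
    """Record the src attribute of the first <img> whose id matches."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "img" and self.src is None:
            attr_map = dict(attrs)
            if attr_map.get("id") == self.target_id:
                self.src = attr_map.get("src")

# Hypothetical markup standing in for the already-decoded page text.
sample_html = (
    '<a href="/next"><img id="imgSized" class="slideImg" '
    'src="/images/1.jpg"></a>'
)

finder = ImgByIdFinder("imgSized")
finder.feed(sample_html)
print(finder.src)
```

The parser works on tags and attributes rather than fixed-width byte windows, so it does not matter where multibyte characters fall in the page.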
Solution 2
Open the CSV file in Sublime Text and choose "Save with Encoding" -> UTF-8.
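The same re-save can be sketched in Python 3, assuming you already know the file's current encoding. The function name `resave_as_utf8`, the Latin-1 source encoding, and the demo file contents are all hypothetical; if the real encoding is different, the decode step will fail or produce wrong characters.

```python
import io
import os
import tempfile

def resave_as_utf8(path, source_encoding):
    """Rewrite the file at path as UTF-8, decoding it from source_encoding."""
    # newline="" keeps the original line endings untouched in both directions.
    with io.open(path, "r", encoding=source_encoding, newline="") as handle:
        text = handle.read()
    with io.open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)

# Demo on a throwaway file that is assumed to be Latin-1 encoded.
fd, demo_path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(demo_path, "wb") as handle:
    handle.write("name\ncaf\u00e9\n".encode("latin-1"))

resave_as_utf8(demo_path, "latin-1")

with open(demo_path, "rb") as handle:
    converted = handle.read()
os.remove(demo_path)
```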
user3701032
Updated on June 09, 2021
Comments
-
user3701032 almost 3 years
I'm trying to write a scraper, but I'm having issues with encoding. When I tried to copy the string I was looking for into my text file, python2.7 told me it didn't recognize the encoding, despite no special characters. Don't know if that's useful info. My code looks like this:
from urllib import FancyURLopener
import os

class MyOpener(FancyURLopener): #spoofs a real browser on Window
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

print "What is the webaddress?"
webaddress = raw_input("8::>")
print "Folder Name?"
foldername = raw_input("8::>")

if not os.path.exists(foldername):
    os.makedirs(foldername)

def urlpuller(start, page):
    while page[start]!= '"':
        start += 1
    close = start
    while page[close]!='"':
        close += 1
    return page[start:close]

myopener = MyOpener()
response = myopener.open(webaddress)
site = response.read()

nexturl = ''
counter = 0
while(nexturl!=webaddress):
    counter += 1
    start = 0
    for i in range(len(site)-35):
        if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
            start = i + 40
            break
    else:
        print "Something's broken, chief. Error = 1"
    next = 0
    for i in range(start, 8, -1):
        if site[i:i+8] == u'<a href=':
            next = i
            break
    else:
        print "Something's broken, chief. Error = 2"
    nexturl = urlpuller(next, site)
    myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')

print("Retrieval of "+foldername+" completed.")
When I try to run it using the site I'm using, it returns the error:
Traceback (most recent call last):
  File "yada/yadayada/Python/scraper.py", line 37, in <module>
    if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data
When pointed at http://google.com, it worked just fine.
The page's header contains
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
but when I try to decode using utf-8, as you can see, it does not work.
Any suggestions?
-
user3701032 almost 10 years
Does this fix the encoding problem?
-
user3701032 almost 10 years
Is there a way to do the decoding myself, or is that much more complicated?
-
Martin Konecny almost 10 years
You would need some kind of streaming UTF-8 decoder so that you know where you can safely break off your string. Alternatively, you can decode the whole page at once (don't split up your string).
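One possible shape for such a streaming decoder, sketched for Python 3 with the standard library's `codecs.getincrementaldecoder`. The helper name `decode_chunks` and the sample data are assumptions for illustration, not code from the question or the linked post.

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Yield decoded text from byte chunks that may split characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # Incomplete trailing bytes are buffered until the next chunk arrives.
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

data = "caf\u00e9 au lait".encode("utf-8")
# Split between the two bytes of "é" (0xC3 | 0xA9) on purpose.
chunks = [data[:4], data[4:]]
pieces = list(decode_chunks(chunks))
print(pieces)
```

The first piece stops before the split character, and the buffered byte is emitted together with the second chunk, so joining the pieces gives back the original text.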
-
Martin Konecny almost 10 years
Take a look here for a stream decoder: mikehadlow.blogspot.ca/2012/07/…
-
user3701032 almost 10 years
I'm trying to use BeautifulSoup now. What would I do to find the img with the ID imgSized?
-
user3701032 almost 10 years
I'm able to search img, but I'm not sure why it's having problems with the tags. I was able to isolate the image I need, but ideally I'd like to be able to search for the link associated with the mouse-over text as well.
-
Martin Konecny almost 10 years
This should help: stackoverflow.com/questions/11696745/…