Why won't Python display this text correctly? (UTF-8 Decoding Issue)
When you call con.read(), it returns a bytes object. Calling str() on it without specifying an encoding returns a string containing its representation, so you get the escape sequences rather than the real characters. (That means that your string ends up containing \\xe2\\x80\\x99 as well as all sorts of other undesired things.) bytes is mostly like str in Python 2: it doesn't have any encoding information stored. str in Python 3 is like unicode in Python 2; it holds decoded text. So, when turning a bytes object into a str object, you need to tell it what encoding the bytes are actually in. In this case, that's utf-8.
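To make the difference concrete, here is a minimal sketch that needs no network access. The `raw` value is a hypothetical stand-in for the bytes that con.read() would return; it is not the real page. It also shows why the replace() call in the question appears to do nothing.

```python
# Hypothetical sample standing in for the bytes returned by con.read().
raw = b"IN-09: Indiana\xe2\x80\x99s 9th"

# str() without an encoding returns the repr of the bytes object.
as_repr = str(raw)
print(as_repr)  # b'IN-09: Indiana\xe2\x80\x99s 9th'

# That string holds a literal backslash followed by "xe2", and so on.
# The pattern "\xe2\x80\x99" is three characters (U+00E2, U+0080, U+0099),
# which never occur in it, so replace() leaves the string unchanged.
unchanged = as_repr.replace("\xe2\x80\x99", "'")
print(unchanged == as_repr)  # True

# Decoding with the right encoding gives the real text,
# including the curly apostrophe U+2019.
decoded = raw.decode("utf-8")
print(decoded == "IN-09: Indiana\u2019s 9th")  # True
```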
Instead of calling str() on it, you would be better off using bytes.decode; it does the same thing as str() with an encoding argument, just more neatly.
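As a quick check of that equivalence, the sketch below compares the two spellings on a hypothetical sample value (`raw` is made up for illustration). It also shows what happens if the wrong encoding is named; latin-1 is used here only as an example of a mismatched choice.

```python
# Hypothetical sample bytes: UTF-8 text containing a curly apostrophe.
raw = "Indiana\u2019s 9th".encode("utf-8")

# Both calls decode the same bytes with the same encoding.
via_method = raw.decode("utf-8")
via_str = str(raw, "utf-8")
print(via_method == via_str)  # True

# Naming the wrong encoding does not raise here, but it maps each of the
# three apostrophe bytes to a separate character, producing mojibake.
mojibake = raw.decode("latin-1")
print(len(via_method), len(mojibake))  # 13 15
```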
>>> import urllib.request as u
>>> zipcode = 47401
>>> url = 'http://watchdog.net/us/?zip={}'.format(zipcode)
>>> con = u.urlopen(url)
>>> page = con.read().decode('utf-8')
>>> page[page.find("<title>") + 7:page.find("</title>") - 15]
'IN-09: Indiana’s 9th'
The only functional change made here is decoding the bytes object as 'utf-8'.
Admin
Updated on July 24, 2022
Comments
-
Admin over 1 year
import urllib.request as u
zipcode = str(47401)
url = 'http://watchdog.net/us/?zip=' + zipcode
con = u.urlopen(url)
page = str(con.read())
value3 = int(page.find("<title>")) + 7
value4 = int(page.find("</title>")) - 15
district = str(page[value3:value4])
print(district)
newdistrict = district.replace("\xe2\x80\x99","'")
print(newdistrict)
For some reason, my code is pulling in the title in the following format: IN-09: Indiana\xe2\x80\x99s 9th. I know the \xe string of characters is unicode for the ' symbol, but I can't figure out how to get Python to replace that set of characters with the ' symbol. I've tried decoding the string but it's already in unicode and the replace code above doesn't change anything. Any advice as to what I'm doing incorrectly?
-
Admin almost 12 years
Thanks for your help. I had initially tried to decode the file by using something like: page = con.read() newpage = page.decode('utf-8'), which had worked on previous assignments but was giving me a blank page here. Then I found that by removing the decode line I could get it to return the source code, so I just started working with that. Not sure what was going on, thanks again for your help. :)
-
Chris Morgan almost 12 years
Your answer isn't correct, as it's Python 3 he's dealing with.
-
Chris Morgan almost 12 years
Basically, it comes down to the fact that str(b'\xab') produces "b'\\xab'" instead of '\xab' (it is equivalent to repr(b'\xab'), as there is no meaningful conversion without specifying the encoding).