Why won't Python display this text correctly? (UTF-8 Decoding Issue)


When you call con.read(), it returns a bytes object. Calling str() on it without specifying an encoding returns a string containing its representation, so you get the escape sequences rather than the real characters. (That means your string ends up containing \\xe2\\x80\\x99, with literal backslashes, as well as other unwanted things such as the leading b' and the trailing quote.) bytes is mostly like str in Python 2: it doesn't have any encoding information stored. str in Python 3 is like unicode in Python 2: it holds decoded Unicode text rather than raw bytes. So, when turning a bytes object into a str object, you need to tell it what encoding the bytes are actually in. In this case, that's utf-8.
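To see the difference concretely, here is a minimal sketch. The bytes literal is a hypothetical sample standing in for what the server returns; it is not the actual response.

```python
# Hypothetical sample: the UTF-8 bytes for "Indiana’s", standing in for con.read().
raw = "Indiana’s".encode("utf-8")

# str() with no encoding returns the repr of the bytes object, escapes and all.
as_repr = str(raw)
print(as_repr)               # b'Indiana\xe2\x80\x99s'
print(as_repr == repr(raw))  # True

# Decoding with the correct encoding yields the real characters.
text = raw.decode("utf-8")
print(text)                  # Indiana’s
```

Note that as_repr is a str whose first two characters are b and ', which is why a later replace("\xe2\x80\x99", "'") finds nothing to replace: the string holds backslashes and hex digits, not those three code points.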

Instead of calling str() on it, you would be better off using bytes.decode; it does the same thing as str(data, 'utf-8'), just more neatly.

>>> import urllib.request as u
>>> zipcode = 47401
>>> url = 'http://watchdog.net/us/?zip={}'.format(zipcode)
>>> con = u.urlopen(url)
>>> page = con.read().decode('utf-8')
>>> page[page.find("<title>") + 7:page.find("</title>") - 15]
'IN-09: Indiana’s 9th'

The only functional change that has been made here is the specification to decode the bytes object as 'utf-8'.
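If you want to try this without hitting the network, the same idea can be sketched offline. The raw value below is a made-up stand-in for con.read(), containing only a title element, so no trailing offset is needed when slicing.

```python
# Hypothetical page body standing in for con.read(); not the real response.
raw = "<title>IN-09: Indiana’s 9th</title>".encode("utf-8")

# Both forms decode the bytes as UTF-8 and give the same str.
page = raw.decode("utf-8")
same = str(raw, "utf-8")
print(page == same)  # True

# Slice out the text between the opening and closing title tags.
start = page.find("<title>") + len("<title>")
end = page.find("</title>")
title = page[start:end]
print(title)         # IN-09: Indiana’s 9th
```

Using len("<title>") instead of a hard-coded 7 is just a readability choice; the result is the same.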

Author by

Admin

Updated on July 24, 2022

Comments

  • Admin
    Admin over 1 year
    import urllib.request as u
    
    zipcode = str(47401)
    url = 'http://watchdog.net/us/?zip=' + zipcode
    con = u.urlopen(url)
    
    page = str(con.read())
    value3 = int(page.find("<title>")) + 7
    value4 = int(page.find("</title>")) - 15
    district = str(page[value3:value4])
    print(district)
    newdistrict = district.replace("\xe2\x80\x99","'")
    print(newdistrict)
    

    For some reason, my code is pulling in the title in the following format: IN-09: Indiana\xe2\x80\x99s 9th. I know the \xe string of characters is unicode for the ' symbol, but I can't figure out how to get python to replace that set of characters with the ' symbol. I've tried decoding the string but it's already in unicode and the replace code above doesn't change anything. Any advice as to what I'm doing incorrectly?

  • Admin
    Admin almost 12 years
    Thanks for your help. I had initially tried to decode the page with something like page = con.read(); newpage = page.decode('utf-8'), which had worked on previous assignments but was giving me a blank page here. Then I found that by removing the decode line I could get it to return the source code, so I just started working with that. Not sure what was going on. Thanks again for your help. :)
  • Chris Morgan
    Chris Morgan almost 12 years
    Your answer isn't correct as it's Python 3 he's dealing with.
  • Chris Morgan
    Chris Morgan almost 12 years
    Basically, it comes down to the fact that str(b'\xab') produces "b'\\xab'" instead of '\xab' (it is equivalent to `repr(b'\xab')`, as there is no meaningful conversion without specifying the encoding).