How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?


Solution 1

As justhalf points out above, my question here is essentially a duplicate of this question.

The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 byte sequences.

This apparently confuses BeautifulSoup about which encoding is in use. And when I tried decoding the content as UTF-8 myself before passing it to BeautifulSoup, like this:

soup = BeautifulSoup(response.read().decode('utf-8'))

I would get the error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: invalid continuation byte

Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.
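The difference between the two byte sequences can be reproduced in isolation. The following is a minimal sketch in Python 3 syntax (the code in this post is Python 2); the trailing `ber` is a hypothetical continuation, added only so that the decoder has a following byte to reject.

```python
# Correct UTF-8 encoding of U+00DC (Ü).
good = b'\xc3\x9c'
print(repr(good.decode('utf-8')))

# The broken sequence from the page, followed by hypothetical ASCII text.
# 0xe3 announces a three-byte character, so the decoder expects two
# continuation bytes; 0x9c qualifies, but the ASCII 'b' after it does not.
bad = b'\xe3\x9c' + b'ber'

error = None
try:
    bad.decode('utf-8')
except UnicodeDecodeError as exc:
    # Keep a reference: 'exc' is unbound once the except block ends.
    error = exc
    print(exc.reason, 'starting at byte', exc.start)
```

This is the same "invalid continuation byte" failure as in the traceback above, just at a much smaller offset.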

As the currently highest-rated answer on that question suggests, the invalid UTF-8 byte sequences can be dropped while decoding, so that only valid data is passed to BeautifulSoup:

soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
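One caveat worth keeping in mind: 'ignore' drops the offending bytes silently, so the mis-encoded Ü disappears from the text rather than being repaired. Here is a minimal sketch comparing the 'ignore' and 'replace' error handlers, in Python 3 syntax, with a made-up byte string standing in for response.read():

```python
# Hypothetical stand-in for response.read(): valid UTF-8 plus the
# broken 0xe3 0x9c sequence in place of a real 'Ü'.
raw = 'Hier k\xf6nnen Sie '.encode('utf-8') + b'\xe3\x9cbersicht'

ignored = raw.decode('utf-8', 'ignore')    # invalid bytes are dropped
replaced = raw.decode('utf-8', 'replace')  # invalid bytes become U+FFFD

print(ignored)
print(ascii(replaced))
```

Using 'replace' keeps a visible marker where data was lost, which may be preferable if you want to spot the damage later.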

Solution 2

Encoding the result to utf-8 seems to work for me:

print (soup.find('div', id='navbutton_account')['title']).encode('utf-8')

It yields:

Hier können Sie sich kostenlos registrieren und / oder einloggen!
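For writing the text to a file rather than the console, the same idea applies: choose the encoding explicitly at the point of output. This is a small sketch in Python 3 syntax, assuming the title has already been extracted as a Unicode string; the file name is made up for illustration.

```python
import io
import os
import tempfile

# Assumed to be the already-decoded title attribute.
title = 'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'title.txt')

# io.open accepts an explicit encoding for text-mode files.
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(title)

# Reading the raw bytes back shows how 'ö' was stored on disk.
with open(path, 'rb') as handle:
    written = handle.read()
print(written[:12])
```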
Author by

Christopher Orr


Updated on October 04, 2021

Comments

  • Christopher Orr

    I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup.

    However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.

    Sample program:

    import urllib2
    from BeautifulSoup import BeautifulSoup
    
    # Fetch URL
    url = 'http://www.voxnow.de/'
    request = urllib2.Request(url)
    request.add_header('Accept-Charset', 'utf-8')
    
    # Response has UTF-8 charset header,
    # and HTML body which is UTF-8 encoded
    response = urllib2.urlopen(request)
    
    # Parse with BeautifulSoup
    soup = BeautifulSoup(response)
    
    # Print title attribute of a <div> which uses umlauts (e.g. können)
    print repr(soup.find('div', id='navbutton_account')['title'])
    

    Running this gives the result:

    # u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'
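One reading of this output, offered as an inference from the code points rather than anything BeautifulSoup reports: U+0102 (Ă) and U+015B (ś) are exactly what the bytes 0xc3 0xb6 become when decoded as ISO-8859-2, so the parser may have fallen back to that encoding. A quick check in Python 3 syntax:

```python
# UTF-8 bytes for 'können', as served by the page.
raw = 'k\xf6nnen'.encode('utf-8')

# Decoding them with the wrong codec reproduces the escapes shown above.
misread = raw.decode('iso-8859-2')
print(ascii(misread))

# Decoding them as UTF-8 gives the expected \xf6.
print(ascii(raw.decode('utf-8')))
```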
    

    But I would expect a Python Unicode string to render ö in the word können as \xf6:

    # u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'
    

    I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and trying to read() and decode() the response object, but it either makes no difference, or throws an error.

    With the command curl www.voxnow.de | hexdump -C, I can see that the web page is indeed UTF-8 encoded (i.e. it contains 0xc3 0xb6 for the ö character):

          20 74 69 74 6c 65 3d 22  48 69 65 72 20 6b c3 b6  | title="Hier k..|
          6e 6e 65 6e 20 53 69 65  20 73 69 63 68 20 6b 6f  |nnen Sie sich ko|
          73 74 65 6e 6c 6f 73 20  72 65 67 69 73 74 72 69  |stenlos registri|
    

    I'm beyond the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?
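One way to debug this further is to locate any bytes that are not valid UTF-8 before handing anything to the parser. The sketch below uses Python 3 syntax; `find_invalid_utf8` is a hypothetical helper, not part of BeautifulSoup or the standard library, and the sample bytes are made up.

```python
def find_invalid_utf8(data, context=8):
    """Yield (start, end, nearby_bytes) for each span that is not valid UTF-8."""
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf-8')
            return  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice, so shift them.
            start, end = pos + exc.start, pos + exc.end
            yield start, end, data[max(0, start - context):end + context]
            pos = end  # resume scanning after the bad span


# Hypothetical sample: valid UTF-8 followed by a mis-encoded 'Ü'.
sample = 'Hier k\xf6nnen '.encode('utf-8') + b'\xe3\x9cber'
hits = list(find_invalid_utf8(sample))
for start, end, nearby in hits:
    print(start, end, nearby)
```

Running it over the bytes from response.read() would list each suspect offset together with a few surrounding bytes, which makes it easier to see what character was intended.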