Parsing UTF-8/unicode strings with lxml HTML

OK, I just found the answer. Writing the question out on Stack Overflow often helps.

When given a byte string, etree.HTML() tries to guess the encoding from the meta element in the document:

<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/>

In this case, I had manually converted the document to UTF-8, so it was no longer in EUC-JP, the Japanese encoding that the meta element declares. Solving the issue is just a matter of forcing the HTML parser to read the input as UTF-8. In our case the code becomes:

>>> myparser = etree.HTMLParser(encoding="utf-8")
>>> tree = etree.HTML(htmltext, parser=myparser)
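For readers on Python 3, the same fix can be sketched as below. This is a minimal illustration, not the original session: the markup in `html_text` is a hypothetical, shortened stand-in for the downloaded page, and the `/css/common.css` path is made up for the example. The meta element still declares EUC-JP while the bytes are UTF-8, which is the situation described above.

```python
from lxml import etree

# Hypothetical stand-in for the downloaded page (not the real Rakuten markup):
# the meta element declares EUC-JP, but the bytes passed to lxml are UTF-8.
html_text = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/>'
    '<title>【楽天市場】Shopping is Entertainment!</title>'
    '<link rel="stylesheet" href="/css/common.css"/>'
    '</head><body></body></html>'
)
html_bytes = html_text.encode("utf-8")

# Tell the parser the real encoding so the stale meta declaration is not used.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.HTML(html_bytes, parser=parser)

css_paths = tree.xpath('//link[@rel="stylesheet"]/@href')
title = tree.findtext(".//title")
```

With the explicit parser, the elements after the Japanese title should be reachable again, so `css_paths` is expected to contain the stylesheet link.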

Author: Bayleef

Have been making things with the Web since 1991 or 1992. Python, Web technologies and a few other things. Currently working for Mozilla, always open to proposals. Previous employers: Opera Software, Pheromone Web Agency, W3C (World Wide Web Consortium).

Updated on June 22, 2022

Comments

  • Bayleef
    Bayleef almost 2 years

    I have been trying, without success, to parse UTF-8 encoded text with etree.HTML().

    → python
    Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05) 
    [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from lxml import etree
    >>> import requests
    >>> headers = {'User-Agent': "Opera/9.80 (Macintosh; Intel Mac OS X 10.8.0) Presto/2.12.363 Version/12.50"}
    >>> r = requests.get("http://www.rakuten.co.jp/", headers=headers)
    >>> r.status_code
    200
    >>> r.headers
    {'x-cache': 'MISS from www.rakuten.co.jp', 'transfer-encoding': 'chunked', 'set-cookie': 'wPzd=lng%3DNA%3Acnt%3DCA; expires=Tue, 13-Aug-2013 16:51:38 GMT; path=/; domain=www.rakuten.co.jp', 'server': 'Apache', 'pragma': 'no-cache', 'cache-control': 'private', 'date': 'Mon, 13 Aug 2012 16:51:38 GMT', 'content-type': 'text/html; charset=EUC-JP'}
    >>> responsetext = r.text
    

    So far so good. The response text is fine and it is a unicode string. Getting the list of CSS URIs from it works without any issue either.

    >>> tree = etree.HTML(responsetext)
    >>> csspathlist = tree.xpath('//link[@rel="stylesheet"]/@href')
    >>> csspathlist
    ['http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/common.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/layout.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/sidecolumn.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/api.css?v=1207111500', '/com/inc/home/20080930/beta/css/liquid/myrakuten_dpgs.css', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/leftcolumn.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/header.css?v=1207111500', '/com/inc/home/20080930/opt/css/normal/footer.css', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/ipad.css', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/genre.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/supersale.css?v=1207111500', '/com/inc/home/20080930/beta/css/liquid/rakuten_membership.css', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/noscript/set.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/suggest-2.0.1.css?v=1204231500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/liquid_banner.css?v=1203011138', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/area_announce.css?v=1203011138']
    

    Now let's encode the unicode string as UTF-8 and request the list of CSS URIs again.

    >>> htmltext = responsetext.encode('utf-8')
    >>> tree2 = etree.HTML(htmltext)
    >>> csspathlist2 = tree2.xpath('//link[@rel="stylesheet"]/@href')
    >>> csspathlist2
    []
    

    I get an empty list.

    >>> etree.tostring(tree2)
    '<html lang="ja" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml"><head><meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/><meta http-equiv="Content-Style-Type" content="text/css"/><meta http-equiv="Content-Script-Type" content="text/javascript"/><title/></head></html>'
    

    Indeed, the second parse stopped as soon as it reached the first Japanese character in the title.

    <meta http-equiv="Content-Script-Type" content="text/javascript"/>
    <title> 【楽天市場】Shopping is Entertainment! : インターネット最大級の通信販売、通販オンラインショッピングコミュニティ </title>
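One way to see why the parse breaks at exactly that character is to decode the title bytes by hand. The sketch below uses Python 3's built-in codecs rather than the decoder inside lxml, so it only illustrates the mismatch; `try_decode` is a hypothetical helper written for this example.

```python
# Title text taken from the page, shortened for the example.
title = "【楽天市場】Shopping is Entertainment!"

utf8_bytes = title.encode("utf-8")      # what the re-encoded document contains
euc_jp_bytes = title.encode("euc-jp")   # what the meta element promises


def try_decode(data, encoding):
    """Return the decoded text, or None if data is not valid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(utf8_bytes, "utf-8")          # decodes cleanly
as_euc_jp = try_decode(utf8_bytes, "euc-jp")       # fails on the first Japanese character
round_trip = try_decode(euc_jp_bytes, "euc-jp")    # matching bytes and label decode cleanly
```

The UTF-8 bytes of the first Japanese character are not a valid EUC-JP sequence, so a decoder that trusts the EUC-JP label has nothing sensible to produce from that point on.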
    

    I'm still trying to understand what I have done wrong.

    • Andreas Jung
      Andreas Jung over 11 years
      +1 for the good write up
    • Bayleef
      Bayleef over 11 years
      @Maulwurfn thanks. I found the answer after searching for 3 hours. In the end it took only a few minutes to figure out once I had written the question down properly.