How do I unescape HTML entities in a string in Python 3.1?

Solution 1

You could use the function html.unescape:

In Python 3.4+ (thanks to J.F. Sebastian for the update):

import html
html.unescape('Suzy &amp; John')
# 'Suzy & John'

html.unescape('&quot;')
# '"'

In Python 3.3 or older (HTMLParser.unescape was deprecated in 3.4 and removed in 3.9):

import html.parser
html.parser.HTMLParser().unescape('Suzy &amp; John')

In Python 2:

import HTMLParser
HTMLParser.HTMLParser().unescape('Suzy &amp; John')
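
If the same script has to run on several of the versions above, the three variants can be folded into one import fallback. This is only a sketch: the name `unescape` at module level is my own choice, and the fallback branches rely on the `HTMLParser().unescape` method that only exists on the older interpreters.

```python
try:
    # Python 3.4+: the function lives directly in the html module.
    from html import unescape
except ImportError:
    try:
        # Python 3.0 to 3.3: use the (later removed) parser method.
        from html.parser import HTMLParser
    except ImportError:
        # Python 2: the module has a different name.
        from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

print(unescape('Suzy &amp; John'))
print(unescape('&quot;quoted&quot;'))
```

Trying `html.unescape` first matters, because on Python 3.9+ the parser method is gone and only the first branch works.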

Solution 2

You can use xml.sax.saxutils.unescape for this purpose. This module is included in the Python standard library, and is portable between Python 2.x and Python 3.x.

>>> import xml.sax.saxutils as saxutils
>>> saxutils.unescape("Suzy &amp; John")
'Suzy & John'
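
Note that `xml.sax.saxutils.unescape` only handles `&amp;`, `&lt;` and `&gt;` by default, which is why the comments report that quotes are left alone. It accepts an optional `entities` dictionary for extra replacements; the mapping below (`extra_entities`) is just an illustrative choice, not a complete HTML entity table.

```python
from xml.sax.saxutils import unescape

# Hypothetical extra mapping; extend it with whatever entities you need.
extra_entities = {"&quot;": '"', "&apos;": "'"}

default_result = unescape("&quot;Suzy &amp; John&quot;")
extended_result = unescape("&quot;Suzy &amp; John&quot;", extra_entities)

print(default_result)   # &quot; is left untouched
print(extended_result)  # &quot; is replaced as well
```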

Solution 3

Apparently I don't have a high enough reputation to do anything but post this. unutbu's answer does not unescape quotations. The only thing that I found that did was this function:

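import re
# Python 2 only: in Python 3 this module is html.entities and unichr is chr.
from htmlentitydefs import name2codepoint as n2cp

def decodeHtmlentities(string):
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            # Numeric character reference such as &#34;
            return unichr(int(ent))
        else:
            # Named entity such as &quot;; unknown names are left untouched
            cp = n2cp.get(ent)
            if cp:
                return unichr(cp)
            else:
                return match.group()
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")
    return entity_re.subn(substitute_entity, string)[0]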

Which I got from this page.
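
The function above is written for Python 2 (`htmlentitydefs`, `unichr`). A rough Python 3 port might look like the following. The name `decode_html_entities` and the extra handling of hexadecimal references such as `&#x27;` are my additions, not part of the original answer.

```python
import re
from html.entities import name2codepoint

ENTITY_RE = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

def decode_html_entities(text):
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            try:
                if ent[:1] in ("x", "X"):
                    # Hexadecimal reference, e.g. &#x27;
                    return chr(int(ent[1:], 16))
                # Decimal reference, e.g. &#39;
                return chr(int(ent))
            except ValueError:
                # Not a valid number or code point: leave the text as-is.
                return match.group()
        cp = name2codepoint.get(ent)
        # Unknown entity names are left untouched.
        return chr(cp) if cp else match.group()
    return ENTITY_RE.sub(substitute_entity, text)

print(decode_html_entities("&quot;Suzy &amp; John&quot;"))
```

On Python 3.4+ `html.unescape` covers all of this and more, so a hand-rolled version like this is mainly of interest on 3.1 to 3.3.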

Solution 4

Python 3.x has html.entities too.
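
For reference, `html.entities` exposes lookup tables rather than an unescape function. A small sketch of how the two classic tables can be used (the variable names are only for illustration):

```python
from html.entities import name2codepoint, codepoint2name

# Entity name -> Unicode code point -> character
quote_char = chr(name2codepoint["quot"])
euml_char = chr(name2codepoint["euml"])

# Code point -> entity name (the reverse direction)
amp_name = codepoint2name[ord("&")]

print(quote_char, euml_char, amp_name)
```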

Solution 5

In my case I have an HTML string escaped with the AS3 escape function. After an hour of googling I hadn't found anything useful, so I wrote this recursive function to serve my needs. Here it is:

def unescape(string):
    index = string.find("%")
    if index == -1:
        return string
    else:
        # An escaped Unicode character (%uXXXX) needs different decoding
        if string[index+1:index+2] == 'u':
            replace_with = ("\\"+string[index+1:index+6]).decode('unicode_escape')
            string = string.replace(string[index:index+6],replace_with)
        else:
            replace_with = string[index+1:index+3].decode('hex')
            string = string.replace(string[index:index+3],replace_with)
        return unescape(string)

Edit-1 Added functionality to handle unicode characters.
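
The function above relies on `str.decode`, which only exists in Python 2. A possible Python 3 take on the same idea is sketched below. It assumes the input uses `%XX` and `%uXXXX` escapes with each value standing for a single code point; the name `unescape_percent` is hypothetical.

```python
import re

# %uXXXX (four hex digits) or %XX (two hex digits)
_ESCAPE_RE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")

def unescape_percent(text):
    def replace(match):
        # Exactly one of the two groups is set; both hold hex digits.
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE_RE.sub(replace, text)

print(unescape_percent("Suzy%20%26%20John"))
```

Unlike the recursive version, this makes a single pass, so text that decodes to something that looks like another escape (for example `%2541` becoming `%41`) is not decoded a second time.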

Author by

VolatileRig

Updated on May 27, 2020

Comments

  • VolatileRig almost 4 years

    I have looked all around and only found solutions for python 2.6 and earlier, NOTHING on how to do this in python 3.X. (I only have access to Win7 box.)

    I HAVE to be able to do this in 3.1 and preferably without external libraries. Currently, I have httplib2 installed and access to command-prompt curl (that's how I'm getting the source code for pages). Unfortunately, as far as I know, curl does not decode HTML entities; I couldn't find a command for it in the documentation.

    YES, I've tried to get Beautiful Soup to work, MANY TIMES without success in 3.X. If you could provide EXPLICIT instructions on how to get it to work in python 3 in MS Windows environment, I would be very grateful.

    So, to be clear, I need to turn strings like this: "Suzy &amp; John" into a string like this: "Suzy & John".

  • Martin Thoma over 12 years
    This does not unescape &quot; for example.
  • unutbu over 12 years
    @moose: Thanks for the warning. I've changed my answer to something that handles more HTML entities, including &quot;.
  • Martin Thoma over 12 years
    Thank you very much! I gave your answer +1.
  • bcoughlan over 11 years
    Seems to be incomplete, '&euml' didn't decode with this although it does with htmlparser
  • iElectric about 11 years
    Or for those that use six: six.moves.html_parser.HTMLParser().unescape
  • jfs over 9 years
    It is exposed as html.unescape() since Python 3.4
  • Saurabh Yadav about 6 years
    html package is not installing in python 3.6
  • unutbu about 6 years
    @SaurabhYadav: The html package is part of the Python standard library. It does not need to be installed separately. If import html raises an error, then your Python distribution was not installed properly.
  • Ángel over 4 years
    It also doesn't unescape decimal characters
  • canbax about 4 years
    It does not work for 'Don&‌#039;t forget that &‌pi; = 3.14 &‌amp; doesn&‌#039;t equal 3.' WHY is that?
  • Clément about 2 years
    @canbax because the & is followed by a \u200c, a zero-width non-joiner character.
  • canbax about 2 years
    @KiranJonnalagadda thanks but it's been more than 2 years
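
The zero-width non-joiner problem raised in the comments can be reproduced and worked around roughly like this. It is a sketch that assumes the stray U+200C characters carry no meaning in your input and can simply be dropped before decoding.

```python
import html

ZWNJ = "\u200c"  # zero-width non-joiner, invisible when printed

# Each "&" is followed by a ZWNJ, so no entity is recognized.
raw = ("Don&\u200c#039;t forget that &\u200cpi; = 3.14 "
       "&\u200camp; doesn&\u200c#039;t equal 3.")

untouched = html.unescape(raw)

# Strip the invisible characters first, then unescape.
cleaned = raw.replace(ZWNJ, "")
decoded = html.unescape(cleaned)

print(decoded)
```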