Escape unescaped characters in XML with Python

Solution 1

If you don't mind losing the invalid characters, you can use the recover option of lxml's XMLParser (see Parsing broken XML with lxml.etree.iterparse):

from lxml import etree

# broken_xml is the malformed document from the question, as a string
parser = etree.XMLParser(recover=True)  # recover from bad characters
root = etree.fromstring(broken_xml, parser=parser)
print(etree.tostring(root).decode())

Output

<root>
<element>
<name>name  surname</name>
<mail>[email protected]</mail>
</element>
</root>
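
Note that the output above has lost the ampersand rather than escaped it. If the character has to survive, one alternative is to escape bare ampersands first and then parse strictly. The following is only a sketch under assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, the regular expression is an illustrative guess at what counts as an existing entity, and the sample is a trimmed copy of the question's document.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the malformed document from the question.
broken_xml = """<root>
 <element>
  <name>name & surname</name>
 </element>
</root>"""

# Escape every '&' that does not already start an entity or character reference.
repaired = re.sub(r"&(?![A-Za-z]+;|#[0-9]+;|#x[0-9A-Fa-f]+;)", "&amp;", broken_xml)

# The repaired text is well-formed, so a strict parser accepts it.
root = ET.fromstring(repaired)
name_text = root.find("element/name").text
print(name_text)  # name & surname
```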

Solution 2

<name>name & surname</name>

is not well-formed XML. It should be:

<name>name &amp; surname</name>

All conformant XML tools should create this - you normally do not have to worry. If you create a string with the '&' character then an XML tool will output the escaped version. If you create the string by hand it is your responsibility to make sure it is escaped. If you use an XML editor it should escape it for you.
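
A minimal sketch of that point, using Python's standard library (xml.etree.ElementTree and xml.sax.saxutils are chosen here for illustration, not named in the answer): a strict parser rejects the raw ampersand, a serializer escapes it for you, and escape() covers the by-hand case.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A strict parser refuses the raw ampersand.
try:
    ET.fromstring("<name>name & surname</name>")
    parsed_ok = True
except ET.ParseError:
    parsed_ok = False

# Building the element through the API escapes the text on output.
name = ET.Element("name")
name.text = "name & surname"
serialized = ET.tostring(name, encoding="unicode")

# When assembling XML by hand, escaping the text is your job.
by_hand = "<name>" + escape("name & surname") + "</name>"

print(parsed_ok)   # False
print(serialized)  # <name>name &amp; surname</name>
print(by_hand)     # <name>name &amp; surname</name>
```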

If the file was given to you by someone else, send it back and tell them it is not well-formed. If they are no longer around, you will have to fix it in a plain text editor. That's fragile and messy, but there is no other way. If the file also contains ampersands that are used for escaping, then the file is garbage.

See a 10-year-old post here and a later one here.

Solution 3

You're probably just wanting to do some simple regexp-ery on the HTML before throwing it into BeautifulSoup.

Even simpler, if there aren't any SGML entities (&...;) in the code, html = html.replace('&', '&amp;') will do the trick.

Otherwise, try this:

x = "<html><h1>Fish & Chips & Gravy</h1><p>Fish &amp; Chips &#x0026; Gravy</p>"
import re
# escape each '&' that is followed by something other than a letter or '#'
q = re.sub(r'&([^a-zA-Z#])', r'&amp;\1', x)
print(q)

Essentially the regex looks for an & that is not followed by a letter or a # character. Because it needs a following character to match, it won't deal with an ampersand at the very end of the input, but that's probably fixable.
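
One possible fix for the end-of-input case is a negative lookahead, which checks what follows the & without consuming it. This is a sketch, not the answer's own code: the name escape_bare_ampersands and the pattern are illustrative assumptions, and the pattern treats anything shaped like &name; as an existing entity, so text such as "Tom&Jerry;" would be left alone.

```python
import re

# An '&' that does not begin a named, decimal, or hex character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

def escape_bare_ampersands(text):
    """Replace bare ampersands with &amp;, leaving existing entities untouched."""
    return BARE_AMP.sub("&amp;", text)

sample = "<p>Fish & Chips &amp; Gravy &#38; Peas &</p>"
fixed = escape_bare_ampersands(sample)
print(fixed)  # <p>Fish &amp; Chips &amp; Gravy &#38; Peas &amp;</p>
```

The lookahead also handles an ampersand that is the last character of the string, since nothing has to follow it for the match to succeed.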

Jérôme Pigeot

Updated on June 04, 2022

Comments

Jérôme Pigeot 12 months
I need to escape special characters in an invalid XML file which is about 5000 lines long. Here's an example of the XML that I have to deal with:
```
<root>
 <element>
  <name>name & surname</name>
  <mail>[email protected]</mail>
 </element>
</root>
```
Here the problem is the character "&" in the name. How would you escape special characters like this with a Python library? I didn't find a way to do it with BeautifulSoup.
Jérôme Pigeot over 12 years

The XML is generated by a Novell tool called metamig: it exports the trustees from an NSS server. There are folders with the & character, so I have to escape all of those to parse the file correctly.
peter.murray.rust over 12 years

Assuming you have quoted it correctly, it's PSEUDO-xml. I don't know this tool, but if you have reported it correctly it should never have got out. It's just WRONG. And if you pay money for it, demand your money back.
Jérôme Pigeot over 12 years

finally i used the parse method from lxml.html.soupparser : it can parse my ugly xml without crying :) Thanks for your answer
Asclepius almost 4 years

This answer, while useful, won't escape the unescaped characters. It will apparently simply drop them.
lorenzo_campanile over 3 years

Thank you peter, I didn't know the '&' should be escaped in a correct XML file. You saved me a research on why Python ElementTree didn't show the '&' character