lxml etree xmlparser remove unwanted namespace

29,480

Solution 1

import io
import lxml.etree as ET

# io.BytesIO needs bytes, so use a bytes literal (required on Python 3)
content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
dom = ET.parse(io.BytesIO(content))

You can find namespace-aware nodes using the xpath method:

body = dom.xpath('//ns:Body', namespaces={'ns': 'http://www.example.com/zzz/yyy'})
print(body)
# [<Element {http://www.example.com/zzz/yyy}Body at 90b2d4c>]
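The same prefix mapping also works with find() and findtext(), which avoids writing the full {uri}tag form each time. A minimal sketch, assuming the document from the question; the names NS and root are only illustrative:

```python
import lxml.etree as ET

# Prefix-to-URI mapping; the prefix "ns" is arbitrary and only used in the paths below.
NS = {"ns": "http://www.example.com/zzz/yyy"}

xml = (
    b'<Envelope xmlns="http://www.example.com/zzz/yyy">'
    b"<Header><Version>1</Version></Header>"
    b"<Body>some stuff</Body>"
    b"</Envelope>"
)
root = ET.fromstring(xml)

# find() and findtext() accept the same namespaces mapping as xpath().
body = root.find("ns:Body", namespaces=NS)
version = root.findtext("ns:Header/ns:Version", namespaces=NS)

print(body.tag)
print(version)
```

The namespace stays in the tree; only the lookups become shorter.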

If you really want to remove namespaces, you could use an XSL transformation:

# http://wiki.tei-c.org/index.php/Remove-Namespaces.xsl
xslt='''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no"/>

<xsl:template match="/|comment()|processing-instruction()">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

<xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
</xsl:template>

<xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="."/>
    </xsl:attribute>
</xsl:template>
</xsl:stylesheet>
'''

# Encode the stylesheet to bytes before wrapping it in BytesIO (required on Python 3)
xslt_doc = ET.parse(io.BytesIO(xslt.encode()))
transform=ET.XSLT(xslt_doc)
dom=transform(dom)

Here we see the namespace has been removed:

print(ET.tostring(dom).decode())  # tostring() returns bytes on Python 3
# <Envelope>
#   <Header>
#     <Version>1</Version>
#   </Header>
#   <Body>
#     some stuff
#   </Body>
# </Envelope>

So you can now find the Body node this way:

print(dom.find("Body"))
# <Element Body at 8506cd4>
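If an XSLT stylesheet feels heavy, a similar effect can be had in plain Python by rewriting each tag to its local name. This is a different technique from the XSLT above, shown only as a sketch; the helper name strip_namespaces is hypothetical, and it modifies the tree in place:

```python
import lxml.etree as ET

def strip_namespaces(root):
    """Rewrite element and attribute names to their local names, in place."""
    for el in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(el.tag, str):
            continue
        el.tag = ET.QName(el).localname
        # Copy the items first, because attributes are modified inside the loop.
        for name, value in list(el.attrib.items()):
            local = ET.QName(name).localname
            if local != name:
                del el.attrib[name]
                el.attrib[local] = value
    # Drop namespace declarations that are no longer referenced.
    ET.cleanup_namespaces(root)
    return root

root = ET.fromstring(
    b'<Envelope xmlns="http://www.example.com/zzz/yyy">'
    b"<Body>some stuff</Body>"
    b"</Envelope>"
)
strip_namespaces(root)
print(root.tag)
print(root.find("Body").text)
```

Note that elements from different namespaces that share a local name become indistinguishable after this step.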

Solution 2

Try using XPath with local-name(), which matches on the element name and ignores its namespace:

dom.xpath("//*[local-name() = 'Body']")

Taken (and simplified) from this page, under the section "The xpath() method".
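A self-contained sketch of the local-name() lookup, assuming the document from the question. The helper find_by_local_name is hypothetical; it passes the name as an XPath variable so that it does not have to be spliced into the expression:

```python
import lxml.etree as ET

def find_by_local_name(node, name):
    """Return all descendants of node whose local name equals name, in any namespace."""
    # $name is an XPath variable; lxml fills it from the keyword argument.
    return node.xpath("//*[local-name() = $name]", name=name)

root = ET.fromstring(
    b'<Envelope xmlns="http://www.example.com/zzz/yyy">'
    b"<Header><Version>1</Version></Header>"
    b"<Body>some stuff</Body>"
    b"</Envelope>"
)

matches = find_by_local_name(root, "Body")
print(len(matches))
print(matches[0].tag)
```

Because the namespace is ignored, this also matches a Body element from any other namespace in the same document.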

Solution 3

The last solution from https://bitbucket.org/olauzanne/pyquery/issue/17 lets you avoid namespaces with little effort:

Apply xml.replace(' xmlns:', ' xmlnamespace:') to your XML before using pyquery, so that lxml ignores the namespaces.

In your case, try xml.replace(' xmlns="', ' xmlnamespace="'). However, you might need something more complex if that string can also appear in element bodies.
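A rough sketch of the replacement applied to the document from the question. The count argument of 1 is an addition not found in the linked issue; it limits the rewrite to the first occurrence, which here is the declaration on the root element:

```python
import lxml.etree as ET

xml = (
    '<Envelope xmlns="http://www.example.com/zzz/yyy">'
    "<Body>some stuff</Body>"
    "</Envelope>"
)

# Rename the default-namespace declaration so the parser sees an ordinary attribute.
# The trailing 1 (an assumption of this sketch) stops after the first match.
munged = xml.replace(' xmlns="', ' xmlnamespace="', 1)

root = ET.fromstring(munged)
print(root.tag)
print(root.get("xmlnamespace"))
print(root.find("Body").text)
```

The namespace URI survives as a plain attribute named xmlnamespace, so it will show up if the tree is serialized again.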

Solution 4

Another not-too-bad option is to use the QName helper and wrap it in a function with a default namespace:

from lxml import etree

DEFAULT_NS = 'http://www.example.com/zzz/yyy'

def tag(name, ns=DEFAULT_NS):
    return etree.QName(ns, name)

dom = etree.parse(path)
body = dom.getroot().find(tag('Body'))
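For a version that runs without a file on disk, here is a sketch that parses the question's document from a bytes literal. It also shows one way to build a nested path from the helper via the .text attribute of QName; the variable names are illustrative:

```python
from lxml import etree

DEFAULT_NS = "http://www.example.com/zzz/yyy"

def tag(name, ns=DEFAULT_NS):
    """Build a namespace-qualified name for use with find() and friends."""
    return etree.QName(ns, name)

root = etree.fromstring(
    b'<Envelope xmlns="http://www.example.com/zzz/yyy">'
    b"<Header><Version>1</Version></Header>"
    b"<Body>some stuff</Body>"
    b"</Envelope>"
)

# find() accepts a QName directly.
body = root.find(tag("Body"))

# For multi-step paths, join the {uri}name strings exposed by .text.
version_path = "/".join(tag(n).text for n in ("Header", "Version"))
version = root.findtext(version_path)

print(body.text)
print(version)
```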
Author by

Mark

Updated on July 09, 2022

Comments

  • Mark
    Mark almost 2 years

    I have an XML document that I am trying to parse using lxml.etree:

    <Envelope xmlns="http://www.example.com/zzz/yyy">
      <Header>
        <Version>1</Version>
      </Header>
      <Body>
        some stuff
      </Body>
    </Envelope>
    

    My code is:

    path = "path to xml file"
    from lxml import etree as ET
    parser = ET.XMLParser(ns_clean=True)
    dom = ET.parse(path, parser)
    dom.getroot()
    

    When I try to get dom.getroot() I get:

    <Element {http://www.example.com/zzz/yyy}Envelope at 28adacac>
    

    However I only want:

    <Element Envelope at 28adacac>
    

    When I do

    dom.getroot().find("Body")
    

    I get nothing returned. However, when I do

    dom.getroot().find("{http://www.example.com/zzz/yyy}Body") 
    

    I get a result.

    I thought passing ns_clean=True to the parser would prevent this.

    Any ideas?

  • Neil Albrock
    Neil Albrock over 12 years
    XSLT to remove all namespaces. Just what I was looking for, genius.
  • Walt W
    Walt W about 12 years
    This is amazing. You have changed my life, thank you. (ps, whoever designed XML namespaces, wtf?)
  • bukzor
    bukzor almost 11 years
    String munging is always the path to madness. In the general case, this answer is dead wrong. Suppose you're formatting an rss feed of this exact question -- the result would tell people to xml.replace(' xmlnamespace="', ' xmlnamespace="')...
  • AZhao
    AZhao almost 8 years
    FYI, if using Python 3 you will need to encode the xslt string first, i.e. xslt_doc=ET.parse(io.BytesIO(str.encode(xslt)))
  • Sergey Kolesnik
    Sergey Kolesnik over 2 years
    QName(ns, **name**) - misprint
  • Tom
    Tom over 2 years
    @SergeyKolesnik good catch - I've fixed the answer.