Delete XML node containing certain element

sed regular-expression python perl xml

10,917

Solution 1

With xmlstarlet:

xmlstarlet ed -d '//Placemark[.//tessellate]' < myplaces.kml

Since KML uses a namespace, you have to declare it first with -N and prefix the element names in the XPath expression (see the xmlstarlet documentation):

xmlstarlet ed -N 'ns=http://www.opengis.net/kml/2.2' -d '//ns:Placemark[.//ns:tessellate]' myplaces.kml
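
If xmlstarlet is not available, the same namespace-aware deletion can be sketched with Python's standard xml.etree.ElementTree. This is only an illustration: the sample document and the function name remove_tessellated are made up for the example, and the "ns" prefix is just a local alias for the KML 2.2 namespace URI, as in the xmlstarlet command above.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
NS = {"ns": KML_NS}  # local alias for the namespace, as with xmlstarlet -N

# Hypothetical sample input, standing in for myplaces.kml
SAMPLE = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark><name>keep</name></Placemark>
<Placemark><LineString><tessellate>1</tessellate></LineString></Placemark>
</Document>
</kml>"""

def remove_tessellated(xml_text):
    # Serialize the KML namespace as the default (unprefixed) namespace again
    ET.register_namespace("", KML_NS)
    root = ET.fromstring(xml_text)
    placemark_tag = "{%s}Placemark" % KML_NS
    # ElementTree has no getparent(), so visit every possible parent;
    # list() snapshots keep removal from disturbing the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if (child.tag == placemark_tag
                    and child.find(".//ns:tessellate", NS) is not None):
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")

result = remove_tessellated(SAMPLE)
print(result)
```

Without the namespace mapping, an unqualified search for Placemark would find nothing in a real KML file, which is the same reason the first xmlstarlet command does not match.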

With perl, you'd need to process the file as a whole (not line by line) and add the s flag to s/// so that . also matches newlines. Even then, and even with a non-greedy match, it would still match from the first <Placemark> up to the first </Placemark> that occurs after the next <tessellate>, swallowing any Placemarks in between. So you'd need to write it something like:

perl -0777 -pe 's|(<Placemark>.*?</Placemark>)|
   $1 =~ /<tessellate>/?"":$1|gse'

Solution 2

Using Python (2.7) with lxml (a third-party module, not part of the standard library):

file test.xml:

<Container>
<Placemark>
  <KeepMe/>
</Placemark>
<Placemark>
    <styleUrl>#m_ylw-pushpin330</styleUrl>
    <LineString>
        <tessellate>1</tessellate>
        <coordinates>
            0.0000000000000,0.0000000000000,0 0.0000000000000,0.0000000000000,0
        </coordinates>
    </LineString>
</Placemark>
</Container>

And the program:

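#! /usr/bin/env python

from __future__ import print_function  # works on 2.x and 3.x
from lxml import etree  # third-party package: pip install lxml

file_name = 'test.xml'
tree = etree.parse(file_name)
# Collect the matches first so that removing elements does not disturb the iteration.
for element in list(tree.iterfind('.//Placemark')):
    if element.find('.//tessellate') is not None:
        element.getparent().remove(element)

# tostring() returns bytes; decode so Python 3 prints plain text.
print(etree.tostring(tree).decode())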

gives as output:

<Container>
<Placemark>
  <KeepMe/>
</Placemark>
</Container>
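
For a version that really does use only standard modules, here is a rough sketch with xml.etree.ElementTree. ElementTree elements have no getparent(), so the sketch builds a child-to-parent map first. The function name drop_placemarks_with_tessellate and the inline sample (a shortened test.xml) are assumptions made for the example, and it ignores namespaces, just like the lxml program above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of test.xml, inlined so the sketch is self-contained
TEST_XML = """<Container>
<Placemark>
  <KeepMe/>
</Placemark>
<Placemark>
    <LineString>
        <tessellate>1</tessellate>
    </LineString>
</Placemark>
</Container>"""

def drop_placemarks_with_tessellate(xml_text):
    root = ET.fromstring(xml_text)
    # Map each element to its parent, since ElementTree lacks getparent()
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # findall() returns a list, so removing while looping is safe here
    for placemark in root.findall(".//Placemark"):
        if placemark.find(".//tessellate") is not None:
            parent_of[placemark].remove(placemark)
    return ET.tostring(root, encoding="unicode")

cleaned = drop_placemarks_with_tessellate(TEST_XML)
print(cleaned)
```

For a real KML file the tag names would additionally need the namespace, as in Solution 1.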

Solution 3

Given this test file:

start
<Placemark>
        <tessellate>1</tessellate>
</Placemark>
middle1
<Placemark>
</Placemark>
middle2
<Placemark>
        <tessellate>1</tessellate>
</Placemark>
end

If you run perl -0 -pe 's|<Placemark>.*?<tessellate>.*?</Placemark>||gs' as you suggested, it will remove too much:

start

middle1

end

This is because the regex only looks forward. It finds a start tag, then takes everything up to the first tessellate tag and on to the next end tag. Unfortunately, it does not care whether it consumes more start tags along the way...

If you want to do it with regexes you have to process each block on its own: perl -0 -pe 's|<Placemark>.*?</Placemark>|$&=~/<tessellate>/?"":$&|gse'

This should give the desired result.
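
The difference between the two regexes can also be reproduced in Python, which may make it easier to experiment with. This is only a sketch: it assumes, like the perl one-liners, that <Placemark> tags carry no attributes and are never nested, and the names TEXT, BLOCK and drop_if_tessellate are made up for the example.

```python
import re

# The test file from above, inlined
TEXT = """start
<Placemark>
        <tessellate>1</tessellate>
</Placemark>
middle1
<Placemark>
</Placemark>
middle2
<Placemark>
        <tessellate>1</tessellate>
</Placemark>
end
"""

# Naive pattern: once started, it runs past the empty Placemark to reach
# the next <tessellate>, so it swallows "middle2" as well.
naive = re.sub(r"<Placemark>.*?<tessellate>.*?</Placemark>", "", TEXT,
               flags=re.DOTALL)

# Per-block pattern: match one Placemark at a time, then decide.
BLOCK = re.compile(r"<Placemark>.*?</Placemark>", re.DOTALL)

def drop_if_tessellate(match):
    block = match.group(0)
    # Delete the block only if it contains a tessellate element
    return "" if "<tessellate>" in block else block

per_block = BLOCK.sub(drop_if_tessellate, TEXT)
print(per_block)
```

Here re.DOTALL plays the role of perl's s flag, and the replacement function does what the e flag with $& does in the one-liner.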


Rizwan Khan

Updated on September 18, 2022

Comments

Rizwan Khan over 1 year
I want to remove all Placemarks from a KML file that contain the element <tessellate>. The following block should be wholly removed:
```
<Placemark>
    <styleUrl>#m_ylw-pushpin330</styleUrl>
    <LineString>
        <tessellate>1</tessellate>
        <coordinates>
            0.0000000000000,0.0000000000000,0 0.0000000000000,0.0000000000000,0
        </coordinates>
    </LineString>
</Placemark>
```
I have tried some non-greedy perl regex with no luck (a lot of stuff is removed together with the first <Placemark>):
```
sed -r ':a; N; $!ba; s/\n\t*//g' myplaces.kml |
perl -pe 's|<Placemark>.*?<tessellate>.*?</Placemark>||g'
```
I believe an XML parser is the way to go, but I read the documentation for xmlstarlet and got nowhere. So any solutions in xmlstarlet, python, etc. are also welcome!
- michas about 11 years
  
  Any good reason for not using an xml parser?
- Gilles 'SO- stop being evil' about 11 years
  
  Definitely use an XML parser.
Mathias Begert over 9 years

You mentioned standard modules, but lxml is not standard. Did you mean ElementTree?
Nasri Najib over 8 years

Just adding desired result output: start middle1 <Placemark> </Placemark> middle2 end
abhishek over 4 years

Using xmlstarlet is the best answer; it works like a charm on complex XML files as well as in cases where the selection needs to be based on an attribute value. Also, if you are not able to install xmlstarlet using yum etc., see this link -- pkgs.org/download/xmlstarlet. I was able to download the Linux package and run it as a standalone utility, without needing sudo/root access to install new packages.