Extract the Children of a Specific XML Element Type

text-processing awk sed xml ctags

7,894

Solution 1

If you really want sed- or awk-like command-line processing for XML files, then you should probably consider using an XML-processing command-line tool such as xmlstarlet, which is used in the example further down.

You should also be aware that there are several XML-specific programming/query languages, XPath among them.

Note that (in order to be well-formed XML) your data needs a single root element and your attribute values must be quoted, i.e. your data file should look more like this:

<!-- data.xml -->

<instances>

    <instance ab='1'>
        <a1>aa</a1>
        <a2>aa</a2>
    </instance>

    <instance ab='2'>
        <b1>bb</b1>
        <b2>bb</b2>
    </instance>

    <instance ab='3'>
        <c1>cc</c1>
        <c2>cc</c2>
    </instance>

</instances>
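
To see why both changes matter, here is a minimal sketch that feeds three variants of the data to Python's strict standard-library parser. The helper name `is_well_formed` and the shortened sample strings are made up for illustration; they are not part of the original question.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the strict XML parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Two sibling elements with no enclosing root element
no_root = "<instance ab='1'><a1>aa</a1></instance><instance ab='2'><b1>bb</b1></instance>"
# A root element, but an unquoted attribute value
unquoted = "<instances><instance ab=1><a1>aa</a1></instance></instances>"
# A root element and quoted attribute values
fixed = "<instances><instance ab='1'><a1>aa</a1></instance></instances>"

for label, text in [("no root", no_root), ("unquoted", unquoted), ("fixed", fixed)]:
    print(label, is_well_formed(text))
```

Only the last variant is accepted; the first two raise `ParseError`.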

If your data is well-formed XML, then you can use XPath with xmlstarlet to get exactly what you want with a very concise command:

xmlstarlet sel -t -m '//instance' -c "./*" -n data.xml

This produces the following output:

<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>

Or you could use Python (my personal favorite choice). Here is a Python script that accomplishes the same task:

#!/usr/bin/env python3
"""extract_instance_children.py"""

import sys
import xml.etree.ElementTree as ET

# Load the data
tree = ET.parse(sys.argv[1])
root = tree.getroot()

# Extract and output the child elements
for instance in root.iter("instance"):
    # encoding="unicode" makes tostring() return str rather than bytes
    print(''.join(ET.tostring(child, encoding="unicode").strip() for child in instance))

And here is how you could run the script:

python3 extract_instance_children.py data.xml

This uses the xml package from the Python Standard Library, which is also a strict XML parser.
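
If the file cannot be edited by hand, one possible workaround is to normalize the text in memory before handing it to the strict parser. The following is only a sketch under stated assumptions: the helper names `normalize` and `children_per_instance` and the root tag `instances` are hypothetical, the input is assumed to have no XML declaration, and quoted attribute values are assumed not to contain text of the form `name=value`.

```python
import re
import xml.etree.ElementTree as ET

# Matches name=value inside a tag when the value is not already quoted
BARE_ATTR = re.compile(r'(\s[\w:.-]+)=([^\s"\'<>=]+)')


def normalize(snippet, root_tag="instances"):
    """Quote bare attribute values and wrap the snippet in one root element."""
    # Only rewrite text between < and >, so element text is left alone
    quoted = re.sub(r'<[^<>]+>',
                    lambda m: BARE_ATTR.sub(r'\1="\2"', m.group(0)),
                    snippet)
    return "<{0}>{1}</{0}>".format(root_tag, quoted)


def children_per_instance(snippet):
    """Return one string of serialized child elements per <instance>."""
    root = ET.fromstring(normalize(snippet))
    return ["".join(ET.tostring(child, encoding="unicode").strip() for child in inst)
            for inst in root.iter("instance")]


raw = """<instance ab=1 >
    <a1>aa</a1>
    <a2>aa</a2>
</instance>
<instance ab=2 >
    <b1>bb</b1>
    <b2>bb</b2>
</instance>"""

for row in children_per_instance(raw):
    print(row)
```

This is a regex patch rather than a real repair tool, so it is only suitable for input that is as simple as the sample shown in the question.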

If you're not concerned with having properly formatted XML and you just want to parse a text file that looks roughly like the one you've presented, then you can definitely accomplish what you want just using shell-scripting and standard command-line tools. Here is an awk script (as requested):

#!/usr/bin/awk -f

# extract_instance_children.awk

BEGIN {
    addchild=0;
    children="";
}

{
    # Opening tag for "instance" element - set the "addchild" flag
    if($0 ~ "^ *<instance[^<>]+>") {
        addchild=1;
    }

    # Closing tag for "instance" element - reset "children" string and "addchild" flag, print children
    else if($0 ~ "^ *</instance>" && addchild == 1) {
        addchild=0;
        printf("%s\n", children);
        children="";
    }

    # Concatenating child elements - strip whitespace
    else if (addchild == 1) {
        gsub(/^[ \t]+/,"",$0);
        gsub(/[ \t]+$/,"",$0);
        children=children $0;
    }
}

To execute the script from a file, you would use a command like this one:

awk -f extract_instance_children.awk data.xml

And here is a Bash script that produces the desired output:

#!/bin/bash

# extract_instance_children.bash

# Keep track of whether or not we're inside of an "instance" element
instance=0

# Loop through the lines of the file
while read -r line; do

    # Set the instance flag to true if we come across an opening tag
    if echo "${line}" | grep -q '<instance.*>'; then
        instance=1

    # Set the instance flag to false and print a newline if we come across a closing tag
    elif echo "${line}" | grep -q '</instance>'; then
        instance=0
        echo

    # If we're inside an instance tag then print the child element
    elif [[ ${instance} == 1 ]]; then
        printf '%s' "${line}"
    fi

done < "${1}"

You would execute it like this:

bash extract_instance_children.bash data.xml

Or, going back to Python once again, you could use the Beautiful Soup package. Beautiful Soup is much more flexible in its ability to parse invalid XML than the standard Python XML module (and every other XML parser that I've come across). Here is a Python script which uses Beautiful Soup to achieve the desired result:

#!/usr/bin/env python3
"""extract_instance_children.py"""

import sys
from bs4 import BeautifulSoup as Soup

with open(sys.argv[1], 'r') as xmlfile:
    soup = Soup(xmlfile.read(), "html.parser")
    # find_all(True, recursive=False) yields only the direct child tags
    for instance in soup.find_all('instance'):
        print(''.join(str(child) for child in instance.find_all(True, recursive=False)))

Solution 2

This may be of help:

#!/bin/bash

awk -vtag=instance -vp=0 '{
if($0~("^<"tag)){p=1;next}
if($0~("^</"tag)){p=0;printf("\n");next}
if(p==1){$1=$1;printf("%s",$0)}
}' infile

Assuming the Sample text in your example is a mistake and keeping it simple.

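The p variable decides when to print: it is set to 1 on an opening tag and reset to 0 on a closing tag. The assignment $1=$1 makes awk rebuild the record from its fields, which strips leading and trailing whitespace (and collapses internal runs of whitespace into single spaces).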
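
For readers less familiar with awk, the same flag-based logic can be sketched in Python. The function name `extract_children` is made up for illustration. Like the awk command, the sketch assumes that opening and closing tags start at the beginning of a line, and `" ".join(line.split())` stands in for the effect of `$1=$1`.

```python
def extract_children(lines, tag="instance"):
    """Collect the lines between <tag ...> and </tag>, one output row per element."""
    inside = False   # plays the role of the awk variable p
    parts = []
    rows = []
    for line in lines:
        if line.startswith("<" + tag):        # opening tag at line start
            inside = True
            parts = []
        elif line.startswith("</" + tag):     # closing tag: emit one row
            inside = False
            rows.append("".join(parts))
        elif inside:
            # Split on whitespace and rejoin, like awk's $1=$1
            parts.append(" ".join(line.split()))
    return rows


sample = """<instance ab=1 >
    <a1>aa</a1>
    <a2>aa</a2>
</instance>
<instance ab=2 >
    <b1>bb</b1>
    <b2>bb</b2>
</instance>""".splitlines()

for row in extract_children(sample):
    print(row)
```

Because this works line by line rather than parsing XML, it shares the awk version's limits: child elements that span several lines are simply concatenated, and indented `<instance>` tags are not recognized.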


Abhi S

Updated on September 18, 2022

Comments

Abhi S over 1 year
Given a specific XML element (i.e. a specific tag name) and a snippet of XML data, I want to extract the children from each occurrence of that element. More specifically, I have the following snippet of (not quite valid) XML data:
```


<instance ab=1 >
    <a1>aa</a1>
    <a2>aa</a2>
</instance>
<instance ab=2 >
    <b1>bb</b1>
    <b2>bb</b2>
</instance>
<instance ab=3 >
    <c1>cc</c1>
    <c2>cc</c2>
</instance>
```
I would like a script or command which takes this data as input and produces the following output:
```
<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>
```
I would like for the solution to use standard text-processing tools such as sed or awk.

I tried using the following sed command, but it did not work:
```
sed -n '/<Sample/,/<\/Sample/p' data.xml
```
- igal over 6 years
  
  It's totally unclear what you're asking for here. Why would the sed command you're using have any effect with the data you're using? The search string you're using doesn't appear in the data. And why do you want to use sed and awk specifically? Are you sure that's a requirement?
- Kusalananda over 6 years
  
  The input file is not a properly formatted XML file. It lacks a single root element.
- G-Man Says 'Reinstate Monica' over 6 years
  
  ... and it’s indented peculiarly.
- igal over 6 years
  
  And the attribute values aren't quoted.
- igal over 6 years
  
  If your question has been resolved you should accept an answer so that the issue is closed. Otherwise this question will remain open and people may continue to submit solutions.
Abhi S over 6 years

Thanks. The bash script is working as expected. Thanks for your prompt help. My problem has been resolved, so there's no need to spend more time on it.
igal over 6 years

@AbhiS That's great! I'm not just posting for you though. I'm trying to write a clear and complete answer for anyone else who happens to come by this post.
igal over 6 years

This actually didn't work for me; I got no output at all. I changed the if($0~"\\<"tag) condition to if($0~"</"tag) and got the expected output, but not in the correct format (there was additional whitespace).
done over 6 years

@igal Maybe now, answer edited (made even simpler), spaces removed.
igal over 6 years

@Arrow Yup! Very nice. Upvoted!
igal over 6 years

@AbhiS If this solution worked for you, could you please accept it?