Extract the Children of a Specific XML Element Type
Solution 1
If you really want sed- or awk-like command-line processing for XML files then you should probably consider using an XML-processing command-line tool. Here are some of the tools that I've seen more commonly used:
You should also be aware that there are several XML-specific programming/query languages:
Note that, in order to be well-formed XML, your data needs a single root element and its attribute values must be quoted, i.e. your data file should look more like this:
<!-- data.xml -->
<instances>
    <instance ab='1'>
        <a1>aa</a1>
        <a2>aa</a2>
    </instance>
    <instance ab='2'>
        <b1>bb</b1>
        <b2>bb</b2>
    </instance>
    <instance ab='3'>
        <c1>cc</c1>
        <c2>cc</c2>
    </instance>
</instances>
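To see why the root element and the quoted attribute values matter, here is a minimal sketch that feeds three small strings to the strict parser in Python's standard library. The helper name is_well_formed and the sample strings are my own illustrations, not part of the question's data.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as XML (hypothetical helper name)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Illustrative samples: one problem each, plus a corrected version
unquoted = "<instances><instance ab=1><a1>aa</a1></instance></instances>"
no_root = ("<instance ab='1'><a1>aa</a1></instance>"
           "<instance ab='2'><b1>bb</b1></instance>")
fixed = "<instances><instance ab='1'><a1>aa</a1></instance></instances>"

for label, text in [("unquoted attribute", unquoted),
                    ("no single root", no_root),
                    ("fixed", fixed)]:
    print(label, is_well_formed(text))
```

Only the last sample should parse; a strict parser rejects the other two outright rather than guessing at what was meant.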
If your data is formatted as valid XML, then you can use XPath with xmlstarlet to get exactly what you want with a very concise command:
xmlstarlet sel -t -m '//instance' -c "./*" -n data.xml
This produces the following output:
<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>
Or you could use Python (my personal favorite choice). Here is a Python script that accomplishes the same task:
#!/usr/bin/env python3
"""extract_instance_children.py"""
import sys
import xml.etree.ElementTree as ET
# Load the data
tree = ET.parse(sys.argv[1])
root = tree.getroot()
# Extract and output the child elements
for instance in root.iter("instance"):
    # tostring() also emits the whitespace that follows each child, so strip it
    print(''.join(ET.tostring(child, encoding="unicode").strip() for child in instance))
And here is how you could run the script:
python3 extract_instance_children.py data.xml
This uses the xml package from the Python Standard Library which is also a strict XML parser.
If you're not concerned with having properly formatted XML and you just want to parse a text file that looks roughly like the one you've presented, then you can definitely accomplish what you want just using shell-scripting and standard command-line tools. Here is an awk script (as requested):
#!/usr/bin/env awk
# extract_instance_children.awk
BEGIN {
    addchild=0;
    children="";
}
{
    # Opening tag for "instance" element - set the "addchild" flag
    if($0 ~ "^ *<instance[^<>]+>") {
        addchild=1;
    }
    # Closing tag for "instance" element - reset "children" string and "addchild" flag, print children
    else if($0 ~ "^ *</instance>" && addchild == 1) {
        addchild=0;
        printf("%s\n", children);
        children="";
    }
    # Concatenating child elements - strip whitespace
    else if (addchild == 1) {
        gsub(/^[ \t]+/,"",$0);
        gsub(/[ \t]+$/,"",$0);
        children=children $0;
    }
}
To execute the script from a file, you would use a command like this one:
awk -f extract_instance_children.awk data.xml
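For readers who find the flag logic easier to follow in Python, here is a rough line-oriented port of the same idea. It is only a sketch: the function name extract_children and its tag parameter are my own, and the opening-tag pattern is slightly stricter than the awk one (it requires whitespace or ">" right after the tag name). Like the awk script, it assumes each tag sits on its own line.

```python
import re


def extract_children(lines, tag="instance"):
    """Line-based state machine; names here are illustrative, not from the awk script."""
    open_re = re.compile(r"^\s*<%s(\s[^<>]*)?>" % re.escape(tag))
    close_re = re.compile(r"^\s*</%s>" % re.escape(tag))
    inside = False      # plays the role of the awk "addchild" flag
    children = []       # child lines collected for the current element
    rows = []
    for line in lines:
        if open_re.match(line):
            inside = True
            children = []
        elif close_re.match(line) and inside:
            inside = False
            rows.append("".join(children))
        elif inside:
            children.append(line.strip())
    return rows


# Sample input modeled on the question's (not well-formed) data
sample = """<!-- data.xml -->
<instance ab=1 >
    <a1>aa</a1>
    <a2>aa</a2>
</instance>
<instance ab=2 >
    <b1>bb</b1>
    <b2>bb</b2>
</instance>
"""

for row in extract_children(sample.splitlines()):
    print(row)
```

Because it never parses the markup, this approach tolerates the unquoted attributes and the missing root element, but it breaks as soon as an element spans several lines or two tags share a line.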
And here is a Bash script that produces the desired output:
#!/bin/bash
# extract_instance_children.bash
# Keep track of whether or not we're inside of an "instance" element
instance=0
# Loop through the lines of the file ("read" strips leading and trailing whitespace)
while read -r line; do
    # Set the instance flag to true if we come across an opening tag
    if printf '%s\n' "${line}" | grep -q '<instance.*>'; then
        instance=1
    # Set the instance flag to false and print a newline if we come across a closing tag
    elif printf '%s\n' "${line}" | grep -q '</instance>'; then
        instance=0
        echo
    # If we're inside an instance tag then print the child element
    elif [[ ${instance} == 1 ]]; then
        # Use an explicit format so that "%" or "\" in the data is printed literally
        printf '%s' "${line}"
    fi
done < "${1}"
You would execute it like this:
bash extract_instance_children.bash data.xml
Or, going back to Python once again, you could use the Beautiful Soup package. Beautiful Soup is much more flexible in its ability to parse invalid XML than the standard Python XML module (and every other XML parser that I've come across). Here is a Python script which uses Beautiful Soup to achieve the desired result:
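#!/usr/bin/env python3
"""extract_instance_children.py"""
import sys
from bs4 import BeautifulSoup as Soup
# Parse leniently with the "html.parser" backend
with open(sys.argv[1], 'r') as xmlfile:
    soup = Soup(xmlfile.read(), "html.parser")
# Print the direct child elements of each "instance" element on one line
for instance in soup.find_all('instance'):
    # recursive=False avoids printing nested descendants a second time
    print(''.join(str(child) for child in instance.find_all(recursive=False)))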
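If installing Beautiful Soup isn't an option, a similar lenient approach can be sketched with the html.parser module from the standard library. To be clear, this is a swapped-in technique rather than Beautiful Soup itself, and the class name InstanceChildren and its attributes are my own. It tolerates the unquoted attributes and the missing root element, but it lowercases tag names and drops whitespace around text, so treat it as a rough illustration only.

```python
from html.parser import HTMLParser


class InstanceChildren(HTMLParser):
    """Collect the serialized children of each target element (hypothetical class)."""

    def __init__(self, target="instance"):
        super().__init__()
        self.target = target
        self.inside = False
        self.parts = []   # pieces of the current element's children
        self.rows = []    # one joined string per target element

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            self.inside, self.parts = True, []
        elif self.inside:
            # Reuse the start tag exactly as it appeared in the input
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.target and self.inside:
            self.inside = False
            self.rows.append("".join(self.parts))
        elif self.inside:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Skip the whitespace-only text between tags
        if self.inside and data.strip():
            self.parts.append(data.strip())


# Sample input modeled on the question's data
sample = ("<instance ab=1 >\n  <a1>aa</a1>\n  <a2>aa</a2>\n</instance>\n"
          "<instance ab=2 >\n  <b1>bb</b1>\n</instance>\n")

parser = InstanceChildren()
parser.feed(sample)
parser.close()
for row in parser.rows:
    print(row)
```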
Solution 2
This may be of help:
#!/bin/bash
awk -vtag=instance -vp=0 '{
if($0~("^<"tag)){p=1;next}
if($0~("^</"tag)){p=0;printf("\n");next}
if(p==1){$1=$1;printf("%s",$0)}
}' infile
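Assuming the Sample text in your example is a mistake and keeping it simple.
The p variable decides when to print. Assigning $1=$1 makes awk rebuild the line from its fields, which removes leading and trailing whitespace (and collapses any run of whitespace between fields into a single space).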
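As a rough illustration of what that field reassignment does with awk's default field separators, here is an approximate Python equivalent. The function name rebuild_record is hypothetical, and Python's split() treats a few more characters as whitespace than awk does.

```python
def rebuild_record(line):
    """Approximate awk's `$1=$1`: split on whitespace runs, rejoin with single spaces."""
    return " ".join(line.split())


print(repr(rebuild_record("    <a1>aa</a1>  ")))   # '<a1>aa</a1>'
print(repr(rebuild_record("  <a1>a   a</a1>")))    # '<a1>a a</a1>'
```

The second call shows the side effect worth knowing about: whitespace inside an element's text is normalized too, not just the indentation.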
Abhi S
Updated on September 18, 2022
Comments
-
Abhi S over 1 year
Given a specific XML element (i.e. a specific tag name) and a snippet of XML data, I want to extract the children from each occurrence of that element. More specifically, I have the following snippet of (not quite valid) XML data:
<!-- data.xml -->
<instance ab=1 >
<a1>aa</a1>
<a2>aa</a2>
</instance>
<instance ab=2 >
<b1>bb</b1>
<b2>bb</b2>
</instance>
<instance ab=3 >
<c1>cc</c1>
<c2>cc</c2>
</instance>
I would like a script or command which takes this data as input and produces the following output:
<a1>aa</a1><a2>aa</a2>
<b1>bb</b1><b2>bb</b2>
<c1>cc</c1><c2>cc</c2>
I would like for the solution to use standard text-processing tools such as sed or awk.
I tried using the following sed command, but it did not work:
sed -n '/<Sample/,/<\/Sample/p' data.xml
-
igal over 6 years
It's totally unclear what you're asking for here. Why would the sed command you're using have any effect with the data you're using? The search string you're using doesn't appear in the data. And why do you want to use sed and awk specifically? Are you sure that's a requirement?
-
Kusalananda over 6 years
The input file is not a properly formatted XML file. It lacks a single root element.
-
G-Man Says 'Reinstate Monica' over 6 years
... and it's indented peculiarly.
-
igal over 6 years
And the attribute values aren't quoted.
-
igal over 6 years
If your question has been resolved you should accept an answer so that the issue is closed. Otherwise this question will remain open and people may continue to submit solutions.
-
Abhi S over 6 years
Thanks. The bash script is working as expected. Thanks for your prompt and quick help. My problem has been resolved. No need to spend more time on it.
-
igal over 6 years
@AbhiS That's great! I'm not just posting for you though. I'm trying to write a clear and complete answer for anyone else who happens to come by this post.
-
igal over 6 years
This actually didn't work for me; I got no output at all. I changed the if($0~"\\<"tag) condition to if($0~"</"tag) and got the expected output, but not in the correct format (there was additional whitespace).
-
done over 6 years
@igal Maybe now, answer edited (made even simpler), spaces removed.
-
igal over 6 years
@Arrow Yup! Very nice. Upvoted!
-
igal over 6 years
@AbhiS If this solution worked for you, could you please accept it?