How to parse XML in Bash?
Solution 1
This is really just an explanation of Yuzem's answer, but I didn't feel like this much editing should be done to someone else's answer, and comments don't allow formatting, so...
rdom () { local IFS=\> ; read -d \< E C ;}
Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}
Okay, so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data, instead of automatically being split on spaces, tabs, or newlines, it gets split on '>'. The next line says to read input from stdin and, instead of stopping at a newline, stop when you see a '<' character (the -d flag sets the delimiter). What is read is then split on the IFS and assigned to the variables ENTITY and CONTENT. So take the following:
<tag>value</tag>
The first call to read_dom gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split by the IFS into the two fields 'tag' and 'value'. read then assigns the variables like ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and '', so read assigns ENTITY=/tag and an empty CONTENT=. The fourth call will return a non-zero status because we've reached the end of file.
Now here is his while loop, cleaned up a bit to match the above:
while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo "$CONTENT"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity then the loop repeats at the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).
Now, given the following (similar to what you get from listing a bucket on S3) as input.xml:
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <Name>sth-items</Name>
    <IsTruncated>false</IsTruncated>
    <Contents>
        <Key>[email protected]</Key>
        <LastModified>2011-07-25T22:23:04.000Z</LastModified>
        <ETag>"0032a28286680abee71aed5d059c6a09"</ETag>
        <Size>1785</Size>
        <StorageClass>STANDARD</StorageClass>
    </Contents>
</ListBucketResult>
and the following loop:
while read_dom; do
    echo "$ENTITY => $CONTENT"
done < input.xml
You should get:
=>
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
Name => sth-items
/Name =>
IsTruncated => false
/IsTruncated =>
Contents =>
Key => [email protected]
/Key =>
LastModified => 2011-07-25T22:23:04.000Z
/LastModified =>
ETag => "0032a28286680abee71aed5d059c6a09"
/ETag =>
Size => 1785
/Size =>
StorageClass => STANDARD
/StorageClass =>
/Contents =>
So if we wrote a while loop like Yuzem's:
while read_dom; do
    if [[ $ENTITY = "Key" ]]; then
        echo "$CONTENT"
    fi
done < input.xml
We'd get a listing of all the files in the S3 bucket.
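As a self-contained sketch of that loop, with the bucket listing fed through a here-document instead of input.xml (the key names below are made up, standing in for real S3 object keys):

```shell
# The Key-extracting loop from the answer, runnable without a separate file.
read_dom () { local IFS=\> ; read -d \< ENTITY CONTENT ; }

while read_dom; do
    if [[ $ENTITY = "Key" ]]; then
        echo "$CONTENT"
    fi
done <<'XML'
<ListBucketResult>
<Contents><Key>backup-2011-07-25.tar.gz</Key></Contents>
<Contents><Key>photos/holiday.jpg</Key></Contents>
</ListBucketResult>
XML
```

This prints one line per Key element, in document order.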
EDIT
If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function, like:
read_dom () {
    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    IFS=$ORIGINAL_IFS
}
Otherwise, any line splitting you do later in the script will be messed up.
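To see why, here is a small sketch of what happens to ordinary word splitting when IFS is left set to '>':

```shell
# With IFS still '>', unquoted expansion no longer splits on whitespace.
line="one two three"

IFS=\>
set -- $line      # splits on '>' only, so everything stays one field
echo "$#"         # prints 1

IFS=$' \t\n'      # restore the default space/tab/newline
set -- $line
echo "$#"         # prints 3
```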
EDIT 2
To split out attribute name/value pairs, you can augment read_dom() like so:
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}
Then write your function to parse and get the data you want like this:
parse_dom () {
    if [[ $TAG_NAME = "foo" ]]; then
        eval local $ATTRIBUTES
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]]; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}
Then, while you read_dom, call parse_dom:
while read_dom; do
    parse_dom
done
Then given the following example markup:
<example>
    <bar size="bar_size" type="metal">bars content</bar>
    <foo size="1789" type="unknown">foos content</foo>
</example>
You should get this output:
$ cat example.xml | ./bash_xml.sh
bar type is: metal
foo size is: 1789
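Putting the pieces of this edit together into one runnable sketch, with the example markup supplied via a here-document instead of example.xml:

```shell
# Self-contained version of the attribute-parsing functions from above.
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}       # first word of the entity is the tag name
    ATTRIBUTES=${ENTITY#* }      # the rest are the attribute assignments
    return $ret
}

parse_dom () {
    if [[ $TAG_NAME = "foo" ]]; then
        eval local $ATTRIBUTES   # caution: eval-ing input; see the comments below
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]]; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}

while read_dom; do
    parse_dom
done <<'XML'
<example>
<bar size="bar_size" type="metal">bars content</bar>
<foo size="1789" type="unknown">foos content</foo>
</example>
XML
```

Run as-is this prints the two lines shown above; keep in mind the eval on untrusted attribute text is risky, as several commenters point out.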
EDIT 3
Another user said they were having problems with it on FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like:
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET
}
I don't see any reason why that shouldn't work.
Solution 2
You can do that very easily using only bash. You only have to add this function:
rdom () { local IFS=\> ; read -d \< E C ;}
Now you can use rdom like read, but for HTML documents. When called, rdom will assign the element to the variable E and the content to the variable C.
For example, to do what you wanted to do:
while rdom; do
    if [[ $E = title ]]; then
        echo "$C"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
Solution 3
Command-line tools that can be called from shell scripts include:
- 4xpath - command-line wrapper around Python's 4Suite package
- xpath - command-line wrapper around Perl's XPath library (sudo apt-get install libxml-xpath-perl)
- Xidel - works with URLs as well as files; also works with JSON
I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
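For example, here is a minimal xmllint sketch for the original question's title extraction (assuming xmllint from libxml2 is installed; the guard skips if it isn't, and the sample markup is made up):

```shell
# Pull a title out with xmllint's --xpath option instead of hand parsing.
command -v xmllint >/dev/null 2>&1 || { echo "xmllint not installed" >&2; exit 0; }

printf '<html><head><title>My Page</title></head><body/></html>' |
    xmllint --xpath 'string(/html/head/title)' -
```

The `-` argument tells xmllint to read the document from stdin; `string(...)` makes it print just the text value rather than the element markup.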
Solution 4
You can use the xpath utility. It's installed with the Perl XML-XPath package.
Usage:
/usr/bin/xpath [filename] query
or XMLStarlet. To install it on openSUSE use:
sudo zypper install xmlstarlet
or try cnf xml on other platforms.
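As a sketch, the same sort of title query with XMLStarlet (assuming the xmlstarlet binary is installed; the guard skips if it isn't, and the sample document is made up):

```shell
# XMLStarlet's 'sel' subcommand runs an XPath query over a document.
command -v xmlstarlet >/dev/null 2>&1 || { echo "xmlstarlet not installed" >&2; exit 0; }

printf '<html><head><title>My Page</title></head><body/></html>' |
    xmlstarlet sel -t -v '/html/head/title' -
```

`sel -t -v EXPR` selects the value of the XPath expression; `-` reads the document from stdin.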
Solution 5
This is sufficient...
xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt
Comments
- Zombo (over 3 years): Ideally, what I would like to be able to do is:
  cat xhtmlfile.xhtml | getElementViaXPath --path='/html/head/title' | sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt
- Opher (about 13 years): Where can I download 'xpath' or '4xpath' from?
- Alex Gray (almost 13 years): Could you elaborate on this? I'd bet that it's perfectly clear to you, and this could be a great answer if I could tell what you were doing there. Can you break it down a little more, possibly generating some sample output?
- David (over 12 years): Yes, a second vote/request: where to download those tools, or do you mean one has to manually write a wrapper? I'd rather not waste time doing that unless necessary.
- obesechicken13 (almost 12 years): The listing is nice but I don't really know where to go from there. So say I wanted to put "1785", the "size", in a variable. How would I do that?
- Admin (almost 12 years): @obesechicken13 Easy, let's say your variable is named num: look at the very last while loop in chad's answer. Instead of echo $CONTENT, put num=$CONTENT.
- Admin (almost 12 years): For me, the read_dom function only works if I make the IFS global: IFS='>'. I had to remove the local.
- chad (almost 12 years): If you make IFS (the input field separator) global you should reset it back to its original value at the end; I edited the answer to do that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is that either you are running bash in a compatibility mode (like your shebang is #!/bin/sh) or it's an ancient version of bash.
- chad (almost 12 years): @obesechicken13, I added an example of parsing attributes.
- Andrew Wagner (over 11 years): sudo apt-get install libxml-xpath-perl
- Bruno von Paris (over 11 years): Using XMLStarlet is definitely a better option than writing one's own serializer (as suggested in the other answers).
- Stephen Niedzielski (about 11 years): Just because you can write your own parser doesn't mean you should.
- Alastair (almost 11 years): @chad it certainly says something about AWS' workflow/implementation that I was searching for an answer to "bash xml" to also wget the contents of an S3 bucket!
- chad (over 10 years): @Alastair I have a whole set of S3 manipulation bash scripts; I'll ask my manager if I can release them.
- chad (over 10 years): @Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects.
- Alastair (over 10 years): Grokkin' contribution, there, @chad! Checking them out now!
- William Pursell (over 10 years): Assigning IFS in a local variable is fragile and not necessary. Just do IFS=\> read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse XML, and I believe doing so is fraught with peril and ought to be avoided.)
- maverick (over 10 years): Cred to the original - this one-liner is so freakin' elegant and amazing.
- user311174 (over 10 years): Great hack, but I had to use double quotes like echo "$C" to prevent shell expansion and correct interpretation of line endings (depends on the encoding).
- khmarbaise (over 10 years): I'm trying to use the above two functions, which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0?
- khmarbaise (over 10 years): Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
- scavenger (about 9 years): Sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
- Tomalak (about 9 years): Downvoted for attempting to roll your own XML parser. This is an extremely bad idea.
- Melroy van den Berg (over 8 years): XML has a nested structure and you can have the same 'entity' names; with this approach you lose that nested structure, meaning you can't fetch the information you need, especially when the entity names are the same, e.g. <cars><car><type>Volvo</type></car><car><type>Audio</type></car></cars>. It's even worse when you want the list of all the 'cars'.
tripleee almost 8 yearsOn many systems, the
xpath
which comes preinstalled is unsuitable for use as a component in scripts. See e.g. stackoverflow.com/questions/15461737/… for an elaboration. -
rubo77 over 7 yearsOn Ubuntu/Debian
apt-get install xmlstarlet
- peterh (over 6 years): Parsing XML with grep and awk is not okay. It may be an acceptable compromise if the XMLs are simple enough and you do not have too much time, but it can never be called a good solution.
- E. Moffat (over 5 years): This is awesome when you either want to avoid installing extra packages or don't have access to them. On a build machine, I can justify an extra pip install over an apt-get or yum call. Thanks!
- tres.14159 (over 5 years): In Debian: apt-get install libxml-xpath-perl.
- Joshua Goldberg (over 3 years): Very useful tool. The link is broken (see web.archive.org/web/20160312110413/https://dan.egnor.name/xml2), but there is a working, frozen clone on GitHub: github.com/clone/xml2
- Charles Duffy (over 3 years): There are serious security problems with this approach. You don't want a password containing $(rm -rf ~) to eval that command (and if you changed your injected quotes from double to single, they could then be defeated with $(rm -rf ~)'$(rm -rf ~)').
- Charles Duffy (over 3 years): ...so, if you want to make this safe, you need to both (1) switch from injecting double quotes to single quotes; and (2) replace any literal single quotes in the data with a construct like '"'"'
- Charles Duffy (over 3 years): Also, eval "$(...)", not just eval $(...). For an example of how the latter leads to buggy results, try cmd=$'printf \'%s\\n\' \'first * line\'', and then compare the output of eval $cmd to the output of eval "$cmd" -- without the quotes, your * gets replaced with a list of files in the current directory before eval starts its parsing (meaning those filenames themselves get evaluated as code, opening even more potential room for security issues).
- phyatt (almost 3 years): xpath is great! Usage is a simple xpath -e 'xpath/expression/here' $filename, and then add a -q to only show the output so you can pipe it to other places or save it to a variable.
- Ihe Onwuka (over 2 years): Never parse XML or JSON without a proper tool is sound advice. The only exception would be if you need to stream the input because of its size.
- sean (over 2 years): Broken link for 4xpath.
- Alex Belous (almost 2 years): Really cool! But what about an example like <Dev 'path=/path/to/my/dev' />? This will result in ATTRIBUTES=path=/path/to/my/dev' /; is there an easy way to remove the /?