How to parse XML in Bash?


Solution 1

This is really just an explanation of Yuzem's answer, but I didn't feel this much editing should be done to someone else's answer, and comments don't allow formatting, so...

rdom () { local IFS=\> ; read -d \< E C ;}

Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

Okay, so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when data is read, instead of automatically being split on spaces, tabs, or newlines, it gets split on '>'. The next line says to read input from stdin and, instead of stopping at a newline, stop when you see a '<' character (the -d flag sets the delimiter). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:

<tag>value</tag>

The first call to read_dom gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That then gets split by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag and CONTENT=. The fourth call will return a non-zero status because we've reached the end of the file.
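Those four calls can be seen concretely with a small sketch that feeds the tag above through read_dom and prints what each call assigns (the variable names are the longer ones used above):

```shell
# Demonstrate what each successive read_dom call assigns.
read_dom () { local IFS=\> ; read -d \< ENTITY CONTENT ; }

printf '<tag>value</tag>' | {
    read_dom; echo "1: ENTITY='$ENTITY' CONTENT='$CONTENT'"   # 1: ENTITY='' CONTENT=''
    read_dom; echo "2: ENTITY='$ENTITY' CONTENT='$CONTENT'"   # 2: ENTITY='tag' CONTENT='value'
    read_dom; echo "3: ENTITY='$ENTITY' CONTENT='$CONTENT'"   # 3: ENTITY='/tag' CONTENT=''
}
```

The third call's read actually returns non-zero (it hits end of input before finding a '<'), but it still assigns what it managed to read, which is why a while loop over read_dom processes the last chunk and then stops.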

Now his while loop cleaned up a bit to match the above:

while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo $CONTENT
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt

The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks whether the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity then the loop repeats at the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).
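Putting the function and the loop together, here is a self-contained sketch; the here-document is a made-up stand-in for xhtmlfile.xhtml, and break is used instead of exit so the snippet can be pasted into an interactive shell:

```shell
#!/usr/bin/env bash
# Sketch: extract the <title> of a document with read_dom.
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo "$CONTENT"   # prints: My Page Title
        break
    fi
done <<'EOF'
<html><head><title>My Page Title</title></head><body></body></html>
EOF
```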

Now given the following (similar to what you get from listing a bucket on S3) for input.xml:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>sth-items</Name>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>[email protected]</Key>
    <LastModified>2011-07-25T22:23:04.000Z</LastModified>
    <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
    <Size>1785</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

and the following loop:

while read_dom; do
    echo "$ENTITY => $CONTENT"
done < input.xml

You should get:

 => 
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" => 
Name => sth-items
/Name => 
IsTruncated => false
/IsTruncated => 
Contents => 
Key => [email protected]
/Key => 
LastModified => 2011-07-25T22:23:04.000Z
/LastModified => 
ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
/ETag => 
Size => 1785
/Size => 
StorageClass => STANDARD
/StorageClass => 
/Contents => 

So if we wrote a while loop like Yuzem's:

while read_dom; do
    if [[ $ENTITY = "Key" ]] ; then
        echo $CONTENT
    fi
done < input.xml

We'd get a listing of all the files in the S3 bucket.
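To capture the keys in a variable instead of printing them, the same loop can append to a bash array. A sketch with a made-up two-key listing inlined (substitute `< input.xml` for the here-document to run it against the real file):

```shell
#!/usr/bin/env bash
# Collect every <Key> value from a bucket listing into a bash array.
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

keys=()
while read_dom; do
    if [[ $ENTITY = "Key" ]]; then
        keys+=("$CONTENT")
    fi
done <<'EOF'
<ListBucketResult>
  <Contents><Key>first-item.jpg</Key></Contents>
  <Contents><Key>second-item.jpg</Key></Contents>
</ListBucketResult>
EOF

echo "found ${#keys[@]} keys: ${keys[*]}"   # found 2 keys: first-item.jpg second-item.jpg
```

Because the here-document (or input redirection) feeds the loop directly rather than through a pipe, the loop body runs in the current shell and the array survives past done.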

EDIT If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function like:

read_dom () {
    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    IFS=$ORIGINAL_IFS
}

Otherwise, any line splitting you do later in the script will be messed up.

EDIT 2 To split out attribute name/value pairs you can augment the read_dom() like so:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}

Then write your function to parse and get the data you want like this:

parse_dom () {
    if [[ $TAG_NAME = "foo" ]] ; then
        eval local $ATTRIBUTES
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]] ; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}
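As the comments below point out, running eval on attribute text will execute arbitrary code if the XML is untrusted. A hypothetical eval-free alternative (get_attr is my name, not part of the original answer) pulls a single double-quoted attribute value out with a bash regex instead:

```shell
# Extract one double-quoted attribute value without eval.
# Usage: get_attr 'size="1789" type="unknown"' size   -> prints 1789
get_attr () {
    local attrs=$1 name=$2
    if [[ $attrs =~ $name=\"([^\"]*)\" ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    fi
}
```

Inside parse_dom you could then write size=$(get_attr "$ATTRIBUTES" size) rather than eval local $ATTRIBUTES. This only handles double-quoted attributes and assumes the attribute name contains no regex metacharacters.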

Then, in your read_dom loop, call parse_dom:

while read_dom; do
    parse_dom
done

Then given the following example markup:

<example>
  <bar size="bar_size" type="metal">bars content</bar>
  <foo size="1789" type="unknown">foos content</foo>
</example>

You should get this output:

$ cat example.xml | ./bash_xml.sh 
bar type is: metal
foo size is: 1789
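For reference, here is the whole thing assembled into one runnable bash_xml.sh sketch, reading the markup above on stdin. The eval line carries the code-injection risk discussed in the comments, so only feed it trusted XML:

```shell
#!/usr/bin/env bash
# Sketch of bash_xml.sh: read_dom with attribute splitting, plus parse_dom.
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}

parse_dom () {
    if [[ $TAG_NAME = "foo" ]]; then
        eval local "$ATTRIBUTES"      # unsafe on untrusted input
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]]; then
        eval local "$ATTRIBUTES"      # unsafe on untrusted input
        echo "bar type is: $type"
    fi
}

while read_dom; do
    parse_dom
done
```

Running `cat example.xml | ./bash_xml.sh` against the markup above should produce the two lines shown.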

EDIT 3 Another user said they were having problems with it on FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET
}

I don't see any reason why that shouldn't work.

Solution 2

You can do that very easily using only bash. You only have to add this function:

rdom () { local IFS=\> ; read -d \< E C ;}

Now you can use rdom like read, but for XML documents. When called, rdom will assign the element to the variable E and the content to the variable C.

For example, to do what you wanted to do:

while rdom; do
    if [[ $E = title ]]; then
        echo $C
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt

Solution 3

Command-line tools that can be called from shell scripts include:

  • 4xpath - command-line wrapper around Python's 4Suite package

  • XMLStarlet

  • xpath - command-line wrapper around Perl's XPath library

    sudo apt-get install libxml-xpath-perl
    
  • Xidel - Works with URLs as well as files. Also works with JSON

I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
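As a minimal sketch of the xmllint route (assuming xmllint from libxml2 is on PATH; the file name and XPath here are made up for illustration):

```shell
# Query a single value with xmllint's --xpath option (part of libxml2).
cat > /tmp/books.xml <<'EOF'
<catalog>
  <book id="bk101"><title>XML Basics</title></book>
</catalog>
EOF

xmllint --xpath 'string(/catalog/book/title)' /tmp/books.xml
```

Note that XHTML documents declare a default namespace, so a bare /html/head/title query won't match them; you need a namespace-aware query or a local-name() workaround.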

Solution 4

You can use xpath utility. It's installed with the Perl XML-XPath package.

Usage:

/usr/bin/xpath [filename] query

or XMLStarlet. To install it on opensuse use:

sudo zypper install xmlstarlet

or try cnf xml on other platforms.

Solution 5

This is sufficient...

xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt

Author: Zombo
Updated on December 04, 2020

Comments

  • Zombo
    Zombo over 3 years

    Ideally, what I would like to be able to do is:

    cat xhtmlfile.xhtml |
    getElementViaXPath --path='/html/head/title' |
    sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt
    
  • Opher
    Opher about 13 years
Where can I download 'xpath' or '4xpath' from?
  • Alex Gray
    Alex Gray almost 13 years
    could you elaborate on this? i'd bet that it's perfectly clear to you.. and this could be a great answer - if I could tell what you were doing there.. can you break it down a little more, possibly generating some sample output?
  • David
    David over 12 years
    yes, a second vote/request - where to download those tools, or do you mean one has to manually write a wrapper? I'd rather not waste time doing that unless necessary.
  • obesechicken13
    obesechicken13 almost 12 years
    The listing is nice but I don't really know where to go from there. So say I wanted to put "1785", the "size" in a variable. How would I do that?
  • Admin
    Admin almost 12 years
    @obesechicken13 Easy, let's say your variable is named num: look at the very last while loop in chad's answer. Instead of echo $CONTENT put num=$CONTENT.
  • Admin
    Admin almost 12 years
    For me, the read_dom function only works if I make the IFS global : IFS='>'. I had to remove the local.
  • chad
    chad almost 12 years
If you make IFS (the input field separator) global you should reset it back to its original value at the end; I edited the answer to have that. Otherwise, any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is that either you are using bash in a compatibility mode (like your shebang is #!/bin/sh) or it's an ancient version of bash.
  • chad
    chad almost 12 years
    @obesechicken13, I added an example of parsing attributes.
  • Andrew Wagner
    Andrew Wagner over 11 years
    sudo apt-get install libxml-xpath-perl
  • Bruno von Paris
    Bruno von Paris over 11 years
    Using xml starlet is definitely a better option than writing one's own serializer (as suggested in the other answers).
  • Stephen Niedzielski
    Stephen Niedzielski about 11 years
    Just because you can write your own parser, doesn't mean you should.
  • Alastair
    Alastair almost 11 years
    @chad it certainly says something about AWS' workflow/implementation that I was searching for an answer to "bash xml" to also wget the contents of an S3 bucket!
  • chad
    chad over 10 years
    @Alastair I have a whole set of S3 manipulation bash scripts, I'll ask my manager if I can release them.
  • chad
    chad over 10 years
    @Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
  • Alastair
    Alastair over 10 years
    Grokkin' contribution, there, @chad ! Checking them out now!
  • William Pursell
    William Pursell over 10 years
Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=\> read -d \< ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
  • maverick
    maverick over 10 years
    Cred to the original - this one-liner is so freakin' elegant and amazing.
  • user311174
    user311174 over 10 years
great hack, but i had to use double quotes like echo "$C" to prevent shell expansion and get correct interpretation of line endings (depends on the encoding)
  • khmarbaise
    khmarbaise over 10 years
    I'm trying to use the above two functions which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0?
  • khmarbaise
    khmarbaise over 10 years
    Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
  • scavenger
    scavenger about 9 years
    sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
  • Tomalak
    Tomalak about 9 years
    Downvoted for attempting to roll your own XML parser. This is an extremely bad idea.
  • Melroy van den Berg
    Melroy van den Berg over 8 years
XML has a nested structure and you can have the same 'entity' names; with this approach you lose that nested structure, meaning you can't fetch the information you need, especially when the entity names are the same, e.g. <cars><car><type>Volvo</type></car><car><type>Audio</type></car></cars>. It's even worse when you want the list of all the 'cars'.
  • tripleee
    tripleee almost 8 years
    On many systems, the xpath which comes preinstalled is unsuitable for use as a component in scripts. See e.g. stackoverflow.com/questions/15461737/… for an elaboration.
  • rubo77
    rubo77 over 7 years
    On Ubuntu/Debian apt-get install xmlstarlet
  • peterh
    peterh over 6 years
Parsing XML with grep and awk is not okay. It may be an acceptable compromise if the XML is simple enough and you don't have much time, but it can never be called a good solution.
  • E. Moffat
    E. Moffat over 5 years
This is awesome when you either want to avoid installing extra packages or don't have access to them. On a build machine, I can justify an extra pip install over an apt-get or yum call. Thanks!
  • tres.14159
    tres.14159 over 5 years
    In debian apt-get install libxml-xpath-perl .
  • Joshua Goldberg
    Joshua Goldberg over 3 years
Very useful tool. The link is broken (see web.archive.org/web/20160312110413/https://dan.egnor.name/xml2 ), but there is a working, frozen clone on github: github.com/clone/xml2
  • Charles Duffy
    Charles Duffy over 3 years
    There are serious security problems with this approach. You don't want a password containing $(rm -rf ~) to eval that command (and if you changed your injected quotes from double to single, they could then be defeated with $(rm -rf ~)'$(rm -rf ~)').
  • Charles Duffy
    Charles Duffy over 3 years
    ...so, if you want to make this safe, you need to both (1) switch from injecting double quotes to single quotes; and (2) replace any literal single quotes in the data with a construct like '"'"'
  • Charles Duffy
    Charles Duffy over 3 years
    Also, eval "$(...)", not just eval $(...). For an example of how the latter leads to buggy results, try cmd=$'printf \'%s\\n\' \'first * line\'', and then compare the output of eval $cmd to the output of eval "$cmd" -- without the quotes, your * gets replaced with a list of files in the current directory before eval starts its parsing (meaning those filenames themselves get evaluated as code, opening even more potential room for security issues).
  • phyatt
    phyatt almost 3 years
    xpath is great! Usage is a simple xpath -e 'xpath/expression/here' $filename and then add a -q to only show the output so you can pipe it to other places or save to a variable.
  • Ihe Onwuka
    Ihe Onwuka over 2 years
Never parse XML or JSON without a proper tool is sound advice. The only exception would be if you need to stream the input because of its size.
  • sean
    sean over 2 years
    Broken link for 4xpath.
  • Alex Belous
    Alex Belous almost 2 years
Really cool! But what about an example like <Dev 'path=/path/to/my/dev' />? This will result in ATTRIBUTES=path=/path/to/my/dev' /; is there an easy way to remove the trailing /?