How to parse XML in Bash?


Solution 1

This is really just an explanation of Yuzem's answer, but I didn't feel this much editing should be done to someone else's answer, and comments don't allow formatting, so...

rdom () { local IFS=\> ; read -d \< E C ;}

Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

Okay, so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when data is read, instead of automatically being split on spaces, tabs, or newlines, it gets split on '>'. The next line says to read input from stdin and, instead of stopping at a newline, stop when you see a '<' character (the -d flag sets the delimiter). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:

<tag>value</tag>

The first call to read_dom gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That then gets split by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag and CONTENT=. The fourth call will return a non-zero status because we've reached the end of the file.
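Those four calls can be seen concretely with a small sketch that feeds the tag above through read_dom and prints what each call assigns (the variable names are the longer ones used above):

```shell
# Demonstrate what each successive read_dom call assigns.
read_dom () { local IFS=\> ; read -d \< ENTITY CONTENT ; }

printf '<tag>value</tag>' | {
    read_dom; echo "1: ENTITY='$ENTITY' CONTENT='$CONTENT'"   # 1: ENTITY='' CONTENT=''
    read_dom; echo "2: ENTITY='$ENTITY' CONTENT='$CONTENT'"   # 2: ENTITY='tag' CONTENT='value'
    read_dom; echo "3: ENTITY='$ENTITY' CONTENT='$CONTENT'"   # 3: ENTITY='/tag' CONTENT=''
}
```

The third call's read actually returns non-zero (it hits end of input before finding a '<'), but it still assigns what it managed to read, which is why a while loop over read_dom processes the last chunk and then stops.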

Now his while loop cleaned up a bit to match the above:

while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo $CONTENT
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt

The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks whether the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity then the loop repeats at the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).
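Putting the function and the loop together, here is a self-contained sketch; the here-document is a made-up stand-in for xhtmlfile.xhtml, and break is used instead of exit so the snippet can be pasted into an interactive shell:

```shell
#!/usr/bin/env bash
# Sketch: extract the <title> of a document with read_dom.
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo "$CONTENT"   # prints: My Page Title
        break
    fi
done <<'EOF'
<html><head><title>My Page Title</title></head><body></body></html>
EOF
```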

Now given the following (similar to what you get from listing a bucket on S3) for input.xml:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>sth-items</Name>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>[email protected]</Key>
    <LastModified>2011-07-25T22:23:04.000Z</LastModified>
    <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
    <Size>1785</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

and the following loop:

while read_dom; do
    echo "$ENTITY => $CONTENT"
done < input.xml

You should get:

 => 
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" => 
Name => sth-items
/Name => 
IsTruncated => false
/IsTruncated => 
Contents => 
Key => [email protected]
/Key => 
LastModified => 2011-07-25T22:23:04.000Z
/LastModified => 
ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
/ETag => 
Size => 1785
/Size => 
StorageClass => STANDARD
/StorageClass => 
/Contents => 

So if we wrote a while loop like Yuzem's:

while read_dom; do
    if [[ $ENTITY = "Key" ]] ; then
        echo $CONTENT
    fi
done < input.xml

We'd get a listing of all the files in the S3 bucket.
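To capture the keys in a variable instead of printing them, the same loop can append to a bash array. A sketch with a made-up two-key listing inlined (substitute `< input.xml` for the here-document to run it against the real file):

```shell
#!/usr/bin/env bash
# Collect every <Key> value from a bucket listing into a bash array.
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

keys=()
while read_dom; do
    if [[ $ENTITY = "Key" ]]; then
        keys+=("$CONTENT")
    fi
done <<'EOF'
<ListBucketResult>
  <Contents><Key>first-item.jpg</Key></Contents>
  <Contents><Key>second-item.jpg</Key></Contents>
</ListBucketResult>
EOF

echo "found ${#keys[@]} keys: ${keys[*]}"   # found 2 keys: first-item.jpg second-item.jpg
```

Because the here-document (or input redirection) feeds the loop directly rather than through a pipe, the loop body runs in the current shell and the array survives past done.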

EDIT If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function like:

read_dom () {
    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    IFS=$ORIGINAL_IFS
}

Otherwise, any line splitting you do later in the script will be messed up.

EDIT 2 To split out attribute name/value pairs you can augment the read_dom() like so:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}

Then write your function to parse and get the data you want like this:

parse_dom () {
    if [[ $TAG_NAME = "foo" ]] ; then
        eval local $ATTRIBUTES
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]] ; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}
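As the comments below point out, running eval on attribute text will execute arbitrary code if the XML is untrusted. A hypothetical eval-free alternative (get_attr is my name, not part of the original answer) pulls a single double-quoted attribute value out with a bash regex instead:

```shell
# Extract one double-quoted attribute value without eval.
# Usage: get_attr 'size="1789" type="unknown"' size   -> prints 1789
get_attr () {
    local attrs=$1 name=$2
    if [[ $attrs =~ $name=\"([^\"]*)\" ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    fi
}
```

Inside parse_dom you could then write size=$(get_attr "$ATTRIBUTES" size) rather than eval local $ATTRIBUTES. This only handles double-quoted attributes and assumes the attribute name contains no regex metacharacters.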

Then, in your read_dom loop, call parse_dom:

while read_dom; do
    parse_dom
done

Then given the following example markup:

<example>
  <bar size="bar_size" type="metal">bars content</bar>
  <foo size="1789" type="unknown">foos content</foo>
</example>

You should get this output:

$ cat example.xml | ./bash_xml.sh 
bar type is: metal
foo size is: 1789
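For reference, here is the whole thing assembled into one runnable bash_xml.sh sketch, reading the markup above on stdin. The eval line carries the code-injection risk discussed in the comments, so only feed it trusted XML:

```shell
#!/usr/bin/env bash
# Sketch of bash_xml.sh: read_dom with attribute splitting, plus parse_dom.
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}

parse_dom () {
    if [[ $TAG_NAME = "foo" ]]; then
        eval local "$ATTRIBUTES"      # unsafe on untrusted input
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]]; then
        eval local "$ATTRIBUTES"      # unsafe on untrusted input
        echo "bar type is: $type"
    fi
}

while read_dom; do
    parse_dom
done
```

Running `cat example.xml | ./bash_xml.sh` against the markup above should produce the two lines shown.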

EDIT 3 Another user said they were having problems with it on FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET
}

I don't see any reason why that shouldn't work.

Solution 2

You can do that very easily using only bash. You only have to add this function:

rdom () { local IFS=\> ; read -d \< E C ;}

Now you can use rdom like read, but for XML documents. When called, rdom will assign the element to the variable E and the content to the variable C.

For example, to do what you wanted to do:

while rdom; do
    if [[ $E = title ]]; then
        echo $C
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt

Solution 3

Command-line tools that can be called from shell scripts include:

  • 4xpath - command-line wrapper around Python's 4Suite package

  • XMLStarlet

  • xpath - command-line wrapper around Perl's XPath library

    sudo apt-get install libxml-xpath-perl
    
  • Xidel - Works with URLs as well as files. Also works with JSON

I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
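As a minimal sketch of the xmllint route (assuming xmllint from libxml2 is on PATH; the file name and XPath here are made up for illustration):

```shell
# Query a single value with xmllint's --xpath option (part of libxml2).
cat > /tmp/books.xml <<'EOF'
<catalog>
  <book id="bk101"><title>XML Basics</title></book>
</catalog>
EOF

xmllint --xpath 'string(/catalog/book/title)' /tmp/books.xml
```

Note that XHTML documents declare a default namespace, so a bare /html/head/title query won't match them; you need a namespace-aware query or a local-name() workaround.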

Solution 4

You can use xpath utility. It's installed with the Perl XML-XPath package.

Usage:

/usr/bin/xpath [filename] query

or XMLStarlet. To install it on opensuse use:

sudo zypper install xmlstarlet

or try cnf xml on other platforms.

Solution 5

This is sufficient...

xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt

Author: Zombo
Updated on December 04, 2020

Comments

  • Zombo
    Zombo over 3 years

    Ideally, what I would like to be able to do is:

    cat xhtmlfile.xhtml |
    getElementViaXPath --path='/html/head/title' |
    sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt
    
  • Opher
    Opher about 13 years
Where can I download 'xpath' or '4xpath' from?
  • Alex Gray
    Alex Gray almost 13 years
    could you elaborate on this? i'd bet that it's perfectly clear to you.. and this could be a great answer - if I could tell what you were doing there.. can you break it down a little more, possibly generating some sample output?
  • David
    David over 12 years
    yes, a second vote/request - where to download those tools, or do you mean one has to manually write a wrapper? I'd rather not waste time doing that unless necessary.
  • obesechicken13
    obesechicken13 almost 12 years
    The listing is nice but I don't really know where to go from there. So say I wanted to put "1785", the "size" in a variable. How would I do that?
  • Admin
    Admin almost 12 years
    @obesechicken13 Easy, let's say your variable is named num: look at the very last while loop in chad's answer. Instead of echo $CONTENT put num=$CONTENT.
  • Admin
    Admin almost 12 years
    For me, the read_dom function only works if I make the IFS global : IFS='>'. I had to remove the local.
  • chad
    chad almost 12 years
If you make IFS (the input field separator) global you should reset it back to its original value at the end; I edited the answer to have that. Otherwise, any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is that either you are using bash in a compatibility mode (like your shebang is #!/bin/sh) or it's an ancient version of bash.
  • chad
    chad almost 12 years
    @obesechicken13, I added an example of parsing attributes.
  • Andrew Wagner
    Andrew Wagner over 11 years
    sudo apt-get install libxml-xpath-perl
  • Bruno von Paris
    Bruno von Paris over 11 years
    Using xml starlet is definitely a better option than writing one's own serializer (as suggested in the other answers).
  • Stephen Niedzielski
    Stephen Niedzielski about 11 years
    Just because you can write your own parser, doesn't mean you should.
  • Alastair
    Alastair almost 11 years
    @chad it certainly says something about AWS' workflow/implementation that I was searching for an answer to "bash xml" to also wget the contents of an S3 bucket!
  • chad
    chad over 10 years
    @Alastair I have a whole set of S3 manipulation bash scripts, I'll ask my manager if I can release them.
  • chad
    chad over 10 years
    @Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
  • Alastair
    Alastair over 10 years
    Grokkin' contribution, there, @chad ! Checking them out now!
  • William Pursell
    William Pursell over 10 years
Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=\> read -d \< ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
  • maverick
    maverick over 10 years
    Cred to the original - this one-liner is so freakin' elegant and amazing.
  • user311174
    user311174 over 10 years
great hack, but i had to use double quotes like echo "$C" to prevent shell expansion and get correct interpretation of line endings (depends on the encoding)
  • khmarbaise
    khmarbaise over 10 years
    I'm trying to use the above two functions which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0?
  • khmarbaise
    khmarbaise over 10 years
    Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
  • scavenger
    scavenger about 9 years
    sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
  • Tomalak
    Tomalak about 9 years
    Downvoted for attempting to roll your own XML parser. This is an extremely bad idea.
  • Melroy van den Berg
    Melroy van den Berg over 8 years
XML has a nested structure and you can have the same 'entity' names; with this approach you lose that nested structure, meaning you can't fetch the information you need, especially when the entity names are the same, e.g. <cars><car><type>Volvo</type></car><car><type>Audio</type></car></cars>. It's even worse when you want the list of all the 'cars'.
  • tripleee
    tripleee almost 8 years
    On many systems, the xpath which comes preinstalled is unsuitable for use as a component in scripts. See e.g. stackoverflow.com/questions/15461737/… for an elaboration.
  • rubo77
    rubo77 over 7 years
    On Ubuntu/Debian apt-get install xmlstarlet
  • peterh
    peterh over 6 years
Parsing XML with grep and awk is not okay. It may be an acceptable compromise if the XML is simple enough and you don't have much time, but it can never be called a good solution.
  • E. Moffat
    E. Moffat over 5 years
This is awesome when you either want to avoid installing extra packages or don't have access to them. On a build machine, I can justify an extra pip install over an apt-get or yum call. Thanks!
  • tres.14159
    tres.14159 over 5 years
    In debian apt-get install libxml-xpath-perl .
  • Joshua Goldberg
    Joshua Goldberg over 3 years
Very useful tool. The link is broken (see web.archive.org/web/20160312110413/https://dan.egnor.name/xml2 ), but there is a working, frozen clone on github: github.com/clone/xml2
  • Charles Duffy
    Charles Duffy over 3 years
    There are serious security problems with this approach. You don't want a password containing $(rm -rf ~) to eval that command (and if you changed your injected quotes from double to single, they could then be defeated with $(rm -rf ~)'$(rm -rf ~)').
  • Charles Duffy
    Charles Duffy over 3 years
    ...so, if you want to make this safe, you need to both (1) switch from injecting double quotes to single quotes; and (2) replace any literal single quotes in the data with a construct like '"'"'
  • Charles Duffy
    Charles Duffy over 3 years
    Also, eval "$(...)", not just eval $(...). For an example of how the latter leads to buggy results, try cmd=$'printf \'%s\\n\' \'first * line\'', and then compare the output of eval $cmd to the output of eval "$cmd" -- without the quotes, your * gets replaced with a list of files in the current directory before eval starts its parsing (meaning those filenames themselves get evaluated as code, opening even more potential room for security issues).
  • phyatt
    phyatt almost 3 years
    xpath is great! Usage is a simple xpath -e 'xpath/expression/here' $filename and then add a -q to only show the output so you can pipe it to other places or save to a variable.
  • Ihe Onwuka
    Ihe Onwuka over 2 years
Never parse XML or JSON without a proper tool is sound advice. The only exception would be if you need to stream the input because of its size.
  • sean
    sean over 2 years
    Broken link for 4xpath.
  • Alex Belous
    Alex Belous almost 2 years
Really cool! But what about an example like <Dev 'path=/path/to/my/dev' />? This will result in ATTRIBUTES=path=/path/to/my/dev' /; is there an easy way to remove the trailing /?