Creating bash script to parse xml file to csv


Solution 1

You've posted a query similar to your previous one. I'd again suggest using an XML parser. You could say:

xmlstarlet sel -t -m //List/Job -v @name -o "|" -v @id -n file.xml

It would return

John|1
Zack|2
Bob|3

for your sample data.

If you want it to appear as in your example, pipe the output through sed: sed "s/|/\t| /"
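As a rough sketch of that post-processing step, here is a variation that pads the separator with plain spaces rather than a tab, so it does not depend on sed understanding \t. The function name pad_pipe is made up for illustration, and the printf line only stands in for the xmlstarlet output shown above.

```shell
#!/bin/bash
# Hypothetical helper (not part of xmlstarlet or sed):
# put one space on each side of the first "|" on every input line.
pad_pipe() {
  sed 's/|/ | /'
}

# Stand-in for the xmlstarlet output from the answer above.
printf 'John|1\nZack|2\nBob|3\n' | pad_pipe
```

In real use you would put the xmlstarlet command on the left-hand side of the pipe instead of printf.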

Solution 2

Extending the xmlstarlet approach:

Given this xml file as input:

<DATA>
  <RECORD>
    <NAME>John</NAME>
    <SURNAME>Smith</SURNAME>
    <CONTACTS>
      "Smith" LTD,
      London, Mtg Str, 12,
      UK
    </CONTACTS>
  </RECORD>
</DATA>

And this script:

# Wrap every field in double quotes and double any embedded quote (CSV escaping).
xmlstarlet sel -e utf-8 -t \
  -o "NAME,SURNAME,CONTACTS" -n \
  -m //DATA/RECORD \
  -o "\"" \
  -v "str:replace(normalize-space(NAME), '\"', '\"\"')" -o "\",\"" \
  -v "str:replace(normalize-space(SURNAME), '\"', '\"\"')" -o "\",\"" \
  -v "str:replace(normalize-space(CONTACTS), '\"', '\"\"')" \
  -o "\"" \
  -n file.xml

You'll have the following output:

NAME,SURNAME,CONTACTS
"John","Smith","""Smith"" LTD, London, Mtg Str, 12, UK"

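The quoting rule used in that command (wrap each field in double quotes and double any quote inside it) can also be tried out in plain bash, which may help when checking what a field should look like. This is only a sketch: csv_quote and csv_row are made-up helper names, and it does not parse XML at all.

```shell
#!/bin/bash
# Hypothetical helper: print one value as a quoted CSV field.
csv_quote() {
  local field=$1 q='"'
  field=${field//$q/$q$q}   # double every embedded double quote
  printf '"%s"' "$field"
}

# Hypothetical helper: print all arguments as one CSV row.
csv_row() {
  local out='' sep='' f
  for f in "$@"; do
    out+="$sep$(csv_quote "$f")"
    sep=,
  done
  printf '%s\n' "$out"
}

csv_row 'John' 'Smith' '"Smith" LTD, London, Mtg Str, 12, UK'
```

Whitespace normalization is left out here; the xmlstarlet version handles that with normalize-space.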
Solution 3

Try something like this:

#!/bin/bash
# For each line, capture the name attribute, then the id, and print "name | id".
while read -r line; do
  [[ $line =~ "name=\""(.*)"\"" ]] && name="${BASH_REMATCH[1]}" &&
    [[ $line =~ "Job id=\""([^\"]+) ]] && echo "$name | ${BASH_REMATCH[1]}"
done < file

The line with John in your sample is malformed (its closing double quote is missing). With that fixed, the output is:

John | 1
Zack | 2
Bob | 3
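Since the script is fairly dense, here is a small sketch of the mechanism it relies on: the =~ operator matches a regular expression, and bash stores the capture groups in the BASH_REMATCH array. The function name parse_job is made up, and the pattern assumes a well-formed line with id appearing before name.

```shell
#!/bin/bash
# Hypothetical helper: print "name | id" for a single <Job .../> line.
parse_job() {
  local line=$1
  # Keeping the pattern in a variable avoids quoting trouble inside [[ ]].
  local re='id="([0-9]+)"[^>]*name="([^"]*)"'
  if [[ $line =~ $re ]]; then
    # BASH_REMATCH[0] is the whole match; [1] and [2] are the capture groups.
    printf '%s | %s\n' "${BASH_REMATCH[2]}" "${BASH_REMATCH[1]}"
  fi
}

parse_job '<Job id="2" name="Zack"/>'   # prints: Zack | 2
```

Lines that do not match, such as the <List> tags, simply print nothing.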

Solution 4

Using sed:

sed -nr 's/.*id=\"([0-9]*)\"[^\"]*\"(\w*).*/\2 | \1/p' file
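The -r flag and \w in that command are GNU extensions. If portability matters, a sketch along these lines should behave the same on the sample data; it uses -E, which both GNU and BSD sed accept, and a POSIX character class in place of \w. The function name jobs_with_sed is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read <Job .../> lines on stdin, write "name | id" lines.
jobs_with_sed() {
  # Group 1 is the id digits, group 2 the name; -n plus the p flag
  # prints only the lines where the substitution succeeded.
  sed -nE 's/.*id="([0-9]*)"[^"]*"([[:alnum:]_]*).*/\2 | \1/p'
}

printf '%s\n' '<List>' '<Job id="3" name="Bob"/>' '</List>' | jobs_with_sed
```

Because the name group stops at the first character that is not a letter, digit, or underscore, the malformed name="John/> line still yields John.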

Additionally, based on BroSlow's script, I merged the two matches into one:

#!/bin/bash

while read -r line; do
  [[ $line =~ id=\"([0-9]+).*name=\"([^\"|/]*) ]] && echo "${BASH_REMATCH[2]} | ${BASH_REMATCH[1]}"
done < file
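Since the goal was to save the result to a file, the same loop can be wrapped in a function that takes the input and output paths. This is only a sketch: jobs_to_csv and the file names in the usage comment are made up, and it keeps the regex-on-lines approach, so it assumes one Job element per line with id before name.

```shell
#!/bin/bash
# Hypothetical helper: jobs_to_csv INPUT OUTPUT
# Writes one "name | id" line to OUTPUT for every Job line in INPUT.
jobs_to_csv() {
  local src=$1 dst=$2 line
  # The name group stops at a quote, pipe, or slash, so the malformed
  # name="John/> line from the question is still handled.
  local re='id="([0-9]+)".*name="([^"|/]*)'
  while IFS= read -r line; do
    if [[ $line =~ $re ]]; then
      printf '%s | %s\n' "${BASH_REMATCH[2]}" "${BASH_REMATCH[1]}"
    fi
  done < "$src" > "$dst"
}

# Usage (file names are placeholders):
#   jobs_to_csv file.xml jobs.csv
```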
Author by

user3259914

Updated on June 04, 2022

Comments

  • user3259914
    user3259914 almost 2 years

    I'm trying to create a bash script to parse an xml file and save it to a csv file.

    For example:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
        <List>
        <Job id="1" name="John/>
        <Job id="2" name="Zack"/>
        <Job id="3" name="Bob"/>
    </List>
    

    I would like the script to save information into a csv file as such:

    John | 1
    Zack | 2
    Bob  | 3
    

    The name and id will be in a different cell.

    Is there any way I can do this?

  • BMW
    BMW over 10 years
    In this instance, name="John/>, there is no double quote after John, so I recommend replacing [[ $line =~ "name=\""(.*)"\"" ]] with [[ $line =~ "name=\""([^\"|/]*) ]]
  • Reinstate Monica Please
    Reinstate Monica Please over 10 years
    @BMW Thanks. I assumed the XML wouldn't be malformed, but if it is, you could do that or use something like ([A-Za-z]*)
  • Dominik
    Dominik about 8 years
    Dude, can you elaborate on that short script? I am quite confused. :) Nevertheless, it's looking crazy good.
  • Diego1974
    Diego1974 over 4 years
    This is a good, elegant solution. I did get one error, though: "compilation error: element with-param XSLT-with-param: Failed to compile select expression 'str:replace'". It was caused by an unclosed parenthesis in the normalize-space call, which should read "str:replace(normalize-space(NAME) , '\"', '\"\"')"
  • Neek
    Neek about 2 years
    Thanks for this. Anyone else extracting URLs from XML may find the &amp; isn't escaped. Fix this by adding -T after the sel command, e.g. xmlstarlet sel -T -e utf-8...... (see stackoverflow.com/questions/46255304/…)