Find XML files that contain a specific tag name and print the words between the tags

6,387

Solution 1

A simple solution would be to use sed:

find /tmp -name '*.xml' -exec sed -n 's/<Name>\([^<]*\)<\/Name>/\1/p' {} +

The regex matches the tags and prints what is between them. Without the escape characters it is easier to read:

s / <Name>([^<]*)</Name> / \1 

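The parenthesized group matches any run of characters other than "<", and the captured text is available as \1. The -n option suppresses sed's default output and the p flag prints only the lines where the substitution succeeded.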

As mentioned in the comments, this is only a simple solution: regular expressions cannot cope with all the possible variations of structured text. If the content between the tags spans multiple lines or contains other tags, it won't work and you will need a real XML parser.
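The limitation is easy to reproduce. The sketch below uses two made-up sample files (the file names and contents are assumptions for illustration only): one with the element on a single line, and one where it spans several lines.

```shell
# Create a scratch directory with two hypothetical sample files.
dir=$(mktemp -d)
printf '<Item>\n  <Name>files_with_extra_data</Name>\n</Item>\n' > "$dir/one_line.xml"
printf '<Item>\n  <Name>\n    split_over_lines\n  </Name>\n</Item>\n' > "$dir/multi_line.xml"

# Same sed expression as above: only the single-line element is matched.
result=$(find "$dir" -name '*.xml' -exec sed -n 's/<Name>\([^<]*\)<\/Name>/\1/p' {} +)
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -r "$dir"
```

Only the value from one_line.xml is printed, and it keeps the indentation that preceded the tag, because sed replaces just the matched part of the line. The element in multi_line.xml is silently skipped.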

Solution 2

For a simple XML job like this, I'd use xml2 and cut (or sed, awk, or perl).

find . -iname '*.xml' -exec bash -c 'xml2 < "$1"' bash {} \; | grep '/Name=' |
  cut -d '=' -f2-

or

find . -iname '*.xml' -exec bash -c 'xml2 < "$1"' bash {} \; | sed -n -e 's/^[^=]*\/Name=//p'

or

find . -iname '*.xml' -exec bash -c 'xml2 < "$1"' bash {} \; |
  awk -F'=' '/Name=/ {$1=""; sub(/^ /,"",$0); print }'

(The sub() function call in the awk version strips the leading space left after setting $1 to "". awk doesn't have a way of deleting fields from the input line; the best you can do is set the field to the empty string and clean up afterwards. Alternatively, split() the line into an array, skip the field(s) you don't want, and then join the rest of the array into a string for printing. awk doesn't have a join() function like perl, so you'll have to write your own.)
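A minimal sketch of that split-and-join alternative follows. The join() helper and the sample input lines are hypothetical, written here for illustration; the input only imitates the path=value lines that xml2 emits.

```shell
# Hypothetical xml2-style lines; the second value itself contains "=".
input='/Item/Name=files_with_extra_data
/Item/Name=a=b
/Item/Size=42'

result=$(printf '%s\n' "$input" | awk '
  # join(arr, first, last, sep): concatenate arr[first..last] with sep.
  function join(arr, first, last, sep,    i, out) {
    out = arr[first]
    for (i = first + 1; i <= last; i++) out = out sep arr[i]
    return out
  }
  /\/Name=/ {
    n = split($0, parts, "=")      # parts[1] is the path, the rest is the value
    print join(parts, 2, n, "=")   # rejoin with "=" so the value is unchanged
  }')
printf '%s\n' "$result"
```

Rejoining with "=" keeps any "=" characters that occur inside the value, whereas assigning to $1 makes awk rebuild the line with spaces between the fields.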

or

find . -iname '*.xml' -exec bash -c 'xml2 < "$1"' bash {} \; |
  perl -F= -lane 'if (m:/Name=:) { shift @F; print join("=", @F) }'

xml2 converts XML formatted data into a line-oriented format suitable for processing with line-oriented text utilities like awk, or sed, or perl and many others. It comes with a corresponding 2xml program which can convert that line-oriented format back to properly formatted XML.

For more complicated tasks, I'd use xmlstarlet.

xmlstarlet is an XML processing tool that you can use to list, query, extract, and modify data in XML files.
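As a rough sketch of what that could look like for this question, assuming xmlstarlet is installed under that name (some systems install it as xml) and using a made-up sample file:

```shell
# Hypothetical sample file for illustration.
dir=$(mktemp -d)
printf '<Items>\n<Item><Name>alpha</Name></Item>\n<Item>\n<Name>beta</Name>\n</Item>\n</Items>\n' > "$dir/sample.xml"

if command -v xmlstarlet >/dev/null 2>&1; then
  # -t starts a template, -m selects every Name element,
  # -v . prints its text, and -n adds a newline after each match.
  names=$(find "$dir" -name '*.xml' -exec xmlstarlet sel -t -m '//Name' -v . -n {} +)
  printf '%s\n' "$names"
else
  echo 'xmlstarlet is not installed' >&2
fi

# Clean up the scratch directory.
rm -r "$dir"
```

Because the file is actually parsed, the XPath expression //Name finds the element wherever it sits in the tree, regardless of how the XML is laid out across lines.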

Both are available as packages for Debian and other Linux distros.


IMO, the best solution is to use a language like perl or python that has an XML parsing library. xmlstarlet is great for working with XML files in shell, but constructing the command line for very complicated searches becomes more work (and much harder to read and debug) than just writing a script in perl or python to do the job. That's partly because I do a lot more programming in those languages and find them much easier to work with, but mostly because IMO it's better to concentrate your learning effort on general-purpose languages that can be used for a wide variety of tasks than on domain-specific languages and tools that can only be used for one very specific thing.


Author: yael

Updated on September 18, 2022

Comments

  • yael
    yael almost 2 years

    We can find XML files as follows:

    find /tmp/ -type f -name '*.xml'
    

    But how can the syntax be changed to find only the XML files that contain:

    <Name>some words</Name>
    

    and print what is between:

    <Name> ------ </Name>
    

    expected output

    some words      
    

    Example: the XML file contains:

    <Name>files_with_extra_data</Name>
    

    expected output

    files_with_extra_data
    
  • Weijun Zhou
    Weijun Zhou over 6 years
    Not if the XML tag can span multiple lines.
  • Blasco
    Blasco over 6 years
    @WeijunZhou True, but I think that it's not the problem that he was describing though
  • Alessio
    Alessio over 6 years
    The input is an XML file. It's always going to be a potential problem. Regular expressions cannot cope with all of the possible variations of structured text. That's why you need to use a real XML parser.
  • Blasco
    Blasco over 6 years
    @cas I added your warnings. I still think this solution might be of use to him.
  • yael
    yael over 6 years
    After your change it prints the whole XML file instead of printing the words between <Name> ... </Name>.
  • yael
    yael over 6 years
    This syntax is working: find /tmp -name '*.xml' -exec sed -n 's/<Name>\([^<]*\)<\/Name>/\1/p' {} +
  • Alessio
    Alessio over 6 years
    I just came back to add that for really complicated XML tasks, I'd use a language that has XML parsing libraries, like perl or python. Both have several to choose from, of varying complexity and capability. I'm inclined to use perl, so I'd use XML::Parser or XML::Twig or XML::Simple depending on what I needed to do. That ("use a language with an XML lib") is the correct solution if you don't have xml2 or xmlstarlet.
  • Blasco
    Blasco over 6 years
    @yael Oops, sorry, you are right, my edit broke it. I just wanted to remove the -n option to make the command easier to remember. I think it works now.
  • yael
    yael over 6 years
    The updated syntax still prints the whole XML file and not the required output: find /tmp -type f -name '*.xml' -exec sed 's/<Name>\([^<]*\)<\/Name>/\1/' {} +
  • yael
    yael over 6 years
    What is working so far is this: find /tmp -name '*.xml' -exec sed -n 's/<Name>\([^<]*\)<\/Name>/\1/p' {} +
  • yael
    yael over 6 years
    @WooWapDaBug Could you please check the last update? As you know, the first answer is working.
  • Blasco
    Blasco over 6 years
    @yael Restored to the first version. It works on my machine.
  • Scott - Слава Україні
    Scott - Слава Україні over 6 years
    “… that contain [sic] specific tag name …” I believe that a correct answer must search for Name.
  • Alessio
    Alessio over 6 years
    @Scott That somehow got dropped when I copy-pasted from my terminal and I didn't notice. Thanks for pointing it out.
  • Scott - Слава Україні
    Scott - Слава Україні almost 4 years
    Please add an explanation of how this works and how it differs from the other answers. … … … … Please do not respond in comments; edit your answer to make it clearer and more complete.