Find XML files that contain a specific tag name and print the words between the tags
Solution 1
A simple solution would be to use sed:
find /tmp -name '*.xml' -exec sed -n 's/<Name>\([^<]*\)<\/Name>/\1/p' {} +
The regex matches the tags and prints what is between them. The -n option suppresses sed's default output, and the p flag prints only the lines where the substitution succeeded. With the escape characters removed, the expression is easier to read:
s / <Name>([^<]*)</Name> / \1
The parenthesized group captures any run of characters other than "<", and the replacement refers to that capture as \1.
As mentioned in the comments, this is only a simple solution: regular expressions cannot cope with all of the possible variations of structured text. If the text between the tags spans multiple lines or contains other tags, it won't work, and you will need to use a real XML parser.
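As a rough illustration of that limitation, the sketch below runs the same sed expression against two throwaway files. The extract_names function name, the temporary directory, and the sample file contents are assumptions made up for this demo; only the find/sed command itself comes from the answer.

```shell
# extract_names is a hypothetical wrapper around the find/sed command above.
# It prints the text of every <Name>...</Name> pair that sits on one line.
extract_names() {
    find "$1" -type f -name '*.xml' \
        -exec sed -n 's/<Name>\([^<]*\)<\/Name>/\1/p' {} +
}

# Two sample files: one with the element on a single line,
# one with the element split across two lines.
demo_dir=$(mktemp -d)
printf '%s\n' '<Config>' '<Name>files_with_extra_data</Name>' '</Config>' > "$demo_dir/one.xml"
printf '%s\n' '<Config>' '<Name>split' 'value</Name>' '</Config>' > "$demo_dir/two.xml"

# Only the single-line element is found; the split one is silently skipped.
extract_names "$demo_dir"    # prints: files_with_extra_data

rm -rf "$demo_dir"
```

The second file produces no output at all, which is the failure mode the comments warn about.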
Solution 2
For a simple XML job like this, I'd use xml2 and cut (or sed, or awk, or perl).
find . -iname '*.xml' -exec bash -c 'xml2 < {}' \; | grep '/Name=' |
cut -d '=' -f2-
or
find . -iname '*.xml' -exec bash -c 'xml2 < {}' \; | sed -n -e 's/^[^=]*\/Name=//p'
or
find . -iname '*.xml' -exec bash -c 'xml2 < {}' \; |
awk -F'=' '/Name=/ {$1=""; sub(/^ /,"",$0); print }'
(The sub() function call in the awk version strips the leading space left after setting $1 to "". awk doesn't have a way of deleting fields from the input line; the best you can do is set the field to the empty string and clean up afterwards. Alternatively, split() the line into an array, delete the field(s) you don't want, and then join the array into a string for printing. awk doesn't have a join() function like perl, so you'll have to write your own.)
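The split-and-join alternative could be sketched as follows, assuming xml2-style input lines of the form /path/Name=value. The join helper and the value_after_name wrapper are hypothetical names, not part of awk or xml2.

```shell
# value_after_name is a hypothetical wrapper; it reads xml2-style lines on stdin.
value_after_name() {
    awk '
        # join(): hand-written helper, since awk has no built-in join.
        # Concatenates arr[first..last] with sep between the elements.
        function join(arr, first, last, sep,    i, out) {
            out = arr[first]
            for (i = first + 1; i <= last; i++)
                out = out sep arr[i]
            return out
        }
        /\/Name=/ {
            n = split($0, parts, "=")
            # Skip parts[1] (the path) and rejoin the rest, so any "="
            # characters inside the value are preserved.
            print join(parts, 2, n, "=")
        }
    '
}

printf '%s\n' '/Config/Name=a=b' '/Config/Other=x' | value_after_name    # prints: a=b
```

Unlike the $1="" approach, this keeps "=" characters that occur inside the value.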
or
find . -iname '*.xml' -exec bash -c 'xml2 < {}' \; |
perl -F= -lane 'if (m:/Name=:) { shift @F; print join("=", @F) }'
xml2 converts XML formatted data into a line-oriented format suitable for processing with line-oriented text utilities like awk, sed, perl, and many others. It comes with a corresponding 2xml program which can convert that line-oriented format back to properly formatted XML.
For more complicated tasks, I'd use xmlstarlet. xmlstarlet is an XML processing tool that you can use to list, query, extract, and modify data in XML files.
Both are available packaged for Debian and other Linux distros.
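A minimal sketch of what an xmlstarlet query could look like for this question, assuming the tool is installed under the name xmlstarlet (some distributions install the binary as xml) and that the element of interest is <Name>. The list_names function name and the sample file are made up for illustration.

```shell
# list_names is a hypothetical wrapper name.
# sel -t starts a template, -m '//Name' matches every <Name> element at
# any depth, -v . prints the matched element's text, -n adds a newline.
list_names() {
    find "$1" -type f -name '*.xml' \
        -exec xmlstarlet sel -t -m '//Name' -v . -n {} +
}

demo_dir=$(mktemp -d)
printf '%s\n' '<Config>' '  <Name>some words</Name>' '</Config>' > "$demo_dir/sample.xml"

# Run the query only if xmlstarlet is actually available on this machine.
if command -v xmlstarlet >/dev/null 2>&1; then
    list_names "$demo_dir"
fi

rm -rf "$demo_dir"
```

Because the match is done on the parsed document rather than on raw lines, indentation and line breaks around the tags do not affect whether the element is found.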
The best solution, IMO, is to use a language like perl or python that has an XML parsing library, and use that. xmlstarlet is great for working with XML files in shell, but constructing the command line for very complicated searches becomes more work (and much harder to read and debug) than just writing a script in perl or python to do the job. That's partly because I do a lot more programming in those languages and find them much easier to work with, but mostly because IMO it's better to concentrate your learning effort on general-purpose languages that can be used for a wide variety of tasks than on domain-specific languages/tools that can only be used for one very specific thing.
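As one possible shape for such a script, here is a small sketch that calls Python's standard xml.etree.ElementTree parser from the shell. It assumes python3 is on the PATH; the xml_names function name and the sample file are hypothetical, and collapsing whitespace inside the element text is a choice made for this demo, not something the answer specifies.

```shell
# xml_names is a hypothetical helper: it prints the text of every <Name>
# element in the files given as arguments, using a real XML parser.
xml_names() {
    python3 -c '
import sys
import xml.etree.ElementTree as ET

for path in sys.argv[1:]:
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError as err:
        # Report malformed files on stderr and keep going.
        print(f"{path}: {err}", file=sys.stderr)
        continue
    # iter() walks the whole tree, including the root element itself.
    for elem in root.iter("Name"):
        # Collapse runs of whitespace so multi-line text prints on one line.
        print(" ".join((elem.text or "").split()))
' "$@"
}

demo_file=$(mktemp)
printf '%s\n' '<Config>' '<Name>some' 'words</Name>' '</Config>' > "$demo_file"
xml_names "$demo_file"    # prints: some words
rm -f "$demo_file"
```

Here the element text is split across two lines, the case a line-based regex would miss.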
yael
Updated on September 18, 2022
Comments
-
yael almost 2 years
We can find XML files as follows:
find /tmp/ -type f -name '*.xml'
But how can I change the syntax to find only the XML files that contain:
<Name>some words</Name>
and print what is between:
<Name> ------ </Name>
expected output
some words
Example: the XML file contains:
<Name>files_with_extra_data</Name>
expected output
files_with_extra_data
-
Weijun Zhou over 6 years
Not if the XML tag can span multiple lines.
-
Blasco over 6 years
@WeijunZhou True, but I don't think that's the problem he was describing.
-
Alessio over 6 years
The input is an XML file, so it's always going to be a potential problem. Regular expressions cannot cope with all of the possible variations of structured text. That's why you need to use a real XML parser.
-
Blasco over 6 years
@cas I added your warnings. I still think this solution might be of use to him.
-
yael over 6 years
After your change it prints the whole XML file instead of printing the words between <Name> ... </Name>.
-
yael over 6 years
This syntax is working: find /tmp -name '*.xml' -exec sed -n 's/<Name>\([^<]*\)<\/Name>/\1/p' {} +
-
Alessio over 6 years
I just came back to add that for really complicated XML tasks, I'd use a language that has XML parsing libraries, like perl or python. Both have several to choose from, of varying complexity and capability. I'm inclined to use perl, so I'd use XML::Parser or XML::Twig or XML::Simple depending on what I needed to do. That ("use a language with an XML lib") is the correct solution if you don't have xml2 or xmlstarlet.
-
Blasco over 6 years
@yael Oops, sorry, you are right, my edit broke it. I just wanted to remove the -n option to make the command easier to remember. I think it works now.
-
yael over 6 years
The updated syntax still prints the whole XML file and not the required output: find /tmp -type f -name '*.xml' -exec sed 's/<Name>\([^<]*\)<\/Name>/\1/' {} +
-
yael over 6 years
What is working so far is this: find /tmp -name '*.xml' -exec sed -n 's/<Name>\([^<]*\)<\/Name>/\1/p' {} +
-
yael over 6 years
@WooWapDaBug Could you please check the last update? As you know, the first answer is working.
-
Blasco over 6 years
@yael Restored to the first version. It works on my machine.
-
Scott - Слава Україні over 6 years
“… that contain [sic] specific tag name …” I believe that a correct answer must search for Name.
-
Alessio over 6 years
@Scott That somehow got dropped when I copy-pasted from my terminal and I didn't notice. Thanks for pointing it out.
-
Scott - Слава Україні almost 4 years
Please add an explanation of how this works and how it differs from the other answers. … … … … Please do not respond in comments; edit your answer to make it clearer and more complete.