Find everything between two XML tags with RegEx

191,069

Solution 1

It is not a good idea to use regex for HTML/XML parsing...

However, if you want to do it anyway, search for the regex pattern

<primaryAddress>[\s\S]*?<\/primaryAddress>

and replace it with an empty string.
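
As a minimal sketch of that search-and-replace, assuming Python's re module (the question is about an editor, so the language, the sample string, and the customer/name elements are illustrative assumptions, not part of the original answer):

```python
import re

# Sample document; the <customer> wrapper and <name> element are made up for illustration.
xml = """<customer>
    <name>Example</name>
    <primaryAddress>
        <state>QLD</state>
        <suburb>Townsville</suburb>
    </primaryAddress>
</customer>"""

# [\s\S] matches any character, including newlines; *? stops at the first closing tag.
# The forward slash needs no escaping in Python, so it is written as a plain "/".
pattern = re.compile(r"<primaryAddress>[\s\S]*?</primaryAddress>")

# Replace every match with an empty string.
stripped = pattern.sub("", xml)
print(stripped)
```

The whitespace that surrounded the removed element stays behind, so the output keeps an otherwise empty line where the element used to be.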

Solution 2

You should be able to match it with: /<primaryAddress>(.+?)<\/primaryAddress>/

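The content between the tags will be in the first capture group. Note that in most engines the dot does not match newlines by default, so content spanning several lines needs the flag that lets the dot match newlines.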
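
A short sketch of how the capture group behaves, assuming Python's re module, where the flag that lets the dot match newlines is re.DOTALL (other engines name it differently; the sample string is made up):

```python
import re

text = "<primaryAddress>\n  <state>QLD</state>\n</primaryAddress>"

# Without DOTALL, '.' stops at newlines, so the multi-line block is not matched.
no_flag = re.search(r"<primaryAddress>(.+?)</primaryAddress>", text)

# With DOTALL, '.' also matches '\n' and group 1 holds the inner content.
with_flag = re.search(r"<primaryAddress>(.+?)</primaryAddress>", text, re.DOTALL)

print(no_flag)                     # None
print(with_flag.group(1).strip())  # <state>QLD</state>
```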

Solution 3

This method is not recommended, but if you really want to extract it with regex:

<primaryAddress[^>]*>((.|\n)*?)<\/primaryAddress>

The accepted answer returns the tags as well, but this returns only the value between the tags (in capture group 1).

Solution 4

This can capture most outermost pairs of tags, even ones with attributes inside or without end tags:

(<!--((?!-->).)*-->|<\w*((?!\/<).)*\/>|<(?<tag>\w+)[^>]*>(?>[^<]|(?R))*<\/\k<tag>\s*>)

Edit: as mentioned in the comments above, regex is never enough to parse XML. Modifying the regex to fit more situations only makes it longer, and it is still inadequate.
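
For comparison, here is a rough sketch of the parser-based route, assuming Python's standard xml.etree.ElementTree module is available for the task (none of the answers use it; the sample document is made up). It copes with attributes and with nested primaryAddress elements, which the regex patterns above do not:

```python
import xml.etree.ElementTree as ET

# Sample document; the <customer> wrapper and <name> element are made up for illustration.
xml = """<customer>
    <name>Example</name>
    <primaryAddress type="PHYSICAL">
        <state>QLD</state>
        <primaryAddress><state>NSW</state></primaryAddress>
    </primaryAddress>
</customer>"""

root = ET.fromstring(xml)

# Element.remove() only removes direct children, so visit every element
# and drop any child named primaryAddress (attributes and nesting do not matter).
# Note: this cannot remove the root element itself.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == "primaryAddress":
            parent.remove(child)

result = ET.tostring(root, encoding="unicode")
print(result)
```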

Author: Doz

Updated on July 22, 2022

Comments

  • Doz
    Doz almost 2 years

    In RegEx, I want to find the tag and everything between two XML tags, like the following:

    <primaryAddress>
        <addressLine>280 Flinders Mall</addressLine>
        <geoCodeGranularity>PROPERTY</geoCodeGranularity>
        <latitude>-19.261365</latitude>
        <longitude>146.815585</longitude>
        <postcode>4810</postcode>
        <state>QLD</state>
        <suburb>Townsville</suburb>
        <type>PHYSICAL</type>
    </primaryAddress>
    

    I want to find the tag and everything between primaryAddress, and erase that.

    Everything between the primaryAddress tags is variable, but I want to remove the entire tag and its sub-tags whenever I encounter primaryAddress.

    Anyone have any idea how to do that?

  • Gianluca Ghettini
    Gianluca Ghettini over 11 years
    Just for curiosity's sake: why is it not a good idea to use regex for HTML/XML parsing?
  • Doz
    Doz over 11 years
    Yeah, I just want to find it using TextMate; I'm not doing this in code or anything. But the example you gave me doesn't work. There is a space after <primaryAddress> and before </primaryAddress>.
  • Ωmega
    Ωmega over 11 years
    @Doz - I don't know what syntax TextMate uses. Your question does not mention any specific tool and is tagged with regex, so I have posted a general regex solution that works with the majority of regex tools and programming languages. If you need further help, I suggest you post a new question where you are more specific about your requirements...
  • Doz
    Doz over 11 years
    Omega, I just wanted to get generic information on regex. I only said I use TextMate in response to people marking down my question because it's a bad idea to use regex. I know it is a bad idea, but I am using it in a different context.
  • Ωmega
    Ωmega over 11 years
    @Doz - So then you got the general information in my answer... Good luck!
  • Seth
    Seth almost 9 years
    Just in case you don't recognize it, *? means match everything up to the first occurrence of </primaryAddress> (non-greedy match). This is important if your file has multiple <primaryAddress> elements in it. Thanks, @Ωmega.
  • JMM
    JMM over 8 years
    This worked great for me, but anyone using this needs to be aware that it can't handle nested tags, i.e., a primaryAddress node that is a descendant of another primaryAddress node. So make sure that's not a possibility in your XML document.
  • Magnilex
    Magnilex over 8 years
    @Ωmega Agreed that regex and XML are not best friends. However, I just replaced 40-50 tags with an empty line in my IDE (IntelliJ IDEA) in about 5 seconds with help from your answer. In cases like this, regex on XML can be useful.
  • Dima Naychuk
    Dima Naychuk almost 7 years
    Great, this also works when there are newline characters inside the tag body. To also catch tags with attributes, e.g. <primaryAddress isValid=True>, I would suggest a small update: <primaryAddress.*?>[\\s\\S]*?</primaryAddress>
  • Ωmega
    Ωmega almost 7 years
    @DimaNaychuk - In such case use <primaryAddress[^>]*>[\s\S]*?<\/primaryAddress>
  • Andrii Karaivanskyi
    Andrii Karaivanskyi over 6 years
    Apparently it won't work even for the example in the question. .+ does not match carriage return symbols.
  • doublesharp
    doublesharp over 6 years
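    You would use the flag that makes the dot match newlines (called dotall or single-line in most engines; Ruby calls it the multi-line flag).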
  • Crashalot
    Crashalot almost 4 years
    @Seth thanks for the non-greedy match tip! Why use [\s\S]*? instead of .*? here?
  • Seth
    Seth almost 4 years
    @Crashalot the dot might not match a newline character. See the regex docs for your platform / language.
  • Crashalot
    Crashalot almost 4 years
    @Seth thanks for the reply! yes just discovered this. :)
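
The points raised in these comments (non-greedy matching, tags with attributes, and the nested-tag limitation) can be checked with a small sketch. It assumes Python's re module and made-up sample data; the pattern is the [^>]* variant suggested by Ωmega:

```python
import re

# Two sibling elements, the second one with an attribute (made-up sample data).
xml = (
    "<a>1</a>\n"
    "<primaryAddress>\n<state>QLD</state>\n</primaryAddress>\n"
    "<b>2</b>\n"
    '<primaryAddress isValid="true">\n<state>NSW</state>\n</primaryAddress>\n'
    "<c>3</c>\n"
)

lazy = re.compile(r"<primaryAddress[^>]*>[\s\S]*?</primaryAddress>")
greedy = re.compile(r"<primaryAddress[^>]*>[\s\S]*</primaryAddress>")

# Lazy: each element is matched on its own, so <b> survives.
lazy_result = lazy.sub("", xml)
# Greedy: one match runs from the first opening tag to the last closing tag, so <b> is lost.
greedy_result = greedy.sub("", xml)

# Nested elements with the same name defeat the lazy pattern:
# the match ends at the inner closing tag and leaves the outer one behind.
nested = "<primaryAddress><primaryAddress>x</primaryAddress></primaryAddress>"
nested_result = lazy.sub("", nested)

print(lazy_result)
print(greedy_result)
print(nested_result)  # </primaryAddress>
```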