Regular expression to remove XML tags and their content

Solution 1

If you just want to remove all the tags from the string, use this (C#):

try {
    yourstring = Regex.Replace(yourstring, "(<[be]pt[^>]+>.+?</[be]pt>)", "");
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

EDIT:

I decided to add on to my solution with a better option. The previous option would not work if there were embedded tags. This new solution should strip all <bpt> and <ept> tags, including the case where one is embedded in the other. It uses a back reference to the original [be] match so that only the closing tag with the same name ends the match. Note that a tag nested inside another tag of the same name still pairs with the first closing tag it finds, so that case is not fully handled. This solution also creates a reusable Regex object, so the pattern does not have to be re-parsed on every iteration:

try {
    // Build the pattern once and reuse it on every pass
    Regex regex = new Regex(@"<([be])pt[^>]+>.+?</\1pt>");
    // Repeat until no matching tag pair remains
    while (regex.IsMatch(yourstring)) {
        yourstring = regex.Replace(yourstring, "");
    }
} catch (ArgumentException) {
    // Syntax error in the regular expression
}
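
To make the effect of the back reference concrete, here is a minimal sketch that runs both patterns on the same input. The sample string, the class name and the method names are assumptions made for illustration; they are not from the original question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Pattern from the first version: any closing bpt/ept tag ends the match.
    static readonly Regex Loose = new Regex(@"(<[be]pt[^>]+>.+?</[be]pt>)");

    // Pattern from the edit: \1 requires the closing tag to repeat the captured b or e.
    static readonly Regex Paired = new Regex(@"<([be])pt[^>]+>.+?</\1pt>");

    public static string StripLoose(string input)
    {
        return Loose.Replace(input, "");
    }

    public static string StripPaired(string input)
    {
        return Paired.Replace(input, "");
    }

    public static void Main()
    {
        // Hypothetical sample: a bpt element embedded in an ept element.
        string sample = "Hello <ept i=\"1\">x<bpt i=\"2\">y</bpt>z</ept> world";

        Console.WriteLine(StripLoose(sample));  // Hello z</ept> world
        Console.WriteLine(StripPaired(sample)); // Hello  world
    }
}
```

The first pattern stops at the inner </bpt> and leaves a stray </ept> behind, while the second keeps scanning until it reaches the closing tag that matches the opening one.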

ADDITIONAL NOTES:

In the comments, a user worried that the '.' pattern would be CPU intensive. That can be true of a greedy '.+', which first consumes everything up to the end of the line and then backtracks until the rest of the pattern matches. The lazy quantifier '+?' instead expands one character at a time and stops at the first position where the rest of the pattern matches. I use RegexBuddy as a regex development tool; it includes a debugger that lets you compare the relative performance of different regex patterns. It can also auto-comment your regexes, so I have included those comments here to explain the regex used above:

// <([be])pt[^>]+>.+?</\1pt>
//
// Match the character "<" literally «<»
// Match the regular expression below and capture its match into backreference number 1 «([be])»
//    Match a single character present in the list "be" «[be]»
// Match the characters "pt" literally «pt»
// Match any character that is not a ">" «[^>]+»
//    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the character ">" literally «>»
// Match any single character that is not a line break character «.+?»
//    Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
// Match the characters "</" literally «</»
// Match the same text as most recently matched by backreference number 1 «\1»
// Match the characters "pt>" literally «pt>»
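
The lazy quantifier also changes what gets matched, not only how much work the engine does. The following is a small sketch, assuming a made-up sample string with two separate tag pairs; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVersusGreedyDemo
{
    // Returns the text of the first match of the pattern in the input.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Hypothetical sample with two independent bpt pairs on one line.
        string sample = "<bpt i=\"1\">a</bpt> mid <bpt i=\"2\">b</bpt>";

        // Lazy: stops at the first closing tag.
        Console.WriteLine(FirstMatch(@"<([be])pt[^>]+>.+?</\1pt>", sample));
        // <bpt i="1">a</bpt>

        // Greedy: runs to the end of the line, then backtracks to the last closing tag.
        Console.WriteLine(FirstMatch(@"<([be])pt[^>]+>.+</\1pt>", sample));
        // <bpt i="1">a</bpt> mid <bpt i="2">b</bpt>
    }
}
```

With the greedy version, the text between the two tag pairs would be removed as well, which is usually not what is wanted here.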

Solution 2

I presume you want to drop the tag entirely?

(<bpt .*?>.*?</bpt>)|(<ept .*?>.*?</ept>)

The ? after the * makes it non-greedy, so it will try to match as few characters as possible.

One problem you'll have is nested tags: with one <bpt> inside another, the pattern would not see the second closing tag because the first one already matched.
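
The nesting problem can be reproduced with a short sketch. The input string, class name and method name below are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedTagDemo
{
    // The pattern from this answer, unchanged.
    static readonly Regex Tags =
        new Regex(@"(<bpt .*?>.*?</bpt>)|(<ept .*?>.*?</ept>)");

    public static string Strip(string input)
    {
        return Tags.Replace(input, "");
    }

    public static void Main()
    {
        // Hypothetical sample: one bpt element nested inside another.
        string sample = "x <bpt i=\"1\">a<bpt i=\"2\">b</bpt>c</bpt> y";

        // The outer opening tag pairs with the inner closing tag,
        // so the outer closing tag is left behind.
        Console.WriteLine(Strip(sample)); // x c</bpt> y
    }
}
```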

Solution 3

Why do you say the overhead is too large? Did you measure it? Or are you guessing?

Using a regex instead of a proper parser is a shortcut that you may run afoul of when someone comes along with something like <bpt foo="bar>">
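
As a rough illustration of that pitfall, the sketch below applies only the opening-tag part of the regex to such an input. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedBracketDemo
{
    // Returns what the opening-tag part of the pattern considers to be the tag.
    public static string OpeningTag(string input)
    {
        return Regex.Match(input, @"<bpt[^>]+>").Value;
    }

    public static void Main()
    {
        string sample = "<bpt foo=\"bar>\">text</bpt>";

        // [^>]+ stops at the ">" inside the attribute value,
        // so the "tag" ends in the middle of the quoted string.
        Console.WriteLine(OpeningTag(sample)); // <bpt foo="bar>
    }
}
```

The regex has no notion of quoted attribute values, so it treats the first ">" it sees as the end of the tag, whereas a parser would read the attribute as a whole.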

Author by

ensnare

Updated on August 15, 2020

Comments


    • Torsten Marek
      Torsten Marek over 15 years
      What do you mean by filter? Extract? Remove? Please clarify.
    • Aaron Fischer
      Aaron Fischer over 15 years
      What is the reason for avoiding an XML parser?
    • Vincent
      Vincent over 15 years
      Small strings need to be filtered, so XML Parser overhead is not acceptable. Filter is remove in that case.
    • John Fiala
      John Fiala over 15 years
      Can a <bpt> be nested inside of an <ept>? Or vice-versa? That complicates the problem if so.
    • Torsten Marek
      Torsten Marek over 15 years
      If there's arbitrary nesting, there is no general solution involving regexes, and with limited nesting your regexes get really huge and really ugly.
  • Torsten Marek
    Torsten Marek over 15 years
    Well, using a regex or some other crutch is the only thing you can do when you have non-wellformed XML. The markup in the question is not XML, it has intersecting hierarchies.
  • e-satis
    e-satis over 15 years
    Nice one, except for the use of ".", which is pretty CPU intensive; that matters if you process a big XML file. You could just replace it with "[^<>]", couldn't you?
  • e-satis
    e-satis over 15 years
    Sorry, for the subtag, you just can't. Better use "[^ø]" instead.