Regular expression for getting text between XML elements

13,475

Solution 1

Escaping is only needed for literal characters, but some languages (Java among them) also use \ as the escape character inside string literals, so you have to write \\ in the string to get a single \ in regex land. A literal \ in the regex, which is \\ as a regex, then becomes \\\\ in the source code of such languages. I think this is the cause of the confusion when you see \\ in example code.
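To make the two layers of escaping concrete, here is a small Java sketch. The class and method names (EscapingDemo, isWord, isSingleBackslash) are made up for illustration; only java.util.regex.Pattern comes from the standard library.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // Java source "\\w+" is the three-character regex \w+ at runtime.
    static final String WORD = "\\w+";
    // Java source "\\\\" is the two-character regex \\, which matches one literal backslash.
    static final String BACKSLASH = "\\\\";

    static boolean isWord(String s) {
        return Pattern.matches(WORD, s);
    }

    static boolean isSingleBackslash(String s) {
        return Pattern.matches(BACKSLASH, s);
    }

    public static void main(String[] args) {
        System.out.println(WORD.length());           // 3 characters: \ w +
        System.out.println(isWord("tag1"));          // true
        System.out.println(isSingleBackslash("\\")); // true: the input is one backslash
    }
}
```

A regex tester takes the pattern with no string-literal layer in between, so there you would type \w+ rather than \\w+.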

Improving the regex:

If someone wanted to be difficult and write irregularly spaced markup like:

< _some_tag some="stuff" >
    some <strong>content</strong>
< / _some_tag >

You can use this more generic regex that will capture the tag name, content and attributes.

<\s*([A-Za-z_]\w*)\b\s*([^>]*)>(.*?)<\s*\/\s*\1\s*>

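Note that the non-greedy .*? is required in case the same tag occurs again further down the page; a greedy .* would capture everything up to the last closing tag with that name. Because . does not match line breaks by default, content that spans several lines (as in the example above) also needs the "dot matches newline" flag, e.g. Pattern.DOTALL in Java. Also, <tag1>blah</tag2> is obviously bogus, but if you wanted to allow it, you could replace the \1 back reference at the end of the regex with another tag-name pattern.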
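Here is a rough Java sketch of how the three groups could be read back. The pattern follows the generic regex above, written so that the whitespace before the back reference and the attribute text are both optional and a \b ends the tag name. GenericTagDemo and firstElement are hypothetical names, and trimming the captured text is my own choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GenericTagDemo {
    // Group 1: tag name, group 2: raw attribute text, group 3: content.
    // DOTALL lets '.' cross line breaks so multi-line content is captured.
    static final Pattern ELEMENT = Pattern.compile(
            "<\\s*([A-Za-z_]\\w*)\\b\\s*([^>]*)>(.*?)<\\s*/\\s*\\1\\s*>",
            Pattern.DOTALL);

    /** Returns {name, attributes, content} for the first element found, or null. */
    static String[] firstElement(String input) {
        Matcher m = ELEMENT.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2).trim(), m.group(3).trim() };
    }

    public static void main(String[] args) {
        String input = "< _some_tag some=\"stuff\" >\n"
                + "    some <strong>content</strong>\n"
                + "< / _some_tag >";
        String[] parts = firstElement(input);
        System.out.println(parts[0]); // _some_tag
        System.out.println(parts[1]); // some="stuff"
        System.out.println(parts[2]); // some <strong>content</strong>
    }
}
```

This still treats the attributes as one opaque string; splitting them into name/value pairs would need a second pass.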

Solution 2

Use:

<(\w*)>.*</(\w*)>

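Typed into a regex tester, the doubled backslashes in your version are themselves regex syntax:

\\w – a literal \, then w
\\ – a literal \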

Solution 3

Your basic problem is that your regex is "greedy", meaning it will match from the first tag to the last, including nested tags. To make it non-greedy, use the non-greedy syntax .*? (instead of .*).

The other problem is you need to match your tags - use a "back reference": \1 means "the first captured group".

This regex should do it:

<(\w+)>.*?</\1>

It uses a non-greedy match between matching open/close tags.

Although you are working in Java, I left out the escaping of backslashes as \\ to keep the regexes readable.
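For what it's worth, here is a minimal Java sketch of the difference, with the backslashes doubled as a Java string literal requires. GreedyDemo, findAll and the sample input are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    /** Collects every substring of input matched by regex. */
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "<a>one</a><b>two</b><a>three</a>";
        // Greedy: .* runs to the last </a>, swallowing everything in between.
        System.out.println(findAll("<(\\w+)>.*</\\1>", input));
        // Non-greedy: .*? stops at the first closing tag that matches the opening one.
        System.out.println(findAll("<(\\w+)>.*?</\\1>", input));
    }
}
```

The greedy version reports the whole input as a single match; the non-greedy one reports the three elements separately.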

Solution 4

Like every other attempt to access XML using regular expressions, your attempt is wrong. Wrong in two ways: it won't match every legal way of writing this piece of XML (have you checked where spaces are allowed in tags?), and it will match some things that it shouldn't (e.g. stuff that looks like XML but is inside a comment or CDATA section).

Now there are cases where wrong code is acceptable, e.g. if you're screen-scraping and are happy with an 80% success rate. But if that's the case, you need to state it as an explicit requirement on the solution.

The reason you'll never get a 100% success rate is that XML is not a regular language. That's a technical term. Some basic computer science theory tells you that regular expressions can only be used to process regular languages.

You'll probably find that using an XML parser is faster anyway. I once had a system that was performing 30 times too slowly and fixed the problem by replacing regex code with proper parsing.
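As a sketch of what "proper parsing" might look like in Java, the snippet below uses the JDK's built-in DOM parser (javax.xml.parsers). ParserDemo, textOf and the sample document are assumptions made for illustration, and this is only one of several parser APIs you could pick.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class ParserDemo {
    /** Returns the text content of the first element named tagName, or null if absent. */
    static String textOf(String xml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            // Malformed input ends up here instead of silently half-matching.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The comment contains something that looks like an element; the parser ignores it.
        String xml = "<root><!-- <tag1>not real</tag1> --><tag1 >blah</tag1 ></root>";
        System.out.println(textOf(xml, "tag1")); // blah
    }
}
```

The parser copes with the whitespace inside the tags and skips the look-alike inside the comment, which are the two failure modes described above.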

Solution 5

This would work:

<[^>]*>[^<]*<[^>]*>

It matches on the absence of angle brackets between the tags. But both this regex and yours would also match

<tag1>blah</tag2>

but for XML you would presumably want the opening and closing tags to match, i.e.

<tag1>blah</tag1>

In that case you would need a solution with back references. See this SO question for details

This example uses back references

<([^>]*)>[^<]*</\1>

so would match

<tag1>blah</tag1>

but not

<tag1>blah</tag2>

I know that's not what you asked, but I think it is what you want for XML tag matching.
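A quick way to check both patterns from Java, as a sketch only: BackReferenceDemo and its field names are made up, and the \1 is doubled because it sits inside a Java string literal.

```java
import java.util.regex.Pattern;

class BackReferenceDemo {
    // Any opening tag, text without '<', then any tag at all.
    static final Pattern ANY_TAGS = Pattern.compile("<[^>]*>[^<]*<[^>]*>");
    // Same idea, but the closing tag must repeat the captured opening name.
    static final Pattern SAME_TAGS = Pattern.compile("<([^>]*)>[^<]*</\\1>");

    public static void main(String[] args) {
        String matched = "<tag1>blah</tag1>";
        String mismatched = "<tag1>blah</tag2>";
        System.out.println(ANY_TAGS.matcher(mismatched).matches());  // true
        System.out.println(SAME_TAGS.matcher(matched).matches());    // true
        System.out.println(SAME_TAGS.matcher(mismatched).matches()); // false
    }
}
```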

Author by

dublintech

Updated on June 28, 2022

Comments

  • dublintech
    dublintech almost 2 years

    I am looking at this regular expression:

    <(\\w*)>\\.*</(\\w*)>
    

    Going through tutorials etc., I understand it to mean: match anything that has the form

    <tag1>blah</tag1>
    

    i.e. an opening XML tag, some text and a closing XML tag. However, when I run it in various regular expression checkers, for example Expresso, it does not match what I think it should.

    Note: to complicate matters further, this regular expression is in Java, which as I understand it means there are some subtle differences.

    What am I missing?

    Anything appreciated...

    Thanks