Finding HTML strings in document

Solution 1

DO NOT PARSE HTML USING Regular Expressions!!!


Instead, use the HTML Agility Pack.

For example:

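// Requires the HtmlAgilityPack NuGet package.
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load(...);

// Enumerates every <p> element at any depth in the document.
var pTags = doc.DocumentNode.Descendants("p");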

EDIT: You can do this even if the document isn't actually HTML.

Solution 2

Using a regex for this is not the best idea. I suggest reading this thread:

RegEx match open tags except XHTML self-contained tags

Solution 3

While others have said that you shouldn't be doing this with regular expressions, the reason yours is failing is that there is more HTML between your <p> tags. Your character class [^\>] excludes >, so the match cannot get past the > of any inner tag and never reaches </p>.
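The failure described here can be reproduced with a small sketch. The pattern is the one from the question; the class name, method name, and sample inputs are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OriginalPatternDemo
{
    // The pattern from the question, unchanged.
    private static readonly Regex Original =
        new Regex(@"\<p\>([^\>]*)\</p\>", RegexOptions.IgnoreCase);

    // Returns how many <p>...</p> spans the original pattern finds.
    public static int CountMatches(string input)
    {
        return Original.Matches(input).Count;
    }

    public static void Main()
    {
        // No inner tags: [^\>]* can run all the way up to </p>.
        Console.WriteLine(CountMatches("<p>plain text</p>"));            // prints 1

        // Inner tag: the > of <b> stops [^\>]*, so </p> is never reached.
        Console.WriteLine(CountMatches("<p>some <b>bold</b> text</p>")); // prints 0
    }
}
```

If every paragraph in the source contains at least one inner tag, this would explain getting no results at all.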

Solution 4

@"(?is)<p>(?>(?:(?!</?p>).)*)</p>"

(?:(?!</?p>).)* matches one character at a time, after doing a lookahead to make sure it isn't part of a <p> or </p> tag.

(?>...) is an atomic group; it prevents backtracking that we know would be pointless.

(?is) is an inline mechanism for specifying match modifiers, in this case IgnoreCase and Singleline. Singleline lets . match linefeeds and carriage returns between the tags; such line breaks would be redundant in HTML, but you did say it's not really HTML.

By the way, < and > have no special meaning in regexes, so there's no need to escape them. In fact, in some flavors you can give them special meanings by escaping them: \< and \> mean "beginning of word" and "end of word" respectively. But in .NET regexes the backslashes are just clutter.
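As a minimal sketch of how the pattern above might be used, the following collects every match with Regex.Matches. The pattern is copied unchanged; the class name, method name, and sample input are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphFinder
{
    // Pattern from this answer; (?is) turns on IgnoreCase and Singleline inline.
    private static readonly Regex Paragraph =
        new Regex(@"(?is)<p>(?>(?:(?!</?p>).)*)</p>");

    // Returns the full text of each match, including the <p> and </p> tags.
    public static List<string> FindParagraphs(string input)
    {
        var results = new List<string>();
        foreach (Match m in Paragraph.Matches(input))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical input: upper-case tags, an inner <b> tag, and a line break.
        string input = "<P>first <b>bold</b></P>\nnoise <p>second\nline</p>";
        List<string> found = FindParagraphs(input);

        Console.WriteLine(found.Count); // prints 2
        Console.WriteLine(found[0]);    // prints <P>first <b>bold</b></P>
    }
}
```

Unlike the pattern in the question, this one steps over inner tags such as <b>, because the lookahead only stops at <p> or </p>.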

Solution 5

The approach of using a regex to match HTML elements is destined to fail. A regular expression is not capable of reliably matching an HTML element. It's possible to build a more complex HTML element than your regex can match.

For example, I could beat your regex with the following:

<p>hello<p>again</p></p>

Instead of using a regex, you need to use an HTML (or potentially an XML) parser / DOM. This is the only way to reliably query an HTML file.
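To make the nested example concrete, here is a small sketch that runs the regex from the question against it. The class and method names are hypothetical; only the pattern and the input string come from this page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedParagraphDemo
{
    // The pattern from the question, unchanged.
    private static readonly Regex Original =
        new Regex(@"\<p\>([^\>]*)\</p\>", RegexOptions.IgnoreCase);

    // Returns the full text of every match the pattern finds.
    public static List<string> FindAll(string input)
    {
        var results = new List<string>();
        foreach (Match m in Original.Matches(input))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        List<string> found = FindAll("<p>hello<p>again</p></p>");

        Console.WriteLine(found.Count); // prints 1
        Console.WriteLine(found[0]);    // prints <p>again</p>
    }
}
```

Only the inner element is reported; the outer <p>hello...</p> is never matched, because the pattern has no way to pair each opening tag with its own closing tag.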

Author: inutan

Software developer with experience in MVC, C# and SQL Server.

Updated on June 29, 2022

Comments

  • inutan, almost 2 years ago

    I want to get all HTML <p>...</p> elements in a document.
    I am using Regex to find all such strings:

    Regex regex = new Regex(@"\<p\>([^\>]*)\</p\>", RegexOptions.IgnoreCase);
    

    But I am not getting any results. Is there anything wrong with my regular expression?

    For now, I just want to get everything that comes between the <p>...</p> tags, and I want to use Regex for this because the source is not an HTML document.