Using regex to get text between multiple HTML tags


Solution 1

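Replace your pattern with a non-greedy (lazy) match. The greedy .* in your pattern runs to the last > and the last </div> in the input, so everything collapses into a single match: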

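// Requires: using System; using System.Text.RegularExpressions;
static void Main(string[] args)
{
    string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
    // ".*?" is lazy: it stops at the first ">" and the first "</div>" instead of the last.
    string pattern = "<div.*?>(.*?)<\\/div>";

    MatchCollection matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: {0}", matches.Count);

    // Group 1 is the text between the tags; the tags themselves are not captured.
    foreach (Match m in matches)
        Console.WriteLine("Inner DIV: {0}", m.Groups[1].Value);

    Console.ReadLine();
}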

Solution 2

The other answers did not call out HTML tags with attributes, so here is my solution that handles them:

// <TAG(.*?)>(.*?)</TAG>
// Example
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
Console.Write(m.Groups[2].Value); // will print -> World

Solution 3

I think this code should work:

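// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
string htmlSource = "<div>first html tag</div><div>another tag</div>";
// [^>]*? skips any attributes; Singleline lets "." match newlines inside a div.
string pattern = @"<div[^>]*?>(.*?)</div>";
MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
List<string> contents = new List<string>();
foreach (Match match in matches)
{
    contents.Add(match.Groups[1].Value);
}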

Solution 4

The short version is that you cannot do this correctly in all situations. There will always be cases of valid HTML for which a regular expression will fail to extract the information you want.

The reason is that HTML is not a regular language. Matching arbitrarily nested tags requires at least a context-free grammar, which is a strictly more powerful class than regular expressions can describe.

Here's an example -- what if you have nested divs?

<div><div>stuff</div><div>stuff2</div></div>

Depending on whether their quantifiers are greedy or lazy, regexes like the ones in the other answers can end up grabbing fragments such as:

<div><div>stuff</div>
<div>stuff</div>
<div>stuff</div><div>stuff2</div>
<div>stuff</div><div>stuff2</div></div>
<div>stuff2</div>
<div>stuff2</div></div>

because that's what regular expressions do when they try to parse HTML.

You can't write a regular expression that understands how to interpret all of the cases, because regular expressions are incapable of doing so. If you are dealing with a very specific constrained set of HTML, it may be possible, but you should keep this fact in mind.

More information: https://stackoverflow.com/a/1732454/2022565
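
To make the nesting problem concrete, here is a minimal sketch that runs a lazy div pattern (the same shape as the ones in the other answers) and collects capture group 1 of each match. The class and method names (NestedDivDemo, ExtractDivContents) are made up for illustration and are not from any of the answers.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedDivDemo
{
    // Hypothetical helper: returns capture group 1 of every lazy <div> match.
    public static List<string> ExtractDivContents(string html)
    {
        var results = new List<string>();
        // Lazy (.*?) stops at the FIRST </div> it sees, regardless of nesting depth.
        foreach (Match m in Regex.Matches(html, @"<div[^>]*>(.*?)</div>", RegexOptions.Singleline))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

Called with <div><div>stuff</div><div>stuff2</div></div>, this returns two strings: "<div>stuff" (the outer opening tag paired with the first inner closing tag) and "stuff2". The outer div is never matched as a unit, which is the failure described above.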

Solution 5

Have you looked at the Html Agility Pack (see https://stackoverflow.com/a/857926/618649)?

CsQuery also looks pretty useful (basically use CSS selector-style syntax to get the elements). See https://stackoverflow.com/a/11090816/618649.

CsQuery is basically meant to be "jQuery for C#," which is pretty much the exact search criteria I used to find it.

If you could do this in a web browser, you could easily use jQuery, with syntax similar to $("div").each(function(idx){ alert(idx + ": " + $(this).text()); }); (only you would output the result to the log or the screen, or make a web service call with it, or whatever else you need to do with it).
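
As a rough illustration of the parser-based approach, here is a sketch that uses only the built-in System.Xml.Linq API rather than Html Agility Pack or CsQuery. This is a stand-in, not either library's API, and it assumes the input is well-formed XML, which real-world HTML often is not. The names DivTextDemo and ExtractDivText are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class DivTextDemo
{
    // Hypothetical helper; assumes the markup is well-formed XML (XHTML-like).
    public static List<string> ExtractDivText(string wellFormedHtml)
    {
        // Wrap in a synthetic root so several top-level divs parse as one document.
        XElement root = XElement.Parse("<root>" + wellFormedHtml + "</root>");
        // Descendants walks the tree in document order; Value is the element's text content.
        return root.Descendants("div").Select(d => d.Value).ToList();
    }
}
```

For the input from the question, <div>first html tag</div><div>another tag</div>, this returns "first html tag" and "another tag". Because the parser tracks nesting, an outer div is returned as its own element instead of being cut off at the first closing tag. For markup that is not well-formed, a tolerant HTML parser such as the libraries mentioned above is the better fit.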

Author: Ben

Updated on February 06, 2020

Comments

  • Ben, over 4 years ago

    Using regex, I want to be able to get the text between multiple DIV tags. For instance, the following:

    <div>first html tag</div>
    <div>another tag</div>
    

    Would output:

    first html tag
    another tag
    

    The regex pattern I am using only matches my last div tag and misses the first one. Code:

        static void Main(string[] args)
        {
            string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
            string pattern = "(<div.*>)(.*)(<\\/div>)";
    
            MatchCollection matches = Regex.Matches(input, pattern);
            Console.WriteLine("Matches found: {0}", matches.Count);
    
            if (matches.Count > 0)
                foreach (Match m in matches)
                    Console.WriteLine("Inner DIV: {0}", m.Groups[2]);
    
            Console.ReadLine();
        }
    

    Output:

    Matches found: 1

    Inner DIV: This is ANOTHER test

  • Ben, about 11 years ago
    It found both of the matches, but my program displays empty values for them.
  • coolmine, about 11 years ago
    The above code should work. Note that it's m.Groups[1] and not m.Groups[2], because I changed the pattern a bit: there is no reason to capture the tag itself. rubular.com/r/XQrcobmfAK
  • Craig Tullis, over 7 years ago
    A downvote without any explanation or comment. Thanks! The fact is that HTML/XML are notoriously a pain in the neck to deal with using Regex. Not that you can't do it, and I certainly have on numerous occasions, but CSS selector syntax is a much cleaner proposition.