Using regex to get text between multiple HTML tags
Solution 1
Replace your pattern with a non-greedy match:
static void Main(string[] args)
{
    string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
    // ".*?" is non-greedy: it matches as little as possible before the next ">" or "</div>".
    string pattern = "<div.*?>(.*?)<\\/div>";
    MatchCollection matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: {0}", matches.Count);
    if (matches.Count > 0)
        foreach (Match m in matches)
            // Group 1 is the text between the opening and closing tags.
            Console.WriteLine("Inner DIV: {0}", m.Groups[1]);
    Console.ReadLine();
}
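To make the difference concrete, here is a minimal sketch that runs a greedy and a non-greedy pattern over the same input as in the question. The class name GreedyVsLazy, the helper Capture, and the named group "inner" are made up for this illustration; only Regex.Matches and the patterns themselves come from the discussion above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Returns the text captured by the named group "inner" for every match.
    public static List<string> Capture(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            results.Add(m.Groups["inner"].Value);
        return results;
    }

    public static void Main()
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";

        // Greedy: ".*" runs to the end of the input and backtracks only as far as it must,
        // so the opening-tag part swallows the whole first div.
        string greedy = @"<div.*>(?<inner>.*)</div>";

        // Lazy: ".*?" stops at the first position where the rest of the pattern can match.
        string lazy = @"<div.*?>(?<inner>.*?)</div>";

        Console.WriteLine(string.Join(" | ", Capture(input, greedy)));
        // prints: This is ANOTHER test

        Console.WriteLine(string.Join(" | ", Capture(input, lazy)));
        // prints: This is a test | This is ANOTHER test
    }
}
```

The greedy result is the same single match reported in the question, which is why only the last div showed up there.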
Solution 2
Since the other answers don't explicitly discuss HTML tags with attributes, here is my solution for that case:
// <TAG(.*?)>(.*?)</TAG>
// Example
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
Console.Write(m.Groups[2].Value); // will print -> World
Solution 3
I think this code should work:
string htmlSource = "<div>first html tag</div><div>another tag</div>";
string pattern = @"<div[^>]*?>(.*?)</div>";
MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
ArrayList l = new ArrayList();
foreach (Match match in matches)
{
    l.Add(match.Groups[1].Value);
}
Solution 4
The short version is that you cannot do this correctly in all situations. There will always be cases of valid HTML for which a regular expression will fail to extract the information you want.
The reason is that describing HTML's nested structure takes at least a context-free grammar, which is a strictly more expressive class than the regular languages that regular expressions can match.
Here's an example: what if you have nested divs?
<div><div>stuff</div><div>stuff2</div></div>
Depending on the pattern and whether it is greedy, regexes like the ones in the other answers can end up grabbing any of:
<div><div>stuff</div>
<div>stuff</div>
<div>stuff</div><div>stuff2</div>
<div>stuff</div><div>stuff2</div></div>
<div>stuff2</div>
<div>stuff2</div></div>
because that's what regular expressions do when they try to parse HTML.
You can't write a regular expression that understands how to interpret all of the cases, because regular expressions are incapable of doing so. If you are dealing with a very specific constrained set of HTML, it may be possible, but you should keep this fact in mind.
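As a small illustration of the nesting problem, the sketch below runs a non-greedy pattern of the kind used in the other answers over the nested example. The class name NestedDivDemo and the helper InnerTexts are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedDivDemo
{
    // Collects capture group 1 of a non-greedy div pattern for every match.
    public static List<string> InnerTexts(string html)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<div[^>]*>(.*?)</div>", RegexOptions.Singleline))
            results.Add(m.Groups[1].Value);
        return results;
    }

    public static void Main()
    {
        foreach (string s in InnerTexts("<div><div>stuff</div><div>stuff2</div></div>"))
            Console.WriteLine("[" + s + "]");
        // prints:
        // [<div>stuff]
        // [stuff2]
    }
}
```

The first capture pairs the outer opening tag with the inner closing tag, so it contains a stray `<div>`, and the outer div is never returned as a unit. The pattern has no way to count how many divs are currently open.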
More information: https://stackoverflow.com/a/1732454/2022565
Solution 5
Have you looked at the Html Agility Pack (see https://stackoverflow.com/a/857926/618649)?
CsQuery also looks pretty useful (basically use CSS selector-style syntax to get the elements). See https://stackoverflow.com/a/11090816/618649.
CsQuery is basically meant to be "jQuery for C#," which is pretty much the exact search criteria I used to find it.
If you could do this in a web browser, you could easily use jQuery, with syntax similar to $("div").each(function(idx){ alert(idx + ": " + $(this).text()); }); (only you would obviously output the result to the log, or the screen, or make a web service call with it, or whatever you need to do with it).
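For comparison, a rough sketch of the same task with the Html Agility Pack might look like the following. It assumes the HtmlAgilityPack NuGet package is installed; the class name AgilityPackDemo and the sample HTML are made up for this illustration.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

public static class AgilityPackDemo
{
    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div>first html tag</div><div class=\"something\">another tag</div>");

        // SelectNodes takes an XPath expression and returns null when nothing matches.
        HtmlNodeCollection divs = doc.DocumentNode.SelectNodes("//div");
        if (divs == null)
            return;

        foreach (HtmlNode div in divs)
            Console.WriteLine(div.InnerText); // text content of the div, markup removed
    }
}
```

Because the library builds a document tree first, attributes and nesting are handled by the parser instead of by the pattern.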
Ben
Updated on February 06, 2020
Comments
-
Ben over 4 years
Using regex, I want to be able to get the text between multiple DIV tags. For instance, the following:
<div>first html tag</div> <div>another tag</div>
Would output:
first html tag another tag
The regex pattern I am using only matches my last div tag and misses the first one. Code:
static void Main(string[] args)
{
    string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
    string pattern = "(<div.*>)(.*)(<\\/div>)";
    MatchCollection matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: {0}", matches.Count);
    if (matches.Count > 0)
        foreach (Match m in matches)
            Console.WriteLine("Inner DIV: {0}", m.Groups[2]);
    Console.ReadLine();
}
Output:
Matches found: 1
Inner DIV: This is ANOTHER test
-
Ben about 11 years
It found both of the matches, but it displays empty values in my program.
-
coolmine about 11 years
The above code should work. Note that it's m.Groups[1] and not m.Groups[2], as I changed the pattern a bit since there is no reason to capture the tag itself. rubular.com/r/XQrcobmfAK
-
Craig Tullis over 7 years
A downvote without any explanation or comment. Thanks! The fact is that HTML/XML are notoriously a pain in the neck to deal with using Regex. Not that you can't do it, and I certainly have on numerous occasions, but CSS selector syntax is a much cleaner proposition.