Regex to extract the contents of a <div> tag
Solution 1
Your regex works for your example. There are some improvements that should be made, though:
<div[^<>]*class="entry"[^<>]*>(?<content>.*?)</div>
[^<>]* means "match any number of characters except angle brackets", ensuring that we don't accidentally break out of the tag we're in.
.*? (note the ?) means "match any number of characters, but only as few as possible". This avoids matching everything from the first <div class="entry"> tag to the last closing </div> on your page.
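A minimal sketch of the greedy versus lazy difference, assuming a made-up two-entry input; the module name LazyVersusGreedy and the helper CountMatches are hypothetical names used only for illustration:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyVersusGreedy
    ' Hypothetical helper: counts the matches a pattern finds in the input.
    ' RegexOptions.Singleline lets . match newlines as well.
    Function CountMatches(pattern As String, input As String) As Integer
        Return Regex.Matches(input, pattern, RegexOptions.Singleline).Count
    End Function

    Sub Main()
        ' Made-up sample with two entries on separate lines.
        Dim html As String =
            "<div class=""entry"">first</div>" & vbLf &
            "<div class=""entry"">second</div>"

        ' Greedy .* runs on to the last </div>, so both entries collapse into one match.
        Console.WriteLine(CountMatches("<div[^<>]*class=""entry""[^<>]*>.*</div>", html))

        ' Lazy .*? stops at the first </div>, so each entry is matched separately.
        Console.WriteLine(CountMatches("<div[^<>]*class=""entry""[^<>]*>.*?</div>", html))
    End Sub
End Module
```

On this sample the greedy pattern yields a single match spanning both divs, while the lazy pattern yields two.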
But your regex itself should still have matched something. Perhaps you're not using it correctly?
I don't know Visual Basic, so this is just a shot in the dark, but RegexBuddy suggests the following approach:
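' Requires Imports System.Text.RegularExpressions at the top of the file.
' SubjectString holds the downloaded HTML; ResultList is a List(Of String).
' RegexOptions.Singleline lets . match newlines, so a div that spans several lines still matches.
Dim RegexObj As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>", RegexOptions.Singleline)
Dim MatchResult As Match = RegexObj.Match(SubjectString)
While MatchResult.Success
    ' Store the text between the opening and closing div tags.
    ResultList.Add(MatchResult.Groups("content").Value)
    MatchResult = MatchResult.NextMatch()
End While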
I would recommend against taking the regex approach any further than this. If you insist, you'll end up with a monster regex like the following, which will only work if the form of the div's contents never varies:
<div[^<>]*class="entry"[^<>]*>\s*
<span[^<>]*class="title"[^<>]*>\s*
(?<title>.*?)
\s*</span>\s*
<span[^<>]*class="description"[^<>]*>\s*
<strong>\s*Address:\s*</strong>\s*
(?<address>.*?)
\s*<strong>\s*Telephone:\s*</strong>\s*
(?<phone>.*?)
\s*</span>\s*</div>
or (behold the joy of multiline strings in VB.NET):
Dim RegexObj As New Regex(
    "<div[^<>]*class=""entry""[^<>]*>\s*" & chr(10) & _
    "<span[^<>]*class=""title""[^<>]*>\s*" & chr(10) & _
    "(?<title>.*?)" & chr(10) & _
    "\s*</span>\s*" & chr(10) & _
    "<span[^<>]*class=""description""[^<>]*>\s*" & chr(10) & _
    "<strong>\s*Address:\s*</strong>\s*" & chr(10) & _
    "(?<address>.*?)" & chr(10) & _
    "\s*<strong>\s*Telephone:\s*</strong>\s*" & chr(10) & _
    "(?<phone>.*?)" & chr(10) & _
    "\s*</span>\s*</div>",
    RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)
(Of course, you now also need to store the values of MatchResult.Groups("title"), MatchResult.Groups("address") and MatchResult.Groups("phone").)
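One way to store those three values might look like the sketch below. The module EntryScraper, the class Entry and the function ParseEntries are hypothetical names that do not appear in the answer, and the pattern is the same brittle one written as a single concatenated string, so IgnorePatternWhitespace is not needed here:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module EntryScraper
    ' Hypothetical container for one scraped entry.
    Class Entry
        Public Property Title As String
        Public Property Address As String
        Public Property Phone As String
    End Class

    ' Returns one Entry per <div class="entry"> block that has the expected layout.
    Function ParseEntries(html As String) As List(Of Entry)
        Dim pattern As String =
            "<div[^<>]*class=""entry""[^<>]*>\s*" &
            "<span[^<>]*class=""title""[^<>]*>\s*(?<title>.*?)\s*</span>\s*" &
            "<span[^<>]*class=""description""[^<>]*>\s*" &
            "<strong>\s*Address:\s*</strong>\s*(?<address>.*?)" &
            "\s*<strong>\s*Telephone:\s*</strong>\s*(?<phone>.*?)" &
            "\s*</span>\s*</div>"

        Dim results As New List(Of Entry)
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.Singleline)
            ' Copy the named groups into the hypothetical Entry container.
            results.Add(New Entry With {
                .Title = m.Groups("title").Value,
                .Address = m.Groups("address").Value,
                .Phone = m.Groups("phone").Value})
        Next
        Return results
    End Function
End Module
```

On the sample div from the question, the address group keeps the trailing <br /> tag ("Some address <br />"), because the pattern only stops at the next <strong>. That is one more sign of how brittle this approach is.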
Solution 2
Try using RegexOptions.Multiline instead of RegexOptions.Singleline
Thanks to @Tim for pointing out that the above doesn't work... my bad.
@Tim's answer is a good one, and should be the accepted answer, but another thing that is stopping your code from working is that your pattern has no capturing group, so there is nothing for Groups(1) to return.
Change...
MsgBox(successfulMatch.Groups(1).ToString)
To...
MsgBox(successfulMatch.Groups(0).ToString)
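A small sketch of why the index matters, using the pattern from the question on a shortened, made-up input; the module GroupIndexDemo and the helper GroupText are hypothetical names:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupIndexDemo
    ' Hypothetical helper: returns the text of the group at the given index for the first match.
    Function GroupText(input As String, pattern As String, index As Integer) As String
        Dim m As Match = Regex.Match(input, pattern, RegexOptions.Singleline)
        Return m.Groups(index).Value
    End Function

    Sub Main()
        Dim html As String = "<div class=""entry"">Some company</div>"
        ' The question's pattern has no parentheses, so it defines no capturing groups.
        Dim pattern As String = "<div.*?class=""entry"".*?>.*</div>"

        ' Group 0 always holds the whole match.
        Console.WriteLine(GroupText(html, pattern, 0))

        ' Group 1 does not exist here; .NET returns an unsuccessful group with an empty Value
        ' rather than throwing, which is why the message box came up blank.
        Console.WriteLine(GroupText(html, pattern, 1) = "")
    End Sub
End Module
```

To get only the text between the tags rather than the whole match, add a capturing group to the pattern and read that group instead.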
Mrk Fldig
Updated on July 05, 2022
Comments
-
Mrk Fldig almost 2 years
I'm having a bit of a brain freeze here, so I was hoping for some pointers. Essentially I need to extract the contents of a specific div tag. Yes, I know that regex usually isn't approved of for this, but it's a simple web scraping application where there are no nested divs.
I'm trying to match this:
<div class="entry">
  <span class="title">Some company</span>
  <span class="description">
    <strong>Address: </strong>Some address <br /><strong>Telephone: </strong> 01908 12345
  </span>
</div>
Simple VB code is as follows:
Dim myMatches As MatchCollection
Dim myRegex As New Regex("<div.*?class=""entry"".*?>.*</div>", RegexOptions.Singleline)
Dim wc As New WebClient
Dim html As String = wc.DownloadString("http://somewebaddress.com")
RichTextBox1.Text = html
myMatches = myRegex.Matches(html)
MsgBox(html)
'Search for all the words in a string
Dim successfulMatch As Match
For Each successfulMatch In myMatches
    MsgBox(successfulMatch.Groups(1).ToString)
Next
Any help would be greatly appreciated.
-
Richard almost 12 years
Possible duplicate of RegEx match open tags except XHTML self-contained tags
-
Tim Pietzcker almost 12 years
And what's wrong with the regex you have? It matches your input.
-
Mrk Fldig almost 12 years
Well, that's the odd bit: it's not matching anything on the entire page, and there are about 20 of those divs on there.
-
freefaller almost 12 years
I know that @Tim has answered this in a much better way than I could, but for your future reference, there is no 2nd group, so Groups(1) (which is base-0 index) will always return an empty string... it should be Groups(0)
-
Tim Pietzcker almost 12 years
Careful, this matches all div tags (not just those with class="entry"), and it matches everything from the very first opening <div> to the very last closing </div>.
-
Mrk Fldig almost 12 years
Used <div.*?class=""entry"".*?>(?<divBody>.*)</div> and it's not working. As Tim said, it should match everything, but apparently it doesn't.
-
Mrk Fldig almost 12 years
You, my friend, are a star! If I wanted to get each element inside that div, i.e. the span class values, I'd just do .*?<span[^<>]*class="title" after the closing > of the div tag?
-
freefaller almost 12 years
The reason I believe the original code is not picking up anything is that it should be Groups(0) instead of Groups(1)
-
Tim Pietzcker almost 12 years
@MarcFielding: I have edited my answer: The named capturing group (?<content>.*?) will capture everything between the divs.
-
Mrk Fldig almost 12 years
@freefaller Yeah, I noticed that one. I was actually using a breakpoint and examining the match collection to see if it was picking up anything.
-
Mrk Fldig almost 12 years
I'm going to mark Tim's answer as the correct one, although I wouldn't mind knowing how to extract the values of each span so I can pull the company name, address and phone number, if you're feeling energetic, Tim?
-
Tim Pietzcker almost 12 years
@MarcFielding: This is only (reasonably) possible if the spans are always in the same order, and it's going to be messy in any case. Regular expressions are really the wrong tool for this. For example, how can you tell when an address is over? I'll post a (brittle) example regex that will work on your example, but that will likely fail on anything that looks a bit different.
-
Mrk Fldig almost 12 years
Thanks Tim, the components within that div are always in the same order. I could always split them by ">" into an array and do some substringing, but I was wondering if there was an easier way.
-
Mrk Fldig almost 12 years
Tim, that's superb and saved me loads of time. If I could do anything more than tick the box, I would. Superb!
-
Tim Pietzcker almost 12 years
@MarcFielding: Great to hear it. A suggestion: RegexBuddy is a great tool for constructing, debugging and learning regexes. It will pay for itself within days in terms of increased productivity.