Regex to extract the contents of a <div> tag

regex vb.net html

24,001

Solution 1

Your regex works for your example. There are some improvements that should be made, though:

<div[^<>]*class="entry"[^<>]*>(?<content>.*?)</div>

[^<>]* means "match any number of characters except angle brackets", ensuring that we don't accidentally break out of the tag we're in.

.*? (note the ?) means "match any number of characters, but only as few as possible". This avoids matching everything from the first opening <div class="entry"> tag to the last closing </div> tag on your page.
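To make the greedy/lazy difference concrete, here is a minimal sketch. The module name, the MatchTexts helper and the one-line sample input are made up for illustration; only the patterns come from this thread.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyVsLazyDemo
    ' Hypothetical sample input: two adjacent "entry" divs.
    Public Const SampleHtml As String = "<div class=""entry"">first</div><div class=""entry"">second</div>"

    ' Returns the text of every match of the given pattern in the input.
    Public Function MatchTexts(pattern As String, input As String) As String()
        Dim found As MatchCollection = Regex.Matches(input, pattern, RegexOptions.Singleline)
        Dim results(found.Count - 1) As String
        For i As Integer = 0 To found.Count - 1
            results(i) = found(i).Value
        Next
        Return results
    End Function

    Sub Main()
        ' Greedy: .* runs on to the last </div>, so both divs come back as one match.
        Dim greedyMatches As String() = MatchTexts("<div[^<>]*class=""entry""[^<>]*>.*</div>", SampleHtml)
        ' Lazy: .*? stops at the first </div>, giving one match per div.
        Dim lazyMatches As String() = MatchTexts("<div[^<>]*class=""entry""[^<>]*>.*?</div>", SampleHtml)

        Console.WriteLine("Greedy matches: " & greedyMatches.Length) ' 1
        Console.WriteLine("Lazy matches: " & lazyMatches.Length)     ' 2
    End Sub
End Module
```

With about 20 such divs on a page, the greedy version would hand back one huge match instead of 20 small ones.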

But your regex itself should still have matched something. Perhaps you're not using it correctly?

I don't know Visual Basic, so this is just a shot in the dark, but RegexBuddy suggests the following approach:

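' Assumes Imports System.Text.RegularExpressions, a String named SubjectString
' holding the HTML, and a List(Of String) named ResultList.
' RegexOptions.Singleline lets . match newlines, so the div's contents may span lines.
Dim RegexObj As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>", RegexOptions.Singleline)
Dim MatchResult As Match = RegexObj.Match(SubjectString)
While MatchResult.Success
    ResultList.Add(MatchResult.Groups("content").Value)
    MatchResult = MatchResult.NextMatch()
End While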

I would recommend against taking the regex approach any further than this. If you insist, you'll end up with a monster regex like the following, which will only work if the form of the div's contents never varies:

<div[^<>]*class="entry"[^<>]*>\s*
<span[^<>]*class="title"[^<>]*>\s*
(?<title>.*?)
\s*</span>\s*
<span[^<>]*class="description"[^<>]*>\s*
<strong>\s*Address:\s*</strong>\s*
(?<address>.*?)
\s*<strong>\s*Telephone:\s*</strong>\s*
(?<phone>.*?)
\s*</span>\s*</div>

or (behold the joy of multiline strings in VB.NET):

Dim RegexObj As New Regex(
    "<div[^<>]*class=""entry""[^<>]*>\s*" & chr(10) & _
    "<span[^<>]*class=""title""[^<>]*>\s*" & chr(10) & _
    "(?<title>.*?)" & chr(10) & _
    "\s*</span>\s*" & chr(10) & _
    "<span[^<>]*class=""description""[^<>]*>\s*" & chr(10) & _
    "<strong>\s*Address:\s*</strong>\s*" & chr(10) & _
    "(?<address>.*?)" & chr(10) & _
    "\s*<strong>\s*Telephone:\s*</strong>\s*" & chr(10) & _
    "(?<phone>.*?)" & chr(10) & _
    "\s*</span>\s*</div>", 
    RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)

(Of course, now you need to store the results for MatchResult.Groups("title") etc...)
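One way to store those results, as a rough sketch only: the Entry class, the ExtractEntries function and the module name are hypothetical and not part of the answer. The pattern is the one above, concatenated without line breaks, and the sample is the markup from the question.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module EntryExtractionSketch
    ' Hypothetical container for one scraped entry.
    Public Class Entry
        Public Title As String
        Public Address As String
        Public Phone As String
    End Class

    ' Same pattern as above, joined without line breaks,
    ' so RegexOptions.IgnorePatternWhitespace is not needed here.
    Public Const EntryPattern As String = _
        "<div[^<>]*class=""entry""[^<>]*>\s*" & _
        "<span[^<>]*class=""title""[^<>]*>\s*" & _
        "(?<title>.*?)" & _
        "\s*</span>\s*" & _
        "<span[^<>]*class=""description""[^<>]*>\s*" & _
        "<strong>\s*Address:\s*</strong>\s*" & _
        "(?<address>.*?)" & _
        "\s*<strong>\s*Telephone:\s*</strong>\s*" & _
        "(?<phone>.*?)" & _
        "\s*</span>\s*</div>"

    ' The sample markup from the question.
    Public Const SampleHtml As String = _
        "<div class=""entry"">" & vbLf & _
        "  <span class=""title"">Some company</span>" & vbLf & _
        "  <span class=""description"">" & vbLf & _
        "  <strong>Address: </strong>Some address" & vbLf & _
        "    <br /><strong>Telephone: </strong> 01908 12345" & vbLf & _
        "  </span>" & vbLf & _
        "</div>"

    ' Collects one Entry per matching div.
    Public Function ExtractEntries(html As String) As List(Of Entry)
        Dim entries As New List(Of Entry)
        Dim entryRegex As New Regex(EntryPattern, RegexOptions.Singleline)
        Dim m As Match = entryRegex.Match(html)
        While m.Success
            Dim item As New Entry()
            item.Title = m.Groups("title").Value
            item.Address = m.Groups("address").Value
            item.Phone = m.Groups("phone").Value
            entries.Add(item)
            m = m.NextMatch()
        End While
        Return entries
    End Function

    Sub Main()
        For Each item As Entry In ExtractEntries(SampleHtml)
            Console.WriteLine(item.Title & " | " & item.Phone)
        Next
    End Sub
End Module
```

Note that on the question's sample the address group also picks up the <br /> that sits in front of the Telephone label, so the address value would still need some cleanup. That is one more sign of how brittle this approach is.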

Solution 2

~~Try using RegexOptions.Multiline instead of RegexOptions.Singleline~~

Thanks to @Tim for pointing out that the above doesn't work... my bad.
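For anyone wondering why that suggestion can't work: RegexOptions.Multiline only changes what ^ and $ match, while RegexOptions.Singleline is the option that lets . match a newline. A small sketch to check this, where the module name, the Finds helper and the shortened sample input are my own assumptions:

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module SinglelineVsMultilineDemo
    Public Const Pattern As String = "<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>"

    ' Hypothetical input: the div body spans several lines, as in the question.
    Public Const SampleHtml As String = "<div class=""entry"">" & vbLf & "  Some company" & vbLf & "</div>"

    ' True if the pattern matches the sample under the given options.
    Public Function Finds(options As RegexOptions) As Boolean
        Return Regex.IsMatch(SampleHtml, Pattern, options)
    End Function

    Sub Main()
        Console.WriteLine(Finds(RegexOptions.None))       ' False: . stops at the newline
        Console.WriteLine(Finds(RegexOptions.Multiline))  ' False: only ^ and $ change
        Console.WriteLine(Finds(RegexOptions.Singleline)) ' True: . now matches newlines
    End Sub
End Module
```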

@Tim's answer is a good one, and should be the accepted answer, but one more thing that is stopping your code from working is that your pattern has no capturing group, so there is nothing for Groups(1) to return. Groups(0) is the whole match.

Change...

MsgBox(successfulMatch.Groups(1).ToString)

To...

MsgBox(successfulMatch.Groups(0).ToString)
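A quick way to see the difference, sketched with a made-up one-line input and a hypothetical GroupValue helper:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupIndexDemo
    ' Hypothetical one-line input for illustration.
    Public Const SampleHtml As String = "<div class=""entry"">Some company</div>"

    ' Returns the value of the group at the given index for the first match.
    Public Function GroupValue(pattern As String, index As Integer) As String
        Dim m As Match = Regex.Match(SampleHtml, pattern, RegexOptions.Singleline)
        Return m.Groups(index).Value
    End Function

    Sub Main()
        Dim withoutGroup As String = "<div[^<>]*class=""entry""[^<>]*>.*?</div>"
        Dim withGroup As String = "<div[^<>]*class=""entry""[^<>]*>(.*?)</div>"

        Console.WriteLine(GroupValue(withoutGroup, 0))      ' the whole match, tags included
        Console.WriteLine(GroupValue(withoutGroup, 1) = "") ' True: there is no group 1
        Console.WriteLine(GroupValue(withGroup, 1))         ' Some company
    End Sub
End Module
```

So Groups(0) always works, and Groups(1) only becomes useful once the pattern contains a pair of capturing parentheses.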


Author by

Mrk Fldig

Updated on July 05, 2022

Comments

Mrk Fldig almost 2 years
Having a bit of a brain freeze here, so I was hoping for some pointers. Essentially, I need to extract the contents of a specific div tag. Yes, I know that regex usually isn't approved of for this, but it's a simple web scraping application where there are no nested divs.

I'm trying to match this:
```
<div class="entry">
  <span class="title">Some company</span>
  <span class="description">
    <strong>Address: </strong>Some address
    <br /><strong>Telephone: </strong> 01908 12345
  </span>
</div>
```
The simple VB code is as follows:
```
    Dim myMatches As MatchCollection
    Dim myRegex As New Regex("<div.*?class=""entry"".*?>.*</div>", RegexOptions.Singleline)
    Dim wc As New WebClient
    Dim html As String = wc.DownloadString("http://somewebaddress.com")
    RichTextBox1.Text = html
    myMatches = myRegex.Matches(html)
    MsgBox(html)
    'Search for all the words in a string
    Dim successfulMatch As Match
    For Each successfulMatch In myMatches
        MsgBox(successfulMatch.Groups(1).ToString)
    Next
```
Any help would be greatly appreciated.
- Richard almost 12 years
  
  possible duplicate of RegEx match open tags except XHTML self-contained tags
- Tim Pietzcker almost 12 years
  
  And what's wrong with the regex you're having? It matches your input.
- Mrk Fldig almost 12 years
  
  Well, that's the odd bit: it's not matching anything on the entire page, and there are about 20 of those divs on there.
- freefaller almost 12 years
  
  I know that @Tim has answered this in a much better way than I could, but for your future reference, there is no 2nd group, so Groups(1) (which is base-0 index) will always return an empty string... it should be Groups(0)
Tim Pietzcker almost 12 years

Careful, this matches all div tags (not just those with class="entry"), and it matches everything from the very first opening <div> to the very last closing </div>.
Mrk Fldig almost 12 years

Used <div.*?class=""entry"".*?>(?<divBody>.*)</div>, which isn't working. As Tim said, it should match everything, but apparently it doesn't.
Mrk Fldig almost 12 years

You, my friend, are a star! If I wanted to get each element inside that div, i.e. the span class values, I'd just add .*?<span[^<>]*class="title" after the closing > of the div tag?
freefaller almost 12 years

I believe the reason the original code isn't picking anything up is that it should be Groups(0) instead of Groups(1).
Tim Pietzcker almost 12 years

@MarcFielding: I have edited my answer: The named capturing group (?<content>.*?) will capture everything between the divs.
Mrk Fldig almost 12 years

@freefaller Yeah, I noticed that one. I was actually using a breakpoint and examining the match collection to see if it was picking up anything.
Mrk Fldig almost 12 years

I'm going to mark Tim's answer as the correct one, although I wouldn't mind knowing how to extract the values of each span so I can pull the company name, address and phone number, if you're feeling energetic, Tim?
Tim Pietzcker almost 12 years

@MarcFielding: This is only (reasonably) possible if the spans are always in the same order, and it's going to be messy in any case. Regular expressions are really the wrong tool for this. For example, how can you tell when an address is over? I'll post a (brittle) example regex that will work on your example, but that will likely fail on anything that looks a bit different.
Mrk Fldig almost 12 years

Thanks Tim, the components within that div are always in the same order, I could always split them by ">" into an array and do some substringing but I was wondering if there was an easier way.
Mrk Fldig almost 12 years

Tim, that's superb and saved me loads of time. If I could do anything more than tick the box, I would. Superb!
Tim Pietzcker almost 12 years

@MarcFielding: Great to hear it. A suggestion: RegexBuddy is a great tool for constructing, debugging and learning regexes. It will pay for itself within days in terms of increased productivity.