Regex to extract the contents of a <div> tag

24,001

Solution 1

Your regex works for your example. There are some improvements that should be made, though:

<div[^<>]*class="entry"[^<>]*>(?<content>.*?)</div>

[^<>]* means "match any number of characters except angle brackets", ensuring that we don't accidentally break out of the tag we're in.

.*? (note the ?) means "match any number of characters, but as few as possible". This keeps a single match from running from the first <div class="entry"> all the way to the last </div> on your page.
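
To see the difference the ? makes, here is a minimal sketch. The two-div input string is made up for illustration; the patterns are the one above and its greedy counterpart.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyVersusGreedy
    Sub Main()
        ' Hypothetical input: two entry divs on a single line.
        Dim html As String = "<div class=""entry"">first</div><div class=""entry"">second</div>"

        ' Greedy .* runs on to the last </div>, so both divs land in one match.
        Dim greedyRegex As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*)</div>")
        ' Lazy .*? stops at the first </div>, giving one match per div.
        Dim lazyRegex As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>")

        Console.WriteLine(greedyRegex.Matches(html).Count)   ' 1
        Console.WriteLine(lazyRegex.Matches(html).Count)     ' 2

        For Each m As Match In lazyRegex.Matches(html)
            Console.WriteLine(m.Groups("content").Value)     ' first, then second
        Next
    End Sub
End Module
```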

But your regex itself should still have matched something. Perhaps you're not using it correctly?

I don't know Visual Basic, so this is just a shot in the dark, but RegexBuddy suggests the following approach:

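' Singleline lets . match line breaks, which the multi-line sample div needs.
Dim RegexObj As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>", RegexOptions.Singleline)
' SubjectString holds the downloaded HTML; ResultList is assumed to be a List(Of String) declared elsewhere.
Dim MatchResult As Match = RegexObj.Match(SubjectString)
While MatchResult.Success
    ResultList.Add(MatchResult.Groups("content").Value)
    MatchResult = MatchResult.NextMatch()
End While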

I would recommend against taking the regex approach any further than this. If you insist, you'll end up with a monster regex like the following, which will only work if the form of the div's contents never varies:

<div[^<>]*class="entry"[^<>]*>\s*
<span[^<>]*class="title"[^<>]*>\s*
(?<title>.*?)
\s*</span>\s*
<span[^<>]*class="description"[^<>]*>\s*
<strong>\s*Address:\s*</strong>\s*
(?<address>.*?)
\s*<strong>\s*Telephone:\s*</strong>\s*
(?<phone>.*?)
\s*</span>\s*</div>

or (behold the joy of multiline strings in VB.NET):

Dim RegexObj As New Regex(
    "<div[^<>]*class=""entry""[^<>]*>\s*" & chr(10) & _
    "<span[^<>]*class=""title""[^<>]*>\s*" & chr(10) & _
    "(?<title>.*?)" & chr(10) & _
    "\s*</span>\s*" & chr(10) & _
    "<span[^<>]*class=""description""[^<>]*>\s*" & chr(10) & _
    "<strong>\s*Address:\s*</strong>\s*" & chr(10) & _
    "(?<address>.*?)" & chr(10) & _
    "\s*<strong>\s*Telephone:\s*</strong>\s*" & chr(10) & _
    "(?<phone>.*?)" & chr(10) & _
    "\s*</span>\s*</div>", 
    RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)

(Of course, now you need to store the results for MatchResult.Groups("title") etc...)
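
As a rough sketch of that last step, the following runs the pattern against the sample markup from the question and reads the three named groups. The module name, the variable names and the <br /> cleanup are my own additions, not part of the answer; the pattern pieces are concatenated without newlines here, so only RegexOptions.Singleline is needed.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module NamedGroupsDemo
    Sub Main()
        ' Sample markup from the question, joined with line feeds.
        Dim html As String = String.Join(vbLf,
            "<div class=""entry"">",
            "  <span class=""title"">Some company</span>",
            "  <span class=""description"">",
            "  <strong>Address: </strong>Some address",
            "    <br /><strong>Telephone: </strong> 01908 12345",
            "  </span>",
            "</div>")

        Dim pattern As String =
            "<div[^<>]*class=""entry""[^<>]*>\s*" &
            "<span[^<>]*class=""title""[^<>]*>\s*" &
            "(?<title>.*?)" &
            "\s*</span>\s*" &
            "<span[^<>]*class=""description""[^<>]*>\s*" &
            "<strong>\s*Address:\s*</strong>\s*" &
            "(?<address>.*?)" &
            "\s*<strong>\s*Telephone:\s*</strong>\s*" &
            "(?<phone>.*?)" &
            "\s*</span>\s*</div>"
        Dim entryRegex As New Regex(pattern, RegexOptions.Singleline)

        Dim m As Match = entryRegex.Match(html)
        While m.Success
            Console.WriteLine("Title:   " & m.Groups("title").Value)
            ' On this sample the address group also picks up the <br /> that
            ' precedes the Telephone label, so strip it before use.
            Dim address As String = Regex.Replace(m.Groups("address").Value, "\s*<br\s*/?>\s*$", "")
            Console.WriteLine("Address: " & address)
            Console.WriteLine("Phone:   " & m.Groups("phone").Value)
            m = m.NextMatch()
        End While
    End Sub
End Module
```

On the sample this should print "Some company", "Some address" and "01908 12345". The need for the extra cleanup step is a small taste of how brittle the approach is.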

Solution 2

Try using RegexOptions.Multiline instead of RegexOptions.Singleline

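Thanks to @Tim for pointing out that the above doesn't work... my bad. RegexOptions.Multiline only changes where ^ and $ match; RegexOptions.Singleline is the option that lets . match line breaks, so the original option was the right one.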
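
For anyone who wants to check this themselves, here is a minimal sketch comparing the options. The three-line input is a made-up stand-in for one entry div.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module SinglelineDemo
    Sub Main()
        ' Hypothetical div whose contents span several lines.
        Dim html As String = "<div class=""entry"">" & vbLf & "Some company" & vbLf & "</div>"
        Dim pattern As String = "<div[^<>]*class=""entry""[^<>]*>.*?</div>"

        ' By default . stops at line feeds, so the closing tag is never reached.
        Console.WriteLine(Regex.IsMatch(html, pattern))                            ' False
        ' Multiline only affects ^ and $, so it does not help here.
        Console.WriteLine(Regex.IsMatch(html, pattern, RegexOptions.Multiline))    ' False
        ' Singleline makes . match line feeds as well.
        Console.WriteLine(Regex.IsMatch(html, pattern, RegexOptions.Singleline))   ' True
    End Sub
End Module
```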

@Tim's answer is a good one and should be the accepted answer, but something else is also stopping your code from working: your regex contains no capturing group, so there is nothing for Groups(1) to return.

Change...

MsgBox(successfulMatch.Groups(1).ToString)

To...

MsgBox(successfulMatch.Groups(0).ToString)
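
A small sketch of what the two indexes return, using a one-line stand-in for the page (the input string and the names are assumptions for illustration):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupIndexDemo
    Sub Main()
        Dim html As String = "<div class=""entry"">Some company</div>"

        ' No parentheses in the pattern, so only group 0 (the whole match) exists.
        Dim noGroup As Match = Regex.Match(html, "<div.*?class=""entry"".*?>.*?</div>")
        Console.WriteLine(noGroup.Groups.Count)        ' 1
        Console.WriteLine(noGroup.Groups(0).Value)     ' the full <div ...>...</div> text
        Console.WriteLine(noGroup.Groups(1).Success)   ' False, and its Value is empty

        ' With one capturing group added, Groups(1) holds the div's contents.
        Dim withGroup As Match = Regex.Match(html, "<div.*?class=""entry"".*?>(.*?)</div>")
        Console.WriteLine(withGroup.Groups(1).Value)   ' Some company
    End Sub
End Module
```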

Author: Mrk Fldig

Updated on July 05, 2022

Comments

  • Mrk Fldig
    Mrk Fldig almost 2 years

    Having a bit of a brain freeze here, so I was hoping for some pointers. Essentially, I need to extract the contents of a specific div tag. Yes, I know that regex usually isn't approved of for this, but it's a simple web scraping application where there are no nested divs.

    I'm trying to match this:

        <div class="entry">
          <span class="title">Some company</span>
          <span class="description">
            <strong>Address: </strong>Some address
            <br /><strong>Telephone: </strong> 01908 12345
          </span>
        </div>
    

    The simple VB code is as follows:

        Dim myMatches As MatchCollection
        Dim myRegex As New Regex("<div.*?class=""entry"".*?>.*</div>", RegexOptions.Singleline)
        Dim wc As New WebClient
        Dim html As String = wc.DownloadString("http://somewebaddress.com")
        RichTextBox1.Text = html
        myMatches = myRegex.Matches(html)
        MsgBox(html)
        'Search for all the words in a string
        Dim successfulMatch As Match
        For Each successfulMatch In myMatches
            MsgBox(successfulMatch.Groups(1).ToString)
        Next
    

    Any help would be greatly appreciated.

    • Tim Pietzcker
      Tim Pietzcker almost 12 years
      And what's wrong with the regex you have? It matches your input.
    • Mrk Fldig
      Mrk Fldig almost 12 years
      Well, that's the odd bit: it's not matching anything on the entire page, and there are about 20 of those divs on there.
    • freefaller
      freefaller almost 12 years
      I know that @Tim has answered this in a much better way than I could, but for future reference: your regex has no capturing group, so Groups(1) (the collection is zero-based, and index 0 is the whole match) will always return an empty string. It should be Groups(0).
  • Tim Pietzcker
    Tim Pietzcker almost 12 years
    Careful, this matches all div tags (not just those with class="entry"), and it matches everything from the very first opening <div> to the very last closing </div>.
  • Mrk Fldig
    Mrk Fldig almost 12 years
    Used <div.*?class=""entry"".*?>(?<divBody>.*)</div> and it's not working. As Tim said, it should match everything, but apparently it doesn't.
  • Mrk Fldig
    Mrk Fldig almost 12 years
    You, my friend, are a star! If I wanted to get each element inside that div, i.e. the span class values, I'd just add .*?<span[^<>]*class="title" after the closing > of the div tag?
  • freefaller
    freefaller almost 12 years
    I believe the original code is not picking anything up because it should be Groups(0) instead of Groups(1).
  • Tim Pietzcker
    Tim Pietzcker almost 12 years
    @MarcFielding: I have edited my answer: The named capturing group (?<content>.*?) will capture everything between the divs.
  • Mrk Fldig
    Mrk Fldig almost 12 years
    @freefaller Yeah, I noticed that one. I was actually using a breakpoint and examining the match collection to see if it was picking up anything.
  • Mrk Fldig
    Mrk Fldig almost 12 years
    I'm going to mark Tim's answer as the correct one, although I wouldn't mind knowing how to extract the values of each span so I can pull out the company name, address and phone number, if you're feeling energetic, Tim?
  • Tim Pietzcker
    Tim Pietzcker almost 12 years
    @MarcFielding: This is only (reasonably) possible if the spans are always in the same order, and it's going to be messy in any case. Regular expressions are really the wrong tool for this. For example, how can you tell when an address is over? I'll post a (brittle) example regex that will work on your example, but that will likely fail on anything that looks a bit different.
  • Mrk Fldig
    Mrk Fldig almost 12 years
    Thanks Tim, the components within that div are always in the same order. I could always split them by ">" into an array and do some substringing, but I was wondering if there was an easier way.
  • Mrk Fldig
    Mrk Fldig almost 12 years
    Tim, that's superb and saved me loads of time. If I could do anything more than tick the box, I would. Superb!
  • Tim Pietzcker
    Tim Pietzcker almost 12 years
    @MarcFielding: Great to hear it. A suggestion: RegexBuddy is a great tool for constructing, debugging and learning regexes. It will pay for itself within days in terms of increased productivity.