Scrape text from a website using Excel VBA

You're almost there! doc.GetElementsByTagName("p") returns a collection of HTMLParagraphElement objects, of which you accessed only the first entry using doc.GetElementsByTagName("p")(0). As you allude to, a For Each loop lets you access each one in turn:

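Sub get_title_header()
    Dim wb As Object
    Dim doc As Object
    Dim el As Object
    Dim sURL As String
    Dim sText As String
    Dim lastrow As Long
    Dim i As Long   ' Long rather than Integer: row numbers can exceed 32,767

    lastrow = Sheet1.Cells(Sheet1.Rows.Count, "A").End(xlUp).Row

    For i = 2 To lastrow
        Set wb = CreateObject("InternetExplorer.Application")
        sURL = Sheet1.Cells(i, 1).Value

        wb.navigate sURL
        wb.Visible = True

        ' Wait until the page has finished loading (READYSTATE_COMPLETE = 4)
        While wb.Busy Or wb.readyState <> 4
            DoEvents
        Wend

        'HTML document
        Set doc = wb.document

        Sheet1.Cells(i, 2).Value = doc.Title

        ' Join the text of every <p> element, separated by ", "
        sText = ""
        On Error Resume Next   ' skip any element whose text cannot be read
        For Each el In doc.GetElementsByTagName("p")
            If Len(sText) > 0 Then sText = sText & ", "
            sText = sText & el.innerText
        Next el
        On Error GoTo 0

        Sheet1.Cells(i, 3).Value = sText

        wb.Quit
        Sheet1.Range(Sheet1.Cells(i, 1), Sheet1.Cells(i, 3)).Columns.AutoFit
    Next i

End Sub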
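
To see the difference between (0) and a loop without opening a browser, here is a minimal sketch that runs against a small in-memory document. The Sub name and the sample HTML are made up for illustration. It assumes the late-bound "htmlfile" object is available on your machine, and it prints to the Immediate window (Ctrl+G).

```vba
Sub demo_paragraph_loop()
    Dim doc As Object
    Dim paras As Object
    Dim el As Object
    Dim j As Long

    ' Build a small in-memory HTML document (late bound, so no references are needed).
    ' The markup below is sample data, not taken from any real page.
    Set doc = CreateObject("htmlfile")
    doc.write "<html><body><p>first</p><p>second</p><p>third</p></body></html>"
    doc.Close

    Set paras = doc.getElementsByTagName("p")

    ' (0) returns only the first entry of the collection
    Debug.Print "First only: " & paras(0).innerText

    ' For Each visits every entry in document order
    For Each el In paras
        Debug.Print "For Each: " & el.innerText
    Next el

    ' The same walk by zero-based index, using the collection's Length property
    For j = 0 To paras.Length - 1
        Debug.Print "Index " & j & ": " & paras(j).innerText
    Next j
End Sub
```

The indexed form may be handy when you need the position of each paragraph, for example to decide which column it goes in.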
Author by

RobbertT

Updated on April 25, 2020

Comments

  • RobbertT
    RobbertT about 4 years

    I found this article explaining how to scrape certain tags from a website using Excel VBA.

    The code below gets the content from the first <p> tag that it finds:

    Sub get_title_header()
    Dim wb As Object
    Dim doc As Object
    Dim sURL As String
    Dim lastrow As Long
    lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
    
    For i = 2 To lastrow
        Set wb = CreateObject("internetExplorer.Application")
        sURL = Cells(i, 1)
    
        wb.navigate sURL
        wb.Visible = True
    
        While wb.Busy
            DoEvents
        Wend
    
        'HTML document
        Set doc = wb.document
    
        Cells(i, 2) = doc.title
    
        On Error GoTo err_clear
        Cells(i, 3) = doc.GetElementsByTagName("p")(0).innerText
        err_clear:
        If Err <> 0 Then
            Err.Clear
            Resume Next
        End If
        wb.Quit
        Range(Cells(i, 1), Cells(i, 3)).Columns.AutoFit
    Next i
    
    End Sub
    

    I'd like the scraper to get the content of every <p> tag on a webpage, so I guess a For Each loop of some kind is missing.

    How can the content from multiple <p> tags be collected?

    UPDATE: The working code!

    Sub get_title_header()
    Dim wb As Object
    Dim doc As Object
    Dim sURL As String
    Dim lastrow As Long
    Dim i As Long
    Dim counter As Long
    lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
    
    For i = 2 To lastrow
        Set wb = CreateObject("internetExplorer.Application")
        sURL = Cells(i, 1)
    
        wb.navigate sURL
        wb.Visible = True
    
        While wb.Busy
            DoEvents
        Wend
    
        'HTML document
        Set doc = wb.document
    
        Cells(i, 2) = doc.Title
    
        On Error GoTo err_clear
    
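        Dim el As Object
        counter = 0   ' reset before each URL so output always starts in column C
        For Each el In doc.GetElementsByTagName("p")
    
            counter = counter + 1
            ' one <p> per column: the first goes to column 3 (C), the next to D, ...
            Cells(i, counter + 2).Value = el.innerText
    
        Next el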
    
        err_clear:
        If Err <> 0 Then
            Err.Clear
            Resume Next
        End If
        wb.Quit
        Range(Cells(i, 1), Cells(i, 10)).Columns.AutoFit
    Next i
    
    End Sub
    
  • RobbertT
    RobbertT about 9 years
    Hi, thanks. I tried your code, but it doesn't return anything. Is something missing? Thanks again!
  • stucharo
    stucharo about 9 years
    @RobbertT I've copied in the code I tested. It's a modified version of your code which, for simplicity, doesn't have any of the worksheet references. Also, all the output goes to the Immediate window (Ctrl+G). This worked perfectly on my PC and shouldn't be too hard to modify as necessary for your purposes.
  • RobbertT
    RobbertT about 9 years
    Awesome, that worked! However, I'm still trying to figure out how to make it work with the URLs in column A and the <p> content in column B of an Excel sheet. You could download the example at the bottom of the article that I mentioned to see what I mean. Thanks
  • stucharo
    stucharo about 9 years
    @RobbertT Again, you were almost there. This just loops through each <p> tag and concatenates its text to the string that is already in column 3.
  • RobbertT
    RobbertT about 9 years
    This is so amazingly cool, I had no idea such a thing was possible with Excel. Awesome, works like a charm. One more request: is it possible to put the content of every <p> found in a separate cell (column) on the same row? Thanks!
  • stucharo
    stucharo about 9 years
    @RobbertT Of course, try including a counter inside the For Each loop and using that to increment the column that you write el.innerText to.
  • RobbertT
    RobbertT about 9 years
    Hi, yes, I figured it should be inside the For Each loop. I did some searching on Google and I guess that a + 1 could do that, correct? Should this be placed in the line Cells(i, 3).Value = Cells(i, 3).Value & ", " & el.innerText ?
  • stucharo
    stucharo about 9 years
    @RobbertT Yeah, you'd want to use something like counter = counter + 1 in the For Each loop, then use counter as the column argument to the Cells property, i.e. Cells(i, counter + 2).Value. Remember to reset the counter back to 0 before you enter the For Each loop each time, or it'll contain the last column of the previous URL. Post your modified code back into the question if you get stuck and someone should be able to help you.
  • RobbertT
    RobbertT about 9 years
    Cool, see my working code in my question now! Thanks! Could you confirm this is all correct? EDIT: See my edited question, something goes wrong. Might it be the counter reset?
  • DRC
    DRC about 3 years
    I get a "User defined type not defined" error for this line of code: Dim html As HTMLDocument
  • DRC
    DRC about 3 years
    Ah: In addition to a reference to the Microsoft XML, v6.0 library, you also have to add a reference to the Microsoft HTML Object Library
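
The error DRC mentions comes from early binding: Dim html As HTMLDocument only compiles once the Microsoft HTML Object Library is referenced. The XMLHTTP-based code DRC was running is not shown on this page, so the following is only a minimal late-bound sketch of that kind of approach, which should need no references at all. The Sub name and the URL are placeholders, and it assumes MSXML 6.0 and the "htmlfile" object are installed.

```vba
Sub demo_late_bound_html()
    Dim http As Object
    Dim html As Object   ' late bound: no "Microsoft HTML Object Library" reference needed
    Dim el As Object

    ' Fetch the page source synchronously (placeholder URL, replace with your own)
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", "https://example.com/", False
    http.send

    ' Parse the response into an in-memory HTML document
    Set html = CreateObject("htmlfile")
    html.body.innerHTML = http.responseText

    ' Print the text of every <p> element to the Immediate window (Ctrl+G)
    For Each el In html.getElementsByTagName("p")
        Debug.Print el.innerText
    Next el
End Sub
```

Late binding trades IntelliSense and compile-time checks for portability. If you prefer the early-bound declarations, add both references under Tools > References, as described in the comment above.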