Scrape text from a website using Excel VBA
You're almost there! doc.GetElementsByTagName("p") returns a collection of HTMLParagraphElement objects, of which you accessed the first entry using doc.GetElementsByTagName("p")(0). As you allude to, a For Each loop would let you access each in turn:
Sub get_title_header()
    Dim wb As Object
    Dim doc As Object
    Dim el As Object
    Dim sURL As String
    Dim lastrow As Long
    Dim i As Long   ' Long rather than Integer, so large row counts cannot overflow

    lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
    For i = 2 To lastrow
        Set wb = CreateObject("internetExplorer.Application")
        sURL = Cells(i, 1)
        wb.navigate sURL
        wb.Visible = True
        ' Wait while the browser is busy loading the page
        While wb.Busy
            DoEvents
        Wend
        'HTML document
        Set doc = wb.document
        Cells(i, 2) = doc.Title
        On Error GoTo err_clear
        ' Append the text of every <p> element to column 3
        For Each el In doc.GetElementsByTagName("p")
            Cells(i, 3).Value = Cells(i, 3).Value & ", " & el.innerText
        Next el
err_clear:
        If Err <> 0 Then
            Err.Clear
            Resume Next
        End If
        wb.Quit
        Range(Cells(i, 1), Cells(i, 3)).Columns.AutoFit
    Next i
End Sub
Author: RobbertT
Updated on April 25, 2020
Comments
-
RobbertT about 4 years
I found this article explaining how to scrape certain tags from a website using Excel VBA.
The code below gets the content from the first <p> tag that it finds:
Sub get_title_header()
    Dim wb As Object
    Dim doc As Object
    Dim sURL As String
    Dim lastrow As Long
    lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
    For i = 2 To lastrow
        Set wb = CreateObject("internetExplorer.Application")
        sURL = Cells(i, 1)
        wb.navigate sURL
        wb.Visible = True
        While wb.Busy
            DoEvents
        Wend
        'HTML document
        Set doc = wb.document
        Cells(i, 2) = doc.title
        On Error GoTo err_clear
        Cells(i, 3) = doc.GetElementsByTagName("p")(0).innerText
err_clear:
        If Err <> 0 Then
            Err.Clear
            Resume Next
        End If
        wb.Quit
        Range(Cells(i, 1), Cells(i, 3)).Columns.AutoFit
    Next i
End Sub
I'd like to make the scraper get all the content that is within a <p> tag on a webpage. So I guess a foreach functionality of some kind is missing.
How can the content from multiple <p> tags be collected?
UPDATE: The working code!
Sub get_title_header()
    Dim wb As Object
    Dim doc As Object
    Dim sURL As String
    Dim lastrow As Long
    Dim i As Integer
    lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
    For i = 2 To lastrow
        Set wb = CreateObject("internetExplorer.Application")
        sURL = Cells(i, 1)
        wb.navigate sURL
        wb.Visible = True
        While wb.Busy
            DoEvents
        Wend
        'HTML document
        Set doc = wb.document
        Cells(i, 2) = doc.Title
        On Error GoTo err_clear
        Dim el As Object
        For Each el In doc.GetElementsByTagName("p")
            counter = counter + 1
            Cells(i, counter + 2).Value = Cells(counter + 1).Value & el.innerText
        Next el
        counter = 0
err_clear:
        If Err <> 0 Then
            Err.Clear
            Resume Next
        End If
        wb.Quit
        Range(Cells(i, 1), Cells(i, 10)).Columns.AutoFit
    Next i
End Sub
-
RobbertT about 9 years
Hi, thanks. I tried your code but it doesn't return anything. Is something missing? Thanks again!
-
stucharo about 9 years
@RobbertT I've copied in the code I tested. It's a modified version of your code which doesn't have any of the worksheet references, for simplicity. Also, all the output goes to the Immediate window (Ctrl+G). This worked perfectly on my PC and shouldn't be too hard to modify as necessary for your purposes.
-
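The tested code mentioned in the comment above is not reproduced in this thread. As a rough sketch only, and not stucharo's actual code, a worksheet-free version that prints every <p> to the Immediate window might look like the following. The sub name PrintParagraphs and the URL are placeholder assumptions, and the readyState check is an added precaution that does not appear in the original code.

```vba
Option Explicit

' Sketch: print the text of every <p> on one page to the Immediate window.
' The sub name and URL are placeholder assumptions, not from the thread.
Sub PrintParagraphs()
    Dim wb As Object
    Dim doc As Object
    Dim el As Object

    Set wb = CreateObject("InternetExplorer.Application")
    wb.navigate "https://example.com"

    ' Wait for the page to finish loading (4 = READYSTATE_COMPLETE)
    Do While wb.Busy Or wb.readyState <> 4
        DoEvents
    Loop

    Set doc = wb.document
    For Each el In doc.getElementsByTagName("p")
        Debug.Print el.innerText
    Next el

    wb.Quit
End Sub
```

Open the Immediate window with Ctrl+G before running the sub to see the output.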
RobbertT about 9 years
Awesome, that worked! However, I'm still trying to figure out how to make it work with the URLs in column A and the <p> content in column B of an Excel sheet. You could download the example at the bottom of the article that I mentioned to see what I mean. Thanks
-
stucharo about 9 years
@RobbertT Again, you were almost there. This just loops through each <p> tag and concatenates the text to the string that is already in column 3.
-
RobbertT about 9 years
This is so amazingly cool, I had no idea such a thing was possible with Excel. Awesome, works like a charm. One more request: is it possible to put the content of every <p> found in a separate cell (column) on the same row? Thanks!
-
stucharo about 9 years
@RobbertT Of course, try including a counter inside the For Each loop and using that to increment the column that you write el.innerText to.
-
RobbertT about 9 years
Hi, yes, I figured it should be inside the For Each loop. I did some searching on Google and I guess that a + 1 could do that, correct? Should this be placed in the line Cells(i, 3).Value = Cells(i, 3).Value & ", " & el.innerText?
-
stucharo about 9 years
@RobbertT Yeah, you'd want to use something like counter = counter + 1 in the For Each loop, then use counter as the column argument to the Cells property, i.e. Cells(i, counter + 2).Value. Remember to reset the counter back to 0 before you enter the For Each loop each time or it'll contain the last column of the previous URL. Post your modified code back into the question if you get stuck and someone should be able to help you.
-
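The counter approach described in the comment above could be sketched roughly as follows. The helper name WriteParagraphsToRow and its parameters are hypothetical, introduced only for illustration. It assumes doc is an already loaded HTML document and rowIndex is the row of the current URL.

```vba
Option Explicit

' Hypothetical helper: writes the text of every <p> in doc to row rowIndex,
' one paragraph per column, starting at column 3.
Sub WriteParagraphsToRow(ByVal doc As Object, ByVal rowIndex As Long)
    Dim el As Object
    Dim counter As Long

    ' Reset before the loop so each URL starts again at column 3
    counter = 0
    For Each el In doc.getElementsByTagName("p")
        counter = counter + 1
        ' counter + 2 maps the first paragraph to column 3, the second to column 4, ...
        Cells(rowIndex, counter + 2).Value = el.innerText
    Next el
End Sub
```

Inside the loop over URLs it would be called as WriteParagraphsToRow doc, i once doc has been set. Because the reset happens before the For Each loop rather than after it, an error inside the loop cannot leave a stale column offset for the next URL.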
RobbertT about 9 years
Cool, see my working code in my question now! Thanks! Could you confirm this is all correct? EDIT: See my edited question, something goes wrong. Might be the counter reset?
-
DRC about 3 years
I get a "User defined type not defined" error for this line of code: Dim html As HTMLDocument
-
DRC about 3 years
Ah: In addition to a reference to the Microsoft XML, v6.0 library, you also have to add a reference to the Microsoft HTML Object Library
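The "User defined type not defined" error occurs because the HTMLDocument type is only known to VBA once the Microsoft HTML Object Library is referenced. As a minimal sketch of an alternative that needs no reference, the document can be declared As Object and created with late binding. The sub name and the sample markup below are made up for illustration.

```vba
Option Explicit

' Sketch: late-bound HTML document, so no library reference is required.
Sub CountParagraphsLateBound()
    Dim html As Object
    Set html = CreateObject("htmlfile")

    ' Sample markup, assumed purely for illustration
    html.body.innerHTML = "<p>first</p><p>second</p>"

    ' Print the number of <p> elements to the Immediate window
    Debug.Print html.getElementsByTagName("p").Length

    ' Early-bound equivalent (needs the Microsoft HTML Object Library reference):
    '   Dim html As MSHTML.HTMLDocument
    '   Set html = New MSHTML.HTMLDocument
End Sub
```

Late binding avoids the reference at the cost of IntelliSense and compile-time type checking; with the reference added (Tools > References in the VBA editor), the early-bound declaration compiles as well.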