Parse HTML content in VBA
Just a couple things that hopefully will get you in the right direction:
clean up a bit: remove the readystate property testing loop. The value returned by the readystate property will never change in this context - code will pause after the send instruction, to resume only once the server response is received, or has failed to do so. The readystate property will be set accordingly, and the code will resume execution. You should still test for the ready state, but the loop is just unnecessary
target the right HTML elements: you are searching through the tr elements - while the logic of how you use these elements in your code actually looks to point to td elements
make sure the properties are actually available for the objects you are using them on: to help you with this, try and declare all your variable as specific objects instead of the generic Object. This will activate intellisense. If you have a difficult time finding the actual name of your object as defined in the relevant library in a first place, declare it as the generic Object, run your code, and then inspect the type of the object - by printing typename(your_object) to the debug window for instance. This should put you on your way
I have also included some code below that may help. If you still can't get this to work and you can share your urls - plz do that.
Sub getInfoWeb()
Dim cell As Integer
Dim xhr As MSXML2.XMLHTTP60
Dim doc As MSHTML.HTMLDocument
Dim table As MSHTML.HTMLTable
Dim tableCells As MSHTML.IHTMLElementCollection
Set xhr = New MSXML2.XMLHTTP60
For cell = 1 To 5
ItemNbr = Cells(cell, 3).Value
With xhr
.Open "GET", "http://www.example.com/?item=" & ItemNbr, False
.send
If .readyState = 4 And .Status = 200 Then
Set doc = New MSHTML.HTMLDocument
doc.body.innerHTML = .responseText
Else
MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
vbNewLine & "HTTP request status: " & .Status
End If
End With
Set table = doc.getElementById("list-table")
Set tableCells = table.getElementsByTagName("td")
For Each tableCell In tableCells
If tableCell.getAttribute("title") = "Material" Then
Cells(cell, 14).Value = tableCell.NextSibling.innerHTML
End If
Next tableCell
Next cell
End Sub
EDIT: as a follow-up to the further information you provided in the comment below - and the additional comments I have added
'Determine your product number
'Open an xhr for your source url, and retrieve the product number from there - search for the tag which
'text include the "productnummer:" substring, and extract the product number from the outerstring
'OR
'if the product number consistently consists of the fctkeywords you are entering in your source url
'with two "0" appended - just build the product number like that
'Open an new xhr for this url "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=" & product_number & "&_search=false&rows=-1&page=1&sidx=&sord=asc"
'Load the response in an XML document, and retrieve the material information
Sub getInfoWeb()
Dim xhr As MSXML2.XMLHTTP60
Dim doc As MSXML2.DOMDocument60
Dim xmlCell As MSXML2.IXMLDOMElement
Dim xmlCells As MSXML2.IXMLDOMNodeList
Dim materialValueElement As MSXML2.IXMLDOMElement
Set xhr = New MSXML2.XMLHTTP60
With xhr
.Open "GET", "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=10031700&_search=false&rows=-1&page=1&sidx=&sord=asc", False
.send
If .readyState = 4 And .Status = 200 Then
Set doc = New MSXML2.DOMDocument60
doc.LoadXML .responseText
Else
MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
vbNewLine & "HTTP request status: " & .Status
End If
End With
Set xmlCells = doc.getElementsByTagName("cell")
For Each xmlCell In xmlCells
If xmlCell.Text = "Materiaal" Then
Set materialValueElement = xmlCell.NextSibling
End If
Next
MsgBox materialValueElement.Text
End Sub
EDIT2: an alternative automating IE
Sub searchWebViaIE()
Dim ie As SHDocVw.InternetExplorer
Dim doc As MSHTML.HTMLDocument
Dim anchors As MSHTML.IHTMLElementCollection
Dim anchor As MSHTML.HTMLAnchorElement
Dim prodSpec As MSHTML.HTMLAnchorElement
Dim tableCells As MSHTML.IHTMLElementCollection
Dim materialValueElement As MSHTML.HTMLTableCell
Dim tableCell As MSHTML.HTMLTableCell
Set ie = New SHDocVw.InternetExplorer
With ie
.navigate "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2facetmain.p?fctkeywords=100317&world=general#tabs-4"
.Visible = True
Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
DoEvents
Loop
Set doc = .document
Set anchors = doc.getElementsByTagName("a")
For Each anchor In anchors
If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
anchor.Click
Exit For
End If
Next anchor
Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
DoEvents
Loop
End With
For Each anchor In anchors
If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
Set prodSpec = anchor
End If
Next anchor
Set tableCells = doc.getElementById("list-table").getElementsByTagName("td")
If Not tableCells Is Nothing Then
For Each tableCell In tableCells
If tableCell.innerHTML = "Materiaal" Then
Set materialValueElement = tableCell.NextSibling
End If
Next tableCell
End If
MsgBox materialValueElement.innerHTML
End Sub
Tdev
Updated on January 05, 2022Comments
-
Tdev over 2 years
I have a question relating to HTML parsing. I have a website with some products and I would like to catch text within page into my current spreadsheet. This spreadsheet is quite big but contains ItemNbr in 3rd column, I expect the text in the 14th column and one row corresponds to one product (item).
My idea is to fetch the 'Material' on the webpage which is inside the Innertext after tag. The id number changes from one page to page (sometimes ).
Here is the structure of the website:
<div style="position:relative;"> <div></div> <table id="list-table" width="100%" tabindex="1" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_list-table" class="ui-jqgrid-btable" style="width: 930px;"> <tbody> <tr class="jqgfirstrow" role="row" style="height:auto"> <td ...</td> <td ...</td> </tr> <tr role="row" id="1" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr"> <td ...</td> <td ...</td> </tr> <tr role="row" id="2" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr"> <td ...</td> <td ...</td> </tr> <tr role="row" id="3" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr"> <td ...</td> <td ...</td> </tr> <tr role="row" id="4" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr"> <td ...</td> <td ...</td> </tr> <tr role="row" id="5" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr"> <td ...</td> <td ...</td> </tr> <tr role="row" id="6" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr"> <td ...</td> <td ...</td> </tr> <tr role="row" id="7" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr"> <td role="gridcell" style="padding-left:10px" title="Material" aria-describedby="list-table_">Material</td> <td role="gridcell" style="" title="600D polyester." aria-describedby="list-table_">600D polyester.</td> </tr> <tr ...> </tr> </tbody> </table> </div>
I would like to get "600D Polyester" as a result.
My (not working) code snippet is as is:
Sub ParseMaterial() Dim Cell As Integer Dim ItemNbr As String Dim AElement As Object Dim AElements As IHTMLElementCollection Dim IE As MSXML2.XMLHTTP60 Set IE = New MSXML2.XMLHTTP60 Dim HTMLDoc As MSHTML.HTMLDocument Dim HTMLBody As MSHTML.HTMLBody Set HTMLDoc = New MSHTML.HTMLDocument Set HTMLBody = HTMLDoc.body For Cell = 1 To 5 'I iterate through the file row by row ItemNbr = Cells(Cell, 3).Value 'ItemNbr isin the 3rd Column of my spreadsheet IE.Open "GET", "http://www.example.com/?item=" & ItemNbr, False IE.send While IE.ReadyState <> 4 DoEvents Wend HTMLBody.innerHTML = IE.responseText Set AElements = HTMLDoc.getElementById("list-table").getElementsByTagName("tr") For Each AElement In AElements If AElement.Title = "Material" Then Cells(Cell, 14) = AElement.nextNode.value 'I write the material in the 14th column End If Next AElement Application.Wait (Now + TimeValue("0:00:2")) Next Cell
Thanks for your help !
-
Tdev over 9 yearsThanks again for your answer IAmDranged! Unfortunately this time it doesn't provide the expected output. Here is the considered website with a specific product: pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/… These errors occur: * "Object variable not set (Error 91) " at the line "Set tableCells = table.getElementsByTagName("td")" * "Type mismatch (Error 13)" at the line "For Each tableCell In tds" I tried to replace 'table.getElementsByName("td")' by 'doc.' and 'tds' by 'tableCells'. It then run without error but nothing happens.
-
IAmDranged over 9 yearsThis is because the table doens't exist in the html source - so table returns Nothing. The table is actually requested and returned as an aside xml document when the url is loaded into a browser - which would be sourced from this url in the specific example you are taking above pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/…. This url looks to take a product id, standard paramters, and a seemingly redundant "nd" parameter. You can target this url with proper parameters instead of the source url
-
IAmDranged over 9 yearsI have added some more code above to exemplify how you can work your way around - hope this helps
-
Tdev over 9 yearsIt perfectly works ! I tried to trigger the "Onclick" event but without success. I didn't notice that the website could be opened this way. Thanks a lot IAmDranged!
-
IAmDranged over 9 yearsNo problem. To be able to properly manage the Onclick event, you need to be working in an adequate environment - like a browser-like environment. You can do this by automating IE for instance - I have added some more code showing how this can be done
-
Tdev over 9 yearsThank you I noticed that I have to 'browse' the website ith IE prior to fire this event. It works smoothly the only matter is about the language. When I'm fetching different items through my spreadsheet (on the same website, with the same logic) the language changes from item to another and then 'Material' changes to 'Materiaal' (dutch) or 'Matériau' (french). I tried to add "&world=general" in the URL but it doesn't work. Do you have any idea on how to 'fix' the language to English ?
-
Dani Aya over 7 yearsYou will need reference Microsoft XML, v 6.0 in order to use XMLHTTP60 objects.