Parse HTML content in VBA

79,794

Just a couple things that hopefully will get you in the right direction:

  • clean up a bit: remove the readystate property testing loop. The value returned by the readystate property will never change in this context - code will pause after the send instruction, to resume only once the server response is received, or has failed to do so. The readystate property will be set accordingly, and the code will resume execution. You should still test for the ready state, but the loop is just unnecessary

  • target the right HTML elements: you are searching through the tr elements - while the logic of how you use these elements in your code actually looks to point to td elements

  • make sure the properties are actually available for the objects you are using them on: to help you with this, try and declare all your variable as specific objects instead of the generic Object. This will activate intellisense. If you have a difficult time finding the actual name of your object as defined in the relevant library in a first place, declare it as the generic Object, run your code, and then inspect the type of the object - by printing typename(your_object) to the debug window for instance. This should put you on your way

I have also included some code below that may help. If you still can't get this to work and you can share your urls - plz do that.

Sub getInfoWeb()

    Dim cell As Integer
    Dim xhr As MSXML2.XMLHTTP60
    Dim doc As MSHTML.HTMLDocument
    Dim table As MSHTML.HTMLTable
    Dim tableCells As MSHTML.IHTMLElementCollection
    
    Set xhr = New MSXML2.XMLHTTP60
   
    For cell = 1 To 5
        
        ItemNbr = Cells(cell, 3).Value
        
        With xhr
        
            .Open "GET", "http://www.example.com/?item=" & ItemNbr, False
            .send
            
            If .readyState = 4 And .Status = 200 Then
                Set doc = New MSHTML.HTMLDocument
                doc.body.innerHTML = .responseText
            Else
                MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
                vbNewLine & "HTTP request status: " & .Status
            End If
            
        End With
        
        Set table = doc.getElementById("list-table")
        Set tableCells = table.getElementsByTagName("td")
        
        For Each tableCell In tableCells
            If tableCell.getAttribute("title") = "Material" Then
                Cells(cell, 14).Value = tableCell.NextSibling.innerHTML
            End If
        Next tableCell
    
    Next cell
    
End Sub

EDIT: as a follow-up to the further information you provided in the comment below - and the additional comments I have added

'Determine your product number
    'Open an xhr for your source url, and retrieve the product number from there - search for the tag which
    'text include the "productnummer:" substring, and extract the product number from the outerstring
    'OR
    'if the product number consistently consists of the fctkeywords you are entering in your source url
    'with two "0" appended - just build the product number like that
'Open an new xhr for this url "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=" & product_number & "&_search=false&rows=-1&page=1&sidx=&sord=asc"
'Load the response in an XML document, and retrieve the material information

Sub getInfoWeb()

    Dim xhr As MSXML2.XMLHTTP60
    Dim doc As MSXML2.DOMDocument60
    Dim xmlCell As MSXML2.IXMLDOMElement
    Dim xmlCells As MSXML2.IXMLDOMNodeList
    Dim materialValueElement As MSXML2.IXMLDOMElement
    
    Set xhr = New MSXML2.XMLHTTP60
        
        With xhr
            
            .Open "GET", "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=10031700&_search=false&rows=-1&page=1&sidx=&sord=asc", False
            .send
            
            If .readyState = 4 And .Status = 200 Then
                Set doc = New MSXML2.DOMDocument60
                doc.LoadXML .responseText
            Else
                MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
                vbNewLine & "HTTP request status: " & .Status
            End If
            
        End With
        
        Set xmlCells = doc.getElementsByTagName("cell")

        For Each xmlCell In xmlCells
            If xmlCell.Text = "Materiaal" Then
                Set materialValueElement = xmlCell.NextSibling
            End If
        Next
        
        MsgBox materialValueElement.Text
    
End Sub

EDIT2: an alternative automating IE

Sub searchWebViaIE()
    Dim ie As SHDocVw.InternetExplorer
    Dim doc As MSHTML.HTMLDocument
    Dim anchors As MSHTML.IHTMLElementCollection
    Dim anchor As MSHTML.HTMLAnchorElement
    Dim prodSpec As MSHTML.HTMLAnchorElement
    Dim tableCells As MSHTML.IHTMLElementCollection
    Dim materialValueElement As MSHTML.HTMLTableCell
    Dim tableCell As MSHTML.HTMLTableCell
    
    Set ie = New SHDocVw.InternetExplorer
    
    With ie
        .navigate "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2facetmain.p?fctkeywords=100317&world=general#tabs-4"
        .Visible = True
        
        Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
            DoEvents
        Loop
        
        Set doc = .document
        
        Set anchors = doc.getElementsByTagName("a")
        
        For Each anchor In anchors
            If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
                anchor.Click
                Exit For
            End If
        Next anchor
        
        Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
            DoEvents
        Loop
    
    End With
        
    For Each anchor In anchors
        If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
            Set prodSpec = anchor
        End If
    Next anchor
    
    Set tableCells = doc.getElementById("list-table").getElementsByTagName("td")
    
    If Not tableCells Is Nothing Then
        For Each tableCell In tableCells
            If tableCell.innerHTML = "Materiaal" Then
                Set materialValueElement = tableCell.NextSibling
            End If
        Next tableCell
    End If
    
    MsgBox materialValueElement.innerHTML
    
End Sub
Share:
79,794
Tdev
Author by

Tdev

Updated on January 05, 2022

Comments

  • Tdev
    Tdev over 2 years

    I have a question relating to HTML parsing. I have a website with some products and I would like to catch text within page into my current spreadsheet. This spreadsheet is quite big but contains ItemNbr in 3rd column, I expect the text in the 14th column and one row corresponds to one product (item).

    My idea is to fetch the 'Material' on the webpage which is inside the Innertext after tag. The id number changes from one page to page (sometimes ).

    Here is the structure of the website:

    <div style="position:relative;">
        <div></div>
        <table id="list-table" width="100%" tabindex="1" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_list-table" class="ui-jqgrid-btable" style="width: 930px;">
            <tbody>
                <tr class="jqgfirstrow" role="row" style="height:auto">
                    <td ...</td>
                    <td ...</td>
                </tr>
                <tr role="row" id="1" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                    <td ...</td>
                    <td ...</td>
                </tr>
                <tr role="row" id="2" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                    <td ...</td>
                    <td ...</td>
                </tr>
                <tr role="row" id="3" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                    <td ...</td>
                    <td ...</td>
                </tr>
                <tr role="row" id="4" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                    <td ...</td>
                    <td ...</td>
                </tr>
                <tr role="row" id="5" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                    <td ...</td>
                    <td ...</td>
                </tr>
                <tr role="row" id="6" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                    <td ...</td>
                    <td ...</td>
                </tr>
                <tr role="row" id="7" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                    <td role="gridcell" style="padding-left:10px" title="Material" aria-describedby="list-table_">Material</td>
                    <td role="gridcell" style="" title="600D polyester." aria-describedby="list-table_">600D polyester.</td>
                </tr>           
                <tr ...>
                </tr>
            </tbody>
        </table> </div>
    

    I would like to get "600D Polyester" as a result.

    My (not working) code snippet is as is:

    Sub ParseMaterial()
    
        Dim Cell As Integer
        Dim ItemNbr As String
    
        Dim AElement As Object
        Dim AElements As IHTMLElementCollection
    Dim IE As MSXML2.XMLHTTP60
    Set IE = New MSXML2.XMLHTTP60
    
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim HTMLBody As MSHTML.HTMLBody
    
    Set HTMLDoc = New MSHTML.HTMLDocument
    Set HTMLBody = HTMLDoc.body
    
    For Cell = 1 To 5                            'I iterate through the file row by row
    
        ItemNbr = Cells(Cell, 3).Value           'ItemNbr isin the 3rd Column of my spreadsheet
    
        IE.Open "GET", "http://www.example.com/?item=" & ItemNbr, False
        IE.send
    
        While IE.ReadyState <> 4
            DoEvents
        Wend
    
        HTMLBody.innerHTML = IE.responseText
    
        Set AElements = HTMLDoc.getElementById("list-table").getElementsByTagName("tr")
        For Each AElement In AElements
            If AElement.Title = "Material" Then
                Cells(Cell, 14) = AElement.nextNode.value     'I write the material in the 14th column
            End If
        Next AElement
    
            Application.Wait (Now + TimeValue("0:00:2"))
    
    Next Cell
    

    Thanks for your help !

  • Tdev
    Tdev over 9 years
    Thanks again for your answer IAmDranged! Unfortunately this time it doesn't provide the expected output. Here is the considered website with a specific product: pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/… These errors occur: * "Object variable not set (Error 91) " at the line "Set tableCells = table.getElementsByTagName("td")" * "Type mismatch (Error 13)" at the line "For Each tableCell In tds" I tried to replace 'table.getElementsByName("td")' by 'doc.' and 'tds' by 'tableCells'. It then run without error but nothing happens.
  • IAmDranged
    IAmDranged over 9 years
    This is because the table doens't exist in the html source - so table returns Nothing. The table is actually requested and returned as an aside xml document when the url is loaded into a browser - which would be sourced from this url in the specific example you are taking above pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/…. This url looks to take a product id, standard paramters, and a seemingly redundant "nd" parameter. You can target this url with proper parameters instead of the source url
  • IAmDranged
    IAmDranged over 9 years
    I have added some more code above to exemplify how you can work your way around - hope this helps
  • Tdev
    Tdev over 9 years
    It perfectly works ! I tried to trigger the "Onclick" event but without success. I didn't notice that the website could be opened this way. Thanks a lot IAmDranged!
  • IAmDranged
    IAmDranged over 9 years
    No problem. To be able to properly manage the Onclick event, you need to be working in an adequate environment - like a browser-like environment. You can do this by automating IE for instance - I have added some more code showing how this can be done
  • Tdev
    Tdev over 9 years
    Thank you I noticed that I have to 'browse' the website ith IE prior to fire this event. It works smoothly the only matter is about the language. When I'm fetching different items through my spreadsheet (on the same website, with the same logic) the language changes from item to another and then 'Material' changes to 'Materiaal' (dutch) or 'Matériau' (french). I tried to add "&world=general" in the URL but it doesn't work. Do you have any idea on how to 'fix' the language to English ?
  • Dani Aya
    Dani Aya over 7 years
    You will need reference Microsoft XML, v 6.0 in order to use XMLHTTP60 objects.