Retrieve data from a website via Visual Basic

How to scrape a website using HTMLAgilityPack (VB.Net)

I agree that HtmlAgilityPack is the easiest way to accomplish this. It is less error-prone than parsing the HTML with Regex. Here is how I approach scraping.

Create a new application and add HtmlAgilityPack to it via NuGet (or download HtmlAgilityPack.dll and add a reference to it). If you can use Chrome, its inspector will show you where your information is located in the page. Right-click a value you wish to capture, choose Inspect, and look for the table that contains it (follow the HTML up a bit).

The following example extracts all the values within the "pricing" table on that page. We need the XPath for the table, because that is what tells HtmlAgilityPack which nodes to look for. To get it, find the element your values sit in, right-click it in the inspector, and choose Copy XPath. From this we get...

//*[@id="pricing"]

Please note that the XPath you get from Chrome may sometimes be rather long. You can often simplify it by finding something unique about the table your values are in. In this example it is the id attribute, but in other situations it could just as easily be a heading, a class, or some other attribute.

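This XPath looks for any element with an id equal to pricing, which is our table. When we look further in, we see that our values are within tbody, tr and td tags. Browsers add a tbody element to the DOM even when the page source does not contain one, while HtmlAgilityPack parses the source as it is served, so an XPath that includes tbody often matches nothing. Leave it out. Our new XPath is...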

//*[@id='pricing']/tr/td

This XPath says: find the element with the id pricing, then select the td cells inside its tr rows. Now we add the code...

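Dim Web As New HtmlAgilityPack.HtmlWeb()
' Load returns the parsed document, so there is no need to create one with New first.
Dim Doc As HtmlAgilityPack.HtmlDocument = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
' SelectNodes returns Nothing when the XPath matches no nodes, so check before looping.
Dim Cells As HtmlAgilityPack.HtmlNodeCollection = Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
If Cells IsNot Nothing Then
    For Each table As HtmlAgilityPack.HtmlNode In Cells

    Next
End If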

To extract the values, we simply read the InnerText property of the table variable that our loop declares.

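Dim Web As New HtmlAgilityPack.HtmlWeb()
' Load returns the parsed document, so there is no need to create one with New first.
Dim Doc As HtmlAgilityPack.HtmlDocument = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
' SelectNodes returns Nothing when the XPath matches no nodes, so check before looping.
Dim Cells As HtmlAgilityPack.HtmlNodeCollection = Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
If Cells IsNot Nothing Then
    For Each table As HtmlAgilityPack.HtmlNode In Cells
        ' Show the text of each td cell.
        MsgBox(table.InnerText)
    Next
End If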

Now we have message boxes that pop up the values. You can swap the message box for an ArrayList, a List(Of String), or whatever other way you wish to store the values. Then simply do the same for any other tables you wish to read.
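
As a rough sketch of storing the values instead of popping them up, the code below collects the cell text into a List(Of String). So that it runs without a network connection, it parses a small made-up HTML string with LoadHtml rather than calling Web.Load. The sample markup, the module name PricingSketch and the helper ReadCells are assumptions for illustration only and do not come from the real page.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module PricingSketch
    ' Hypothetical sample markup standing in for the real page (assumption).
    Const SampleHtml As String =
        "<table id='pricing'>" &
        "<tr><td>1</td><td>0.54</td></tr>" &
        "<tr><td>10</td><td>0.48</td></tr>" &
        "</table>"

    ' Returns the trimmed text of every node matched by xpath (empty list if none).
    Function ReadCells(doc As HtmlDocument, xpath As String) As List(Of String)
        Dim values As New List(Of String)
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes(xpath)
        ' SelectNodes returns Nothing when nothing matches, so guard before looping.
        If nodes IsNot Nothing Then
            For Each cell As HtmlNode In nodes
                ' DeEntitize turns entities such as &amp; back into plain characters.
                values.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim())
            Next
        End If
        Return values
    End Function

    Sub Main()
        Dim doc As New HtmlDocument()
        doc.LoadHtml(SampleHtml)
        For Each value As String In ReadCells(doc, "//*[@id='pricing']/tr/td")
            Console.WriteLine(value)
        Next
    End Sub
End Module
```

Returning an empty list when nothing matches keeps the calling code free of Nothing checks; to use it on a live page, pass the document returned by Web.Load instead.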

Please note that the Doc variable is reusable, so if you want to read a different table on the same page, you do not have to reload the page. This is a good idea especially if you are making many requests, because you don't want to slam the website. If you are automating a large number of scrapes, also put some time between requests.
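
To illustrate reusing one parsed document, here is a small sketch that parses once and then runs several XPath queries against the same HtmlDocument. It works on a made-up HTML string so it can run offline; the second table id (specs), the sample values, the module name ReuseSketch and the helper CountMatches are all assumptions for illustration.

```vbnet
Imports System
Imports HtmlAgilityPack

Module ReuseSketch
    ' Hypothetical page with two tables; the ids and values are made up (assumption).
    Const SampleHtml As String =
        "<table id='pricing'><tr><td>1</td><td>0.54</td></tr></table>" &
        "<table id='specs'><tr><td>Color</td><td>Black</td></tr></table>"

    ' Counts the nodes matched by xpath, treating "no match" (Nothing) as zero.
    Function CountMatches(doc As HtmlDocument, xpath As String) As Integer
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes(xpath)
        If nodes Is Nothing Then Return 0
        Return nodes.Count
    End Function

    Sub Main()
        ' Parse once...
        Dim doc As New HtmlDocument()
        doc.LoadHtml(SampleHtml)
        ' ...then run as many queries as needed against the same document.
        Console.WriteLine(CountMatches(doc, "//*[@id='pricing']/tr/td"))
        Console.WriteLine(CountMatches(doc, "//*[@id='specs']/tr/td"))
        ' A browser-style path with tbody finds nothing here, since the markup has no tbody.
        Console.WriteLine(CountMatches(doc, "//*[@id='pricing']/tbody/tr/td"))
    End Sub
End Module
```

The last query is there only to show what a path copied straight from the browser does when the served markup has no tbody element.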

Scraping is really that easy. That is the basic idea. Have fun!

Author: Jackery Xu

Updated on November 27, 2020

Comments

  • Jackery Xu
    Jackery Xu over 3 years

    There is this website that we purchase widgets from that provides details for each of their parts on its own webpage. Example: http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND. I have to find all of their parts that are in our database, and add Manufacturer and Manufacturer Part Number values to their fields.

    I was told that there is a way for Visual Basic to access a webpage and extract information. If someone could point me in the right direction on where to start, I'm sure I can figure this out.

    Thanks.

  • Jackery Xu
    Jackery Xu over 11 years
    Hi there, thanks for spending so much time helping me out. The only problem is that none of the tags around the value have any IDs, so I can't really use SelectNodes.
  • MonkeyDoug
    MonkeyDoug over 11 years
    OK, here goes. To find a value that has no discernible identifier, try this. Highlight the "1" in the price break column of the page you listed in your initial question, right-click, and choose Inspect Element. You will see the matching HTML highlighted below; right-click it and copy the XPath. Paste that into Notepad; it should look like //*[@id="pricing"]/tbody/tr[2]/td[1] (to be continued).
  • Jackery Xu
    Jackery Xu over 11 years
    <span itemprop="name">Assmann WSW Components</span> I want to retrieve this value, but I can't seem to use itemprop the way you use id. I also don't see an XPath that starts with //*
  • MonkeyDoug
    MonkeyDoug over 11 years
    You will notice that the XPath we got has some index notation, such as tr[2] and td[1]. These give each element's position within its parent, counting from 1; by changing these values you reach other cells of the table. Now remove the tbody, since with it the query will not return results, paste the path into your SelectNodes call, and change the " to '
  • MonkeyDoug
    MonkeyDoug over 11 years
    To get the name, use "//span[@itemprop='name']"
  • MonkeyDoug
    MonkeyDoug over 11 years
    HtmlAgilityPack.HtmlWeb Web = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument Doc = new HtmlAgilityPack.HtmlDocument();
    Doc = Web.Load("digikey.ca/product-search/…);
    foreach (HtmlAgilityPack.HtmlNode table in Doc.DocumentNode.SelectNodes("//a[@itemprop='url']"))
    {
        Interaction.MsgBox(table.InnerText);
    }
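
For the itemprop case discussed in the comments above, a minimal VB.NET sketch might look like the following. It uses SelectSingleNode, since only one value is wanted, and parses an inline string so it runs offline. The span is the one quoted in the comments; the wrapping div, the module name ItempropSketch and the helper ReadFirst are assumptions for illustration.

```vbnet
Imports System
Imports HtmlAgilityPack

Module ItempropSketch
    ' Returns the text of the first node matched by xpath, or Nothing if there is none.
    Function ReadFirst(doc As HtmlDocument, xpath As String) As String
        Dim node As HtmlNode = doc.DocumentNode.SelectSingleNode(xpath)
        If node Is Nothing Then Return Nothing
        Return HtmlEntity.DeEntitize(node.InnerText).Trim()
    End Function

    Sub Main()
        Dim doc As New HtmlDocument()
        ' The span is the one quoted in the comments; the wrapping div is made up.
        doc.LoadHtml("<div><span itemprop=""name"">Assmann WSW Components</span></div>")
        Console.WriteLine(ReadFirst(doc, "//span[@itemprop='name']"))
    End Sub
End Module
```

Any attribute can be matched this way, not only id, which is why the absence of ids on the surrounding tags is not a blocker.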