Scraping a webpage with C# and HTMLAgility


Solution 1

The beginning part is off:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("http://localhost");   

LoadHtml(html) parses an HTML string that you already have in memory; it does not fetch anything from a URL. To download and parse a page, I think you want something like this instead:

HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc  = htmlWeb.Load("http://stackoverflow.com");
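To see why the original call appears to do nothing, here is a minimal sketch. The class `LoadDemo` and the helper `CountTables` are made-up names for illustration, not part of HtmlAgilityPack. It passes a URL to `LoadHtml`, which parses the URL text itself as markup, so no `<table>` is found. Note that `SelectNodes` can return null rather than an empty collection when nothing matches, which the helper guards against.

```csharp
using System;
using HtmlAgilityPack;

static class LoadDemo
{
    // Hypothetical helper: parses the given string as markup and counts <table> elements.
    public static int CountTables(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses the string itself; no network access happens here

        var tables = doc.DocumentNode.SelectNodes("//table");
        // Guard against a null result when no node matches.
        return tables == null ? 0 : tables.Count;
    }

    static void Main()
    {
        // The URL is treated as plain text, so this finds no tables.
        Console.WriteLine(CountTables("http://localhost"));

        // Real markup is parsed as expected.
        Console.WriteLine(CountTables("<table class='data'><tr><td>x</td></tr></table>"));
    }
}
```

To work against a live page, `HtmlWeb.Load(url)` does the download and returns the parsed document, as shown in the answer above.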

Solution 2

Here is working code based on the HTML source you provided. It could be refactored, and it does not check for null values (in rows, cells, or the values read inside each case). If the page is served at 127.0.0.1, it will work as is. Paste it inside the Main method of a Console Application and step through it to understand what each line does.

HtmlDocument doc = new HtmlWeb().Load("http://127.0.0.1");    

var rows = doc.DocumentNode.SelectNodes("//table[@class='data']/tr");
foreach (var row in rows)
{
    var cells = row.SelectNodes("./td");
    string title = cells[0].InnerText;
    var valueRow = cells[2];
    switch (title)
    {
        case "Part-Num":
            string partNum = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
            Console.WriteLine("Part-Num:\t" + partNum);
            break;
        case "Manu-Number":
            string manuNumber = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
            Console.WriteLine("Manu-Num:\t" + manuNumber);
            break;
        case "Description":
            string description = valueRow.InnerText;
            Console.WriteLine("Description:\t" + description);
            break;
        case "Manu-Country":
            string manuCountry = valueRow.InnerText;
            Console.WriteLine("Manu-Country:\t" + manuCountry);
            break;
        case "Last Modified":
            string lastModified = valueRow.InnerText;
            Console.WriteLine("Last Modified:\t" + lastModified);
            break;
        case "Last Modified By":
            string lastModifiedBy = valueRow.InnerText;
            Console.WriteLine("Last Modified By:\t" + lastModifiedBy);
            break;
    }
}
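The code above skips null checks. As a rough sketch of how they might be added, the version below collects each row into a dictionary and skips anything malformed. The names `PartTableReader` and `ReadFields`, and the " (src)" key suffix, are assumptions made for this example; the sample markup in `Main` is trimmed from the HTML in the question. It also records the `src` of each image, since the question asks for the PNG as well.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class PartTableReader
{
    // Hypothetical helper: returns label -> value for each row of the "data" table.
    public static Dictionary<string, string> ReadFields(HtmlDocument doc)
    {
        var fields = new Dictionary<string, string>();
        var rows = doc.DocumentNode.SelectNodes("//table[@class='data']/tr");
        if (rows == null) return fields; // table missing: nothing to read

        foreach (var row in rows)
        {
            var cells = row.SelectNodes("./td");
            if (cells == null || cells.Count < 3) continue; // skip malformed rows

            string label = cells[0].InnerText.Trim();
            var img = cells[2].SelectSingleNode("./img[@alt]");

            // Image rows carry the value in alt; text rows carry it in the cell text.
            fields[label] = img != null
                ? img.GetAttributeValue("alt", "")
                : cells[2].InnerText.Trim();

            // Keep the image path too, under an illustrative key.
            if (img != null)
                fields[label + " (src)"] = img.GetAttributeValue("src", "");
        }
        return fields;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table class='data'>" +
            "<tr><td>Part-Num</td><td width='50'></td><td>" +
            "<img src='/partcode/number/072140' alt='072140'/></td></tr>" +
            "<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>" +
            "</table>");

        var fields = ReadFields(doc);
        foreach (var pair in fields)
            Console.WriteLine(pair.Key + ":\t" + pair.Value);

        // The src is site-relative; resolve it against the page address before downloading.
        var imageUrl = new Uri(new Uri("http://127.0.0.1"), fields["Part-Num (src)"]);
        Console.WriteLine(imageUrl);
    }
}
```

Returning a dictionary instead of writing to the console keeps the parsing separate from whatever happens next, such as inserting the values into a SQL table.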
Author by

JRB

Updated on April 26, 2021

Comments

  • JRB about 3 years

    I have read that HTMLAgility 1.4 is a great solution for scraping a webpage. Being a new programmer, I am hoping to get some input on this project. I am doing this as a C# application form. The page I am working with is fairly straightforward. The information I need sits between just two tags, <table class="data"> and </table>.

    My goal is to pull the data for Part-Num, Manu-Number, Description, Manu-Country, Last Modified, Last Modified By, out of the page and send the data to a SQL table.

    One twist is that there is also a small PNG picture that needs to be grabbed from the src="/partcode/number.

    I do not have any completed code that works. I thought this bit of code would tell me whether I am heading in the right direction. Even when stepping through it in the debugger, I can’t see that it does anything. Could someone point me in the right direction on this? The more detailed the better, since it is apparent I have a lot to learn.

    Thank you, I would really appreciate it.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using HtmlAgilityPack;
    using System.Xml;
    
    namespace Stats
    {
        class PartParser
        {
            static void Main(string[] args)
            {
                try
                {
                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml("http://localhost");
                    //My understanding this reads the entire page in?
                    var tables = doc.DocumentNode.SelectNodes("//table");
                    // I assume that this sets up the search for words containing table
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex.Message);
                    Console.WriteLine(ex.StackTrace);
                    Console.ReadKey();
                }
            }
        }
    }
    

    The web code is:

    <!DOCTYPE html 
         PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
        <head>
            <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
            <title>Part Number Database: Item Record</title>
            <table class="data">
                <tr><td>Part-Num</td><td width="50"></td><td>
                <img src="/partcode/number/072140" alt="072140"/></td></tr>
                <tr><td>Manu-Number</td><td width="50"></td><td>
                <img src="/partcode/manu/00721408" alt="00721408" /></td></tr>    
                <tr><td>Description</td><td></td><td>Widget 3.5</td></tr>
                <tr><td>Manu-Country</td><td></td><td>United States</td></tr>    
                <tr><td>Last Modified</td><td></td><td>26 Jan 2009,  8:08 PM</td></tr>    
                <tr><td>Last Modified By</td><td></td><td>Manu</td></tr>
            </table>
        </head>
    </html>