Scraping a webpage with C# and HTMLAgility
Solution 1
The beginning part is off:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("http://localhost");
LoadHtml(html) loads an HTML string into the document. I think you want something like this instead:
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load("http://stackoverflow.com");
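To make the difference concrete, here is a minimal sketch that needs no web server. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (LoadHtmlSketch, CountTables) are made up for illustration. Passing a URL to LoadHtml just parses the URL text as markup, so no table is found, which is why nothing seemed to happen in the debugger.

```csharp
using System;
using HtmlAgilityPack;

public static class LoadHtmlSketch
{
    // Parses the given string as markup and counts the <table> elements in it.
    // LoadHtml never makes an HTTP request; it only parses the text it is given.
    public static int CountTables(string markup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup);
        var tables = doc.DocumentNode.SelectNodes("//table");
        // SelectNodes returns null (by default) when nothing matches, so guard for it.
        return tables == null ? 0 : tables.Count;
    }

    public static void Main()
    {
        // The URL is treated as plain text, so there is no table to find.
        Console.WriteLine(CountTables("http://localhost"));   // prints 0

        // Real markup passed as a string is parsed as expected.
        Console.WriteLine(CountTables("<table class='data'><tr><td>Part-Num</td></tr></table>"));   // prints 1

        // To fetch a live page, use HtmlWeb.Load with the URL instead:
        // HtmlDocument page = new HtmlWeb().Load("http://localhost");
    }
}
```

LoadHtml is still the right call when you already have the markup in a string, for example when testing against a saved copy of the page.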
Solution 2
Here is working code for the HTML source you provided. It could be refactored, and I'm not checking for null values (in rows, cells, or the values read inside each case). If the page is served at 127.0.0.1, it will work as is. Just paste it inside the Main method of a Console Application and step through it to understand it.
// Download and parse the page
HtmlDocument doc = new HtmlWeb().Load("http://127.0.0.1");
// Every row of the table with class "data"
var rows = doc.DocumentNode.SelectNodes("//table[@class='data']/tr");
foreach (var row in rows)
{
    var cells = row.SelectNodes("./td");
    string title = cells[0].InnerText;   // first cell holds the label
    var valueRow = cells[2];             // third cell holds the value
    switch (title)
    {
        // For these two rows the value is the alt text of an <img>
        case "Part-Num":
            string partNum = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
            Console.WriteLine("Part-Num:\t" + partNum);
            break;
        case "Manu-Number":
            string manuNumber = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
            Console.WriteLine("Manu-Num:\t" + manuNumber);
            break;
        // The remaining rows hold plain text
        case "Description":
            string description = valueRow.InnerText;
            Console.WriteLine("Description:\t" + description);
            break;
        case "Manu-Country":
            string manuCountry = valueRow.InnerText;
            Console.WriteLine("Manu-Country:\t" + manuCountry);
            break;
        case "Last Modified":
            string lastModified = valueRow.InnerText;
            Console.WriteLine("Last Modified:\t" + lastModified);
            break;
        case "Last Modified By":
            string lastModifiedBy = valueRow.InnerText;
            Console.WriteLine("Last Modified By:\t" + lastModifiedBy);
            break;
    }
}
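As mentioned, the code can be factored and the null checks added. One possible shape is sketched below, again assuming the HtmlAgilityPack NuGet package; PartTableSketch and ReadRows are hypothetical names, and the sample markup is a shortened copy of the table from the question so that it runs without a server. Instead of one case per label, it stores every row in a dictionary and uses the alt text whenever the value cell contains an image. It also uses //tr rather than /tr, which is my own change so that a table wrapped in tbody would still match.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PartTableSketch
{
    // Returns label -> value for each three-cell row of the table with class "data".
    // When the value cell contains an <img> with an alt attribute, the alt text is the value.
    public static Dictionary<string, string> ReadRows(HtmlDocument doc)
    {
        var result = new Dictionary<string, string>();
        var rows = doc.DocumentNode.SelectNodes("//table[@class='data']//tr");
        if (rows == null) return result;   // no matching table or rows

        foreach (var row in rows)
        {
            var cells = row.SelectNodes("./td");
            if (cells == null || cells.Count < 3) continue;   // skip rows with an unexpected shape

            string label = HtmlEntity.DeEntitize(cells[0].InnerText).Trim();
            var img = cells[2].SelectSingleNode(".//img[@alt]");
            string value = img != null
                ? img.GetAttributeValue("alt", "")
                : HtmlEntity.DeEntitize(cells[2].InnerText).Trim();
            result[label] = value;
        }
        return result;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table class=\"data\">" +
            "<tr><td>Part-Num</td><td width=\"50\"></td><td> <img src=\"/partcode/number/072140\" alt=\"072140\"/></td></tr>" +
            "<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>" +
            "</table>");

        foreach (var pair in ReadRows(doc))
            Console.WriteLine(pair.Key + ":\t" + pair.Value);
    }
}
```

For the live page, replace the LoadHtml call with new HtmlWeb().Load("http://127.0.0.1") and pass the resulting document to ReadRows.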
JRB
Updated on April 26, 2021
Comments
JRB about 3 years
I have read that HTMLAgility 1.4 is a great solution for scraping a webpage. Being a new programmer, I am hoping I could get some input on this project. I am doing this as a C# application form. The page I am working with is fairly straightforward. The information I need is stuck between just 2 tags, <table class="data"> and </table>.
My goal is to pull the data for Part-Num, Manu-Number, Description, Manu-Country, Last Modified, and Last Modified By out of the page and send the data to a SQL table.
One twist is that there is also a small PNG picture that also needs to be grabbed from the src="/partcode/number.
I do not have any completed code that works. I thought this bit of code would tell me if I am heading in the right direction. Even stepping through it in the debugger, I can't see that it does anything. Could someone possibly point me in the right direction on this? The more detailed the better, since it is apparent I have a lot to learn.
Thank you, I would really appreciate it.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using System.Xml;

namespace Stats
{
    class PartParser
    {
        static void Main(string[] args)
        {
            try
            {
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml("http://localhost"); //My understanding this reads the entire page in?
                var tables = doc.DocumentNode.SelectNodes("//table"); // I assume that this sets up the search for words containing table
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                Console.WriteLine(ex.StackTrace);
                Console.ReadKey();
            }
        }
    }
}
The web code is:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
<title>Part Number Database: Item Record</title>
<table class="data">
<tr><td>Part-Num</td><td width="50"></td><td> <img src="/partcode/number/072140" alt="072140"/></td></tr>
<tr><td>Manu-Number</td><td width="50"></td><td> <img src="/partcode/manu/00721408" alt="00721408" /></td></tr>
<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>
<tr><td>Manu-Country</td><td></td><td>United States</td></tr>
<tr><td>Last Modified</td><td></td><td>26 Jan 2009, 8:08 PM</td></tr>
<tr><td>Last Modified By</td><td></td><td>Manu</td></tr>
</table>
<head/>
</html>