HTML Agility Pack Select Nodes


Your first problem: the commented-out SelectNodes call doesn't work because 'id' is not an element name, it's an attribute name. You've already used the correct syntax for selecting on an attribute and comparing its value in your other expressions, e.g. //ElementName[@attributeName='value']. Note that the @ is required: without it, [attributeName='value'] tests a child element named attributeName, not an attribute.
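To make the attribute predicate concrete, here is a minimal sketch. The class and method names are made up for illustration, and I'm assuming the seller images sit somewhere under an element with id 'bucketnew', as the commented-out line in the question suggests. I have not run this against the real page.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HTML Agility Pack.
public static class AttributePredicateSketch
{
    // Returns the alt text of every img found under the element with id 'bucketnew'.
    public static List<string> AltTextsUnderBucket(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // [@id='bucketnew'] tests the id attribute of an element.
        // //id['bucketnew'] would instead look for an element named <id>.
        HtmlNodeCollection imgs =
            doc.DocumentNode.SelectNodes("//*[@id='bucketnew']//img[@alt]");

        var result = new List<string>();
        if (imgs == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode img in imgs)
        {
            result.Add(img.GetAttributeValue("alt", string.Empty));
        }
        return result;
    }
}
```

You would call it as AttributePredicateSketch.AltTextsUnderBucket(result) with the page source you already downloaded.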

The syntax inside the SelectNodes function is called "XPath". This link might help you out.

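The seller expression in your loop, .//img/@alt, looks below the node for the current iteration for an img and then steps to its alt attribute. SelectSingleNode hands back an HtmlNode rather than an attribute value, though, so I think the syntax you want is just .//img[@alt], which selects the img element that has an alt attribute.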
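A rough sketch of selecting relative to the node for the current iteration follows. The helper name is hypothetical and the plain //tr row selector is only a stand-in for whatever expression matches your rows.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HTML Agility Pack.
public static class RelativeSelectSketch
{
    // For each tr, returns the first descendant img that has an alt attribute,
    // or null when the row has no such image.
    public static List<HtmlNode> FirstImagePerRow(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var images = new List<HtmlNode>();
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null)
        {
            return images; // no rows matched
        }

        foreach (HtmlNode row in rows)
        {
            // The leading "." keeps the search inside this row.
            // "//img[@alt]" without it would search the whole document again.
            images.Add(row.SelectSingleNode(".//img[@alt]"));
        }
        return images;
    }
}
```

The null entries matter for your case: a row whose seller is shown as text instead of an image will come back as null here, so check before touching its attributes.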

The next problem, where you say it won't compile: check the error message; it is probably complaining about argument types. sellers is a List<string>, so sellers.Add expects a string, but the expression inside the Add returns an HtmlAttribute. I think reading its Value property, i.e. Attributes["alt"].Value, should give you the string you are after.
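Putting that together, something along these lines might work for building the seller list. This is only a sketch: the class name is invented, the row XPath is shortened from the one in the question, and falling back to the row's trimmed InnerText for text-only sellers is my assumption about the markup, not something I have checked against the page.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HTML Agility Pack.
public static class SellerListSketch
{
    // Collects one seller name per result row as a plain string.
    public static List<string> SellerNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sellers = new List<string>();
        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//tbody[@class='result']/tr");
        if (rows == null)
        {
            return sellers; // no result rows found
        }

        foreach (HtmlNode row in rows)
        {
            HtmlNode img = row.SelectSingleNode(".//img[@alt]");
            if (img != null)
            {
                // Attributes["alt"] is an HtmlAttribute; .Value is the string.
                sellers.Add(img.Attributes["alt"].Value);
            }
            else
            {
                // Assumed fallback: the seller name appears as text in the row.
                sellers.Add(row.InnerText.Trim());
            }
        }
        return sellers;
    }
}
```

The same pattern should carry over to the prices: select ".//span[@class='price']" relative to each row and add its InnerText to the prices list.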

Also, check out the HTML Agility Pack docs and other questions regarding XPath syntax.

Author by

Reg

Updated on October 22, 2020

Comments

  • Reg
    Reg over 3 years

    I am trying to use the HTML Agility Pack to scrape some data from a site. I am really struggling to figure out how to use SelectNodes inside a foreach and then export the data to a list or array.

    Here is the code I am working with so far.

           string result = string.Empty;
    
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.amazon.com/gp/offer-listing/B002UYSHMM/");
            request.Method = "GET";
    
            using (var stream = request.GetResponse().GetResponseStream())
            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                result = reader.ReadToEnd();
            }
    
            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.Load(new StringReader(result));
            HtmlNode root = doc.DocumentNode;
    
            string itemdesc = doc.DocumentNode.SelectSingleNode("//h1[@class='producttitle']").InnerText;  //this works perfectly to get the title of the item
            //HtmlNodeCollection sellers = doc.DocumentNode.SelectNodes("//id['bucketnew']/div/table/tbody/tr/td/ul/a/img/@alt");//this does not work at all in getting the alt attribute from the seller images
            HtmlNodeCollection prices = doc.DocumentNode.SelectNodes("//span[@class='price']"); //this works fine getting the prices
            HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='resultsset']/table/tbody[@class='result']/tr"); //this is the code I am working on to try to collect each tr in the result.  I then want to either add each span.price to a list from this and also add each alt attribute from the seller image to a list.  Once I get this working I will want to use an if statement in the case that there is text for the seller name instead of an image.
    
            List<string> sellers = new List<string>();
            List<string> prices = new List<string>();
    
            foreach (HtmlNode node in nodes)
            {
                HtmlNode seller = node.SelectSingleNode(".//img/@alt");  // I am not sure if this works
                sellers.Add(seller.SelectSingleNode("img").Attributes["alt"]); //this definitely does not work and will not compile.
    
            }
    

    I have comments in the code above showing what works and what doesn't and sort of what I want to accomplish.

    If anyone has any suggestions or reading, that would be great! I have been searching forums and examples and have not come across anything that I can use.