XPath/HtmlAgilityPack: How to find an element (a) with a specific value for an attribute (href) and find adjacent table columns?

c# html visual-studio xpath html-agility-pack

13,585

Solution 1

Use the following XPath expression:

   /*/tr/td[a[@href='url-a']]
                /following-sibling::td[1]
                     /a/text()

When evaluated against the provided (malformed but corrected) XML document:

<table><tr>
<td><a href="url-a">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>
<td><a href="url-b">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>
<td><a href="url-c">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>
</tr></table>

the wanted text node is selected:

id A

Similarly, this XPath expression:

   /*/tr/td[a[@href='url-a']]
                /following-sibling::td[2]
                     /a/text()

when evaluated against the same XML document (above), selects the other wanted text node:

img A
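For anyone who wants to check these two expressions from C#, here is a minimal sketch that evaluates them with the .NET `System.Xml` classes rather than Html Agility Pack. It assumes the markup has already been corrected to well-formed XML, as above. The names `XPathDemo` and `SelectText` are made up for this illustration.

```csharp
using System;
using System.Xml;

static class XPathDemo
{
    // The corrected, well-formed table used in this answer.
    const string Xml =
        "<table><tr>" +
        "<td><a href=\"url-a\">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>" +
        "<td><a href=\"url-b\">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>" +
        "<td><a href=\"url-c\">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>" +
        "</tr></table>";

    // Returns the value of the first node selected by xpath, or null when nothing matches.
    public static string SelectText(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        XmlNode node = doc.SelectSingleNode(xpath);
        return node?.Value;
    }

    static void Main()
    {
        // prints "id A"
        Console.WriteLine(SelectText("/*/tr/td[a[@href='url-a']]/following-sibling::td[1]/a/text()"));
        // prints "img A"
        Console.WriteLine(SelectText("/*/tr/td[a[@href='url-a']]/following-sibling::td[2]/a/text()"));
    }
}
```

Because `SelectSingleNode` returns null when the href is not present, the caller can tell "no such link" apart from a real value without a separate existence check.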

XSLT-based verification:

When this transformation is applied to the XML document (above):

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/tr/td[a[@href='url-a']]
                /following-sibling::td[1]
                     /a/text()"/>

  <xsl:text>&#10;</xsl:text>
  <xsl:copy-of select=
   "/*/tr/td[a[@href='url-a']]
                /following-sibling::td[2]
                     /a/text()"/>
 </xsl:template>
</xsl:stylesheet>

the wanted results are produced:

id A
img A
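If you would like to repeat this verification from C# instead of a standalone XSLT processor, the sketch below runs a stylesheet with `XslCompiledTransform` from `System.Xml.Xsl`. It is only one way to do it. The file names `verify.xslt` and `table.xml` and the helper name `XsltCheck.Transform` are assumptions for the example, not part of the original answer.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

static class XsltCheck
{
    // Applies the stylesheet text to the XML text and returns the serialized result.
    public static string Transform(string xslt, string xml)
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(xslt)))
        {
            transform.Load(xsltReader);
        }

        using (var xmlReader = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            // No stylesheet parameters are needed, so the argument list is null.
            transform.Transform(xmlReader, null, output);
            return output.ToString();
        }
    }

    static void Main()
    {
        // Hypothetical file names: save the stylesheet and the XML document above under them.
        string result = Transform(File.ReadAllText("verify.xslt"), File.ReadAllText("table.xml"));
        Console.WriteLine(result);
    }
}
```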

Solution 2

You have seriously broken HTML with mismatched closing td tags. Please fix them. This markup is just an ugly picture.

That being said, Html Agility Pack can hopefully handle any crap that you throw at it, so here's how to parse the junk you have and find the id and img values given the href:

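using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load("test.html");

        // Find the first link whose href contains the known part of the URL.
        var anchor = doc.DocumentNode.SelectSingleNode("//a[contains(@href, 'url-a')]");
        if (anchor != null)
        {
            // Starting at the link's parent td, take the first link in a following td (the id).
            var id = anchor.ParentNode.SelectSingleNode("following-sibling::td/a");
            if (id != null)
            {
                Console.WriteLine(id.InnerHtml);

                // Repeat the same step from the id's td to reach the img cell.
                var img = id.ParentNode.SelectSingleNode("following-sibling::td/a");
                if (img != null)
                {
                    Console.WriteLine(img.InnerHtml);
                }
            }
        }
    }
}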
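Since the asker only has part of the URL, the two lookups can also be folded into one expression that uses `contains()`. The sketch below is a hedged illustration, not the code from either answer: it runs against the corrected XML with `System.Xml` so that it needs no extra package, and the names `PartialHrefDemo` and `FindIdAndImg` are invented for the example. Passing the same XPath string to Html Agility Pack's `SelectNodes` should behave similarly, but that is not verified here.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class PartialHrefDemo
{
    // Returns the link text of the (up to) two cells that follow the cell whose
    // link href contains hrefPart. An empty list means no such link exists.
    public static List<string> FindIdAndImg(XmlDocument doc, string hrefPart)
    {
        // Assumption: hrefPart contains no single quote, because it is spliced into the expression.
        string xpath = "//td[a[contains(@href, '" + hrefPart + "')]]" +
                       "/following-sibling::td[position() <= 2]/a";

        var results = new List<string>();
        foreach (XmlNode link in doc.SelectNodes(xpath))
        {
            results.Add(link.InnerText);
        }
        return results;
    }

    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<table><tr>" +
            "<td><a href=\"url-a\">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>" +
            "<td><a href=\"url-b\">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>" +
            "</tr></table>");

        // "rl-a" is only a fragment of the href; prints "id A" and then "img A".
        foreach (string value in FindIdAndImg(doc, "rl-a"))
        {
            Console.WriteLine(value);
        }
    }
}
```

Note that a fragment matching several links (for example `url`) selects the following cells of every match, so the fragment has to be specific enough to identify one link.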


Gernony

Updated on June 04, 2022

Comments

Gernony almost 2 years
I'm pretty desperate because I can't figure out how to achieve what I stated in the question. I've already read countless similar examples but didn't find one that works in my exact situation. So, let's say I have the following code:
```
<table><tr>
<td><a href="url-a">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>
<td><a href="url-b">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>
<td><a href="url-c">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>
</tr></table>
```
Now, what I already have is a part of url-a. I basically want to know how I can get id A and img A. I'm trying to "find" the line with XPath but I can't work out a way to make it work. Also, it might be possible that the information is not present at all. This is my latest try (seriously, I've tinkered with this for more than 3 hours now trying numerous different ways):
```
if (htmlDoc.DocumentNode.SelectSingleNode(@"/a[contains(@href, 'part-url-a')]") != null)
    string ida = htmlDoc.DocumentNode.SelectSingleNode(@"/a[contains(@href, 'part-url-a')]/following-sibling::a").InnerText;
```
Well, it's apparently wrong as hell, so I'd be very pleased if someone could help me out here. I'd also appreciate it if someone could point me to a website that explains XPath and its notation/syntax in detail, with examples like this one. Books are also welcome.

PS: I know I could achieve my goal without XPath at all, with Regex or just a simple StreamReader in C# that checks whether each line contains what I need, but a) that's too fragile for my needs because the code might have abrupt line breaks and b) I really want to stay consistent and stick completely to XPath for everything I'm doing in this project.

Thanks in advance for your help!
Dimitre Novatchev over 12 years

Good question, +1. See my answer for the exact XPath expressions that select the wanted text nodes.
Dimitre Novatchev over 12 years

@Darin Dimitrov: The wanted text nodes can be selected with a single XPath expression (irrespective of the programming language that is hosting XPath) -- see my answer.
Darin Dimitrov over 12 years

@Dimitre Novatchev, wow, you are a real XPath guru :-) That's really great. It looks like Chinese to me but if it works it's really nice.
Dimitre Novatchev over 12 years

@Darin Dimitrov: Yes, it works, as demonstrated by the accompanying XSLT-based verification. While XPath is elegant and powerful, it isn't especially difficult. You might be interested in my XPath Visualizer, which I wrote years ago. It has helped many thousands of programmers learn XPath the fun way - just by playing with different XPath expressions and incrementally improving their results. Link: huttar.net/dimitre/XPV/TopXML-XPV.html
Darin Dimitrov over 12 years

@Dimitre Novatchev, while it looks interesting, XPath is not something that I use in my everyday code. I prefer to avoid it due to my ignorance of it :-) This being said, I really admire XPath gurus, the same way I admire Regex gurus. Never really grasped those notions. I have only basic understanding of them and when needed I prefer to use some full blown parser that will do the job for me and avoid having to write and especially maintain code that contains them.
Darin Dimitrov over 12 years

By the way I have just tested your XPath expressions and they work. Hat down to you.
Dimitre Novatchev over 12 years

@Darin Dimitrov: You are welcome. Should you ever need to understand or construct XPath expressions, don't hesitate to ask me.
Darin Dimitrov over 12 years

@Dimitre Novatchev, thanks, I won't hesitate. I know who are the gurus on SO in the different fields.
Gernony over 12 years

Okay uhm I'll try what you wrote. The code above is not the "junk" I have but just an example I quickly wrote to save you guys reading through some huge code.
Gernony over 12 years

I'll try your solution as well and report back how it worked out. Thanks.
Gernony over 12 years

Ok, I needed to make a few adjustments (e.g. because I only had a part of the url and not the full one to match) but all in all it worked like a charm! Thanks a lot. Not only did it help me with this issue but I finally understood how complex XPath syntax actually works in practice. I'll also take a look at your XPath Visualizer, guess that'll be exactly what I need :-)