Using XmlSlurper: How to select sub-elements while iterating over a GPathResult

html parsing groovy xmlslurper

11,096

Solution 1

Replace grep with find:

html.'**'.find { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

then you'll get

#href1: Here is the addressTelephone number: telephone

#href2: Here is another addressAnother telephone: 0845 1111111

grep returns an ArrayList, whereas find returns a single NodeChild:

println html.'**'.grep { it.@class == 'divclass' }.getClass()
println html.'**'.find { it.@class == 'divclass' }.getClass()

results in:

class java.util.ArrayList
class groovy.util.slurpersupport.NodeChild

So if you want to keep grep, you can nest another each for it to work:

html.'**'.grep { it.@class == 'divclass' }.ol.li.each {
    it.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
}

Long story short, in your case, use find rather than grep.
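
The difference between the two methods is easier to see on a plain list. This is only a minimal sketch with made-up numbers, not the HTML from the question: grep collects every match into a list, while find hands back the first match itself, or null when nothing matches.

```groovy
// Illustrative data only; not taken from the question.
def numbers = [1, 2, 3, 4]

def allEven = numbers.grep { it % 2 == 0 }    // every match, wrapped in a List
def firstEven = numbers.find { it % 2 == 0 }  // the first match itself
def missing = numbers.find { it > 10 }        // no match at all

assert allEven == [2, 4]
assert allEven instanceof List
assert firstEven == 2
assert missing == null
```

Because find can return null, a chain such as find { ... }.ol.li would fail with a NullPointerException if no element matched, so the safe-navigation operator (?.) may be worth considering when the class is not guaranteed to be present.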

Solution 2

This is a tricky one. When there is just one element with class='divclass', the previous answer is fine. If grep could return multiple results, though, a find() for a single result is not the answer. Pointing out that the result is an ArrayList is correct. Inserting an outer .each() loop provides a GPathResult in the closure parameter div. From there the drill-down can continue with the expected result.

html."**".grep { it.@class == 'divclass' }.each { div -> div.ol.li.each { linkItem ->
   def link = linkItem.h3.a.@href
   def address = linkItem.address.text()
   println "$link: $address\n"
}}

The behavior of the original code deserves a bit more explanation as well. When a property is accessed on a List in Groovy, you get a new list (of the same size) holding that property of each element in the list. The list found by grep() has just one entry. Then we get one entry for the property ol, which is fine. Next we get the result of ol.li for that entry. It is a list of size() == 1 again, but this time with an entry of size() == 2. We could apply the outer loop there and get the same result, if we wanted to:

html."**".grep { it.@class == 'divclass' }.ol.li.each { it.each { linkItem ->
   def link = linkItem.h3.a.@href
   def address = linkItem.address
   println "$link: $address\n"
}}
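
The list behaviour described above does not depend on XmlSlurper. As a rough sketch with plain maps (the names found and items are invented for illustration), property access on a one-element list yields another one-element list, whose single entry holds both values, so one each visits the whole entry once and a nested each is needed to reach the individual values.

```groovy
// Hypothetical stand-in for the grep result: a list with one entry
// whose 'items' property holds two values.
def found = [[items: ['x', 'y']]]

// Property access on a List collects the property from each element.
def items = found.items
assert items == [['x', 'y']]
assert items.size() == 1
assert items[0].size() == 2

// A single each sees the inner list as one item...
def visited = []
items.each { entry -> visited << entry }
assert visited == [['x', 'y']]

// ...while a nested each reaches the individual values.
def flat = []
items.each { entry -> entry.each { value -> flat << value } }
assert flat == ['x', 'y']
```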

On any GPathResult representing multiple nodes, we get the concatenation of all text. That is the original result, first for @href, then for address.
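
The concatenation can be reproduced on a small, well-formed document without TagSoup. The following is a sketch under assumptions: the markup is a simplified, made-up version of the question's HTML, and the import line targets Groovy 3 or later (on Groovy 2.x, where groovy.util.XmlSlurper is imported by default, it should be removed).

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; remove this line on Groovy 2.x

// Hypothetical, simplified markup: well-formed XML, so TagSoup is not needed here.
def xml = '''<root><div class="divclass"><ol>
<li><a href="#href1">one</a><address>First address</address></li>
<li><a href="#href2">two</a><address>Second address</address></li>
</ol></div></root>'''

def root = new XmlSlurper().parseText(xml)
def divs = root.'**'.grep { it.@class == 'divclass' }  // a list with one div

// Property access on the list yields a one-element list holding both <li> nodes.
def lis = divs.ol.li
assert lis.size() == 1
assert lis[0].size() == 2

// Asking that two-node result for text concatenates across both nodes.
assert lis[0].a.@href.text() == '#href1#href2'
assert lis[0].address.text() == 'First addressSecond address'

// Iterating the nodes themselves gives one link/address pair per <li>.
def pairs = []
divs.each { div ->
    div.ol.li.each { li -> pairs << [li.a.@href.text(), li.address.text()] }
}
assert pairs == [['#href1', 'First address'], ['#href2', 'Second address']]
```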

Andrew Whitehouse

Updated on April 15, 2022

Comments

Andrew Whitehouse about 2 years

I am writing an HTML parser, which uses TagSoup to pass a well-formed structure to XmlSlurper.

Here's the generalised code:

def htmlText = """
<html>
<body>
<div id="divId" class="divclass">
<h2>Heading 2</h2>
<ol>
<li><h3><a class="box" href="#href1">href1 link text</a> <span>extra stuff</span></h3><address>Here is the address<span>Telephone number: <strong>telephone</strong></span></address></li>
<li><h3><a class="box" href="#href2">href2 link text</a> <span>extra stuff</span></h3><address>Here is another address<span>Another telephone: <strong>0845 1111111</strong></span></address></li>
</ol>
</div>
</body>
</html>
"""     

def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText( htmlText );

html.'**'.grep { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

I would expect the each to let me select each 'li' in turn so I can retrieve the corresponding href and address details. Instead, I am getting this output:

#href1#href2: Here is the addressTelephone number: telephoneHere is another addressAnother telephone: 0845 1111111

I've checked various examples on the web, and these either deal with XML or are one-liner examples like "retrieve all links from this file". It seems that the it.h3.a.@href expression is collecting all hrefs in the text, even though I'm passing it a reference to the parent 'li' node.

Can you let me know: