How to parse non-well formatted HTML with XmlSlurper
Solution 1
With the following piece of code, the page is parsed well (without errors):
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
def parser = new SAXParser()
def page = new XmlSlurper(parser).parse('http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz')
However, I don't know exactly which elements you'd like to find. Here the All mirrors
link is found:
page.depthFirst().find {
it.text() == 'All mirrors'
}.@href
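The same depthFirst() pattern can be exercised offline on a small fragment (the fragment and its /mirrors href below are made up for illustration). One thing to keep in mind: NekoHTML reports element names in upper case by default, which matters if you ever match on it.name():

```groovy
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper

// Hypothetical fragment standing in for the real page.
def html = '<html><body><p>Mirrors:</p><a href="/mirrors">All mirrors</a></body></html>'
def doc = new XmlSlurper(new SAXParser()).parseText(html)

// depthFirst() walks every node, so find { ... } locates the link by its text.
def link = doc.depthFirst().find { it.text() == 'All mirrors' }
assert link.@href == '/mirrors'
assert link.name() == 'A' // NekoHTML upper-cases element names by default
```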
EDIT
Both outputs are null.
println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}
EDIT 2
Below you can find a working example that downloads the file and parses it correctly. I used wget
to download the file (there's something wrong with downloading it with Groovy; at this point I didn't know what).
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'
def temp = File.createTempFile('eclipse', 'tmp')
temp.deleteOnExit()
def cmd = ['wget', host, '-O', temp.absolutePath].execute()
cmd.waitFor()
assert cmd.exitValue() == 0 // make sure the download actually succeeded
def parser = new SAXParser()
def page = new XmlSlurper(parser).parseText(temp.text)
println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}
EDIT 3
And finally the problem is solved. Groovy's url.toURL().text
causes problems when no User-Agent
header is specified. With the header set, it works correctly and the elements are found - no external tools needed.
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'
def parser = new SAXParser()
def page = new XmlSlurper(parser).parseText(host.toURL().getText(requestProperties: ['User-Agent': 'Non empty']))
assert page.depthFirst().find { it.text() == 'North America'}
assert page.depthFirst().find { it.text().contains('North America')}
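Equivalently, the request can be made through a URLConnection, which also lets you set timeouts. This is a sketch: the User-Agent value and timeout numbers are arbitrary, any non-empty header value worked here:

```groovy
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper

def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'
def conn = new URL(host).openConnection()
conn.setRequestProperty('User-Agent', 'Mozilla/5.0') // any non-empty value
conn.connectTimeout = 10000 // milliseconds, arbitrary
conn.readTimeout = 10000

def page = new XmlSlurper(new SAXParser()).parseText(conn.inputStream.getText('UTF-8'))
assert page.depthFirst().find { it.text().contains('North America') }
```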
Solution 2
I am fond of the TagSoup SAX parser, which says it's designed to parse "poor, nasty and brutish" HTML.
It can be used in conjunction with XmlSlurper
quite easily:
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
def parser = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
def page = parser.parse('http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz')
println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}
This results in non-null output.
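One gotcha when switching between the two parsers: TagSoup lower-cases element names, while NekoHTML's default is upper-case, which matters whenever you match on it.name(). A small offline sketch (the fragment is made up):

```groovy
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
import groovy.util.XmlSlurper

def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
def doc = slurper.parseText('<H4>North America</H4>')

// TagSoup normalizes element names to lower case, so match on 'h4', not 'H4'.
assert doc.depthFirst().find { it.name() == 'h4' }?.text() == 'North America'
```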
allprog
Updated on June 07, 2022
Comments
- allprog, almost 2 years ago:
I'm trying to parse a non-well-formatted HTML page (the Eclipse download site) with XmlSlurper. The W3C validator shows several errors in the page.
I tried the fault-tolerant parser from this post:
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
// Getting the xhtml page thanks to Neko SAX parser
def mirrors = new XmlSlurper(new SAXParser()).parse("http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz")
mirrors.'**'
Unfortunately, it looks like not all content is parsed into the XML object. The faulty subtrees are simply ignored.
E.g.
page.depthFirst().find { it.text() == 'North America'}
returns null
instead of the H4 element in the page. Is there some robust way to parse any HTML content in Groovy?