How to parse non-well formatted HTML with XmlSlurper


Solution 1

The following piece of code parses the page without errors:

@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14') 
import org.cyberneko.html.parsers.SAXParser 
import groovy.util.XmlSlurper

def parser = new SAXParser()
def page = new XmlSlurper(parser).parse('http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz')

However, I don't know exactly which elements you'd like to find.

Here, the All mirrors link is found:

page.depthFirst().find { 
    it.text() == 'All mirrors'
}.@href
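The same depthFirst().find { ... }.@href pattern can be checked offline against a small hand-written snippet. Note the HTML below is an invented stand-in for the Eclipse page, not its real markup:

```groovy
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper

// Invented stand-in for the relevant part of the download page
def html = '<html><body>Mirrors: <a href="/mirrors.php">All mirrors</a></body></html>'
def page = new XmlSlurper(new SAXParser()).parseText(html)

// Find the element whose full text is exactly 'All mirrors' and read its href attribute
def href = page.depthFirst().find { it.text() == 'All mirrors' }.@href
assert href.toString() == '/mirrors.php'
```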

EDIT

Both of the following lookups print null:

println page.depthFirst().find { it.text() == 'North America'}

println page.depthFirst().find { it.text().contains('North America')}

EDIT 2

Below you can find a working example that downloads the file and parses it correctly. I used wget to download the file (at this point, downloading it directly with Groovy failed for a reason I couldn't pin down).

@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14') 
import org.cyberneko.html.parsers.SAXParser 
import groovy.util.XmlSlurper

def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'
def temp = File.createTempFile('eclipse', 'tmp')
temp.deleteOnExit()

def cmd = ['wget', host, '-O', temp.absolutePath].execute()
cmd.waitFor()
assert cmd.exitValue() == 0

def parser = new SAXParser()
def page = new XmlSlurper(parser).parseText(temp.text)

println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}

EDIT 3

And finally the problem is solved. Using Groovy's url.toURL().text causes problems when no User-Agent header is specified. With the header set it works correctly and the elements are found - no external tools used.

@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14') 
import org.cyberneko.html.parsers.SAXParser 
import groovy.util.XmlSlurper

def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'

def parser = new SAXParser()
def page = new XmlSlurper(parser).parseText(host.toURL().getText(requestProperties: ['User-Agent': 'Non empty']))

assert page.depthFirst().find { it.text() == 'North America'}
assert page.depthFirst().find { it.text().contains('North America')}

Solution 2

I am fond of the TagSoup SAX parser, whose documentation says it's designed to parse "poor, nasty and brutish" HTML.

It can be used in conjunction with XmlSlurper quite easily:

@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
def parser = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())

def page = parser.parse('http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz')

println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}    

This results in non-null output.
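TagSoup's tag-balancing can also be seen offline with a deliberately broken snippet; the markup below is made up for illustration:

```groovy
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
import groovy.util.XmlSlurper

// Made-up malformed HTML: unclosed <li> and <p> tags
def html = '<html><body><ul><li>one<li>two</ul><p>trailing</body></html>'
def page = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(html)

// TagSoup closes the dangling tags, so both list items are reachable
def items = page.depthFirst().findAll { it.name() == 'li' }*.text()
assert items == ['one', 'two']
```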

Author: allprog

Updated on June 07, 2022

Comments

  • allprog, almost 2 years ago

    I'm trying to parse a non-well-formed HTML page (the Eclipse download site) with XmlSlurper. The W3C validator shows several errors in the page.

    I tried the fault-tolerant parser from this post:

    @Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
    import org.cyberneko.html.parsers.SAXParser 
    import groovy.util.XmlSlurper
    
    // Getting the xhtml page thanks to Neko SAX parser 
    def mirrors = new XmlSlurper(new SAXParser()).parse("http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz")    
    
    mirrors.'**'
    

    Unfortunately, it looks like not all of the content makes it into the parsed XML object. The faulty subtrees are simply ignored.

    E.g. page.depthFirst().find { it.text() == 'North America'} returns null instead of the H4 element in the page.

    Is there some robust way to parse any HTML content in groovy?