How to work around Groovy's XmlSlurper refusing to parse HTML due to DOCTYPE and DTD restrictions?

html groovy xmlslurper

11,262

Solution 1

Tsk.

parser=new XmlSlurper()
parser.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false) 
parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
parser.parse(it)

Solution 2

Even though your HTML also happens to be well-formed XML, a more general solution for parsing HTML is use a true HTML parser. I've used the TagSoup parser in the past, and it handles real-world HTML quite well.

TagSoup provides a parser that implements the javax.xml.parsers.SAXParser interface and can be provided to XmlSlurper in the constructor. Example:

@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')

import org.ccil.cowan.tagsoup.Parser

def doc = new XmlSlurper(new Parser()).parse("index.html")

11,262

android.weasel

Updated on September 16, 2022

Comments

android.weasel over 1 year

I'm trying to copy an element in an HTML coverage report, so the coverage totals appear at the top of the report as well as the bottom.

The HTML starts thus and I believe is well-formed:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
    <link rel="stylesheet" href=".resources/report.css" type="text/css" />
    <link rel="shortcut icon" href=".resources/report.gif" type="image/gif" />
    <title>Unified coverage</title>
    <script type="text/javascript" src=".resources/sort.js"></script>
  </head>
  <body onload="initialSort(['breadcrumb', 'coveragetable'])">

Groovy's XmlSlurper complains as follows:

doc = new XmlSlurper( /* false, false, false */ ).parse("index.html")
[Fatal Error] index.html:1:48: DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.

Enabling DOCTYPE:

doc = new XmlSlurper(false, false, true).parse("index.html")
[Fatal Error] index.html:1:148: External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

doc = new XmlSlurper(false, true, true).parse("index.html")
[Fatal Error] index.html:1:148: External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.


doc = new XmlSlurper(true, true, true).parse("index.html")
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

doc = new XmlSlurper(true, false, true).parse("index.html")
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

So I think I've covered all the options. There must be a way to get this working without resorting to regexps and risking the wrath of Tony The Pony.