How to search for comments ("<!-- -->") using Jsoup?
Solution 1
When searching, you normally use Elements.select(selector), where selector is defined by the Jsoup selector API. Comments, however, are technically not elements, which can be confusing here. They are still nodes, identified by the node name #comment.
Let's see how that might work:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class RemoveComments {
    public static void main(String... args) {
        String h = "<html><head></head><body>" +
                "<div><!-- foo --><p>bar<!-- baz --></div><!--qux--></body></html>";
        Document doc = Jsoup.parse(h);
        removeComments(doc);
        doc.html(System.out);
    }

    private static void removeComments(Node node) {
        for (int i = 0; i < node.childNodeSize();) {
            Node child = node.childNode(i);
            if (child.nodeName().equals("#comment"))
                // do not advance i: the next sibling shifts into this index
                child.remove();
            else {
                removeComments(child);
                i++;
            }
        }
    }
}
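Since the question asks how to search for comments rather than only remove them, the same recursion can collect them instead. This is a minimal sketch, not part of the original answer: the class name CommentFinder and the method findComments are made up for illustration, and it uses instanceof Comment in place of the node name check.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

// Hypothetical helper; the names are illustrative and not from the answer.
final class CommentFinder {

    // Walks the tree depth-first and returns the text of every comment node.
    static List<String> findComments(Node root) {
        List<String> found = new ArrayList<>();
        collect(root, found);
        return found;
    }

    private static void collect(Node node, List<String> found) {
        for (Node child : node.childNodes()) {
            if (child instanceof Comment) {
                // getData() is the text between "<!--" and "-->"
                found.add(((Comment) child).getData());
            } else {
                collect(child, found);
            }
        }
    }

    public static void main(String[] args) {
        String h = "<html><head></head><body>"
                + "<div><!-- foo --><p>bar<!-- baz --></div><!--qux--></body></html>";
        Document doc = Jsoup.parse(h);
        // Print each comment's text on its own line.
        findComments(doc).forEach(System.out::println);
    }
}
```

Nothing is modified here, so a plain for-each over childNodes() is safe. If you need the Comment nodes themselves (for example to remove or replace them later), collect the nodes rather than their data.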
Solution 2
With Jsoup 1.11+ (and possibly older versions), you can apply a NodeFilter:
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeFilter;

private void removeComments(Element article) {
    article.filter(new NodeFilter() {
        @Override
        public FilterResult tail(Node node, int depth) {
            if (node instanceof Comment) {
                return FilterResult.REMOVE;
            }
            return FilterResult.CONTINUE;
        }

        @Override
        public FilterResult head(Node node, int depth) {
            if (node instanceof Comment) {
                return FilterResult.REMOVE;
            }
            return FilterResult.CONTINUE;
        }
    });
}
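For anyone who wants to try the filter approach on its own, here is a small runnable sketch. The class name CommentFilterDemo and the method stripComments are hypothetical, and the sample HTML is borrowed from Solution 1. It returns REMOVE only from head(), on the assumption that a node removed there never reaches tail(); if your Jsoup version behaves differently, keep the check in both callbacks as shown above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeFilter;

// Hypothetical demo class; the names are illustrative only.
final class CommentFilterDemo {

    // Parses the HTML and removes every comment node via a NodeFilter.
    static Document stripComments(String html) {
        Document doc = Jsoup.parse(html);
        doc.filter(new NodeFilter() {
            @Override
            public FilterResult head(Node node, int depth) {
                // Assumption: removing on the way in is sufficient.
                return node instanceof Comment ? FilterResult.REMOVE : FilterResult.CONTINUE;
            }

            @Override
            public FilterResult tail(Node node, int depth) {
                return FilterResult.CONTINUE;
            }
        });
        return doc;
    }

    public static void main(String[] args) {
        String h = "<html><head></head><body>"
                + "<div><!-- foo --><p>bar<!-- baz --></div><!--qux--></body></html>";
        System.out.println(stripComments(h).html());
    }
}
```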
Solution 3
Based on @dlamblin's answer (https://stackoverflow.com/a/7541875/4712855), this code takes the HTML inside each comment and inserts it into the document in place of the comment:
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Node;

public static void getHtmlComments(Node node) {
    for (int i = 0; i < node.childNodeSize(); i++) {
        Node child = node.childNode(i);
        if (child.nodeName().equals("#comment")) {
            Comment comment = (Comment) child;
            // parse the comment's text as HTML and insert it after the comment
            child.after(comment.getData());
            child.remove();
        }
        else {
            getHtmlComments(child);
        }
    }
}
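To see what this "un-commenting" does, here is a hedged usage sketch. The class name CommentUnwrapper, the method unwrapComments, and the sample HTML are made up for illustration. Unlike the loop above, it iterates over a copy of the child list, so the nodes inserted by after() do not shift the positions being visited.

```java
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

// Hypothetical helper; the names are illustrative and not from the answer.
final class CommentUnwrapper {

    // Replaces every comment with the nodes parsed from its text.
    static void unwrapComments(Node node) {
        // Copy first: after() and remove() change the live child list.
        for (Node child : new ArrayList<>(node.childNodes())) {
            if (child instanceof Comment) {
                child.after(((Comment) child).getData());
                child.remove();
            } else {
                unwrapComments(child);
            }
        }
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><!-- <p>hidden</p> --></div>");
        unwrapComments(doc);
        // The <p> that was commented out is now a real child of the <div>.
        System.out.println(doc.body().html());
    }
}
```

Be careful with untrusted input: whatever markup was hidden inside a comment becomes live markup in the document.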
Solution 4
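This is a variation of the first example using a functional programming approach. The easiest way to find all comments that are immediate children of the current node is to call .filter() on a stream of .childNodes(). The matches are collected into a list before removal, so the child list is not modified while it is being streamed; the method then recurses into the child elements.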
public void removeComments(Element e) {
    e.childNodes().stream()
        .filter(n -> n.nodeName().equals("#comment")).collect(Collectors.toList())
        .forEach(n -> n.remove());
    e.children().forEach(elem -> removeComments(elem));
}
Full example:
package demo;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Demo {
    public static void removeComments(Element e) {
        e.childNodes().stream()
            .filter(n -> n.nodeName().equals("#comment")).collect(Collectors.toList())
            .forEach(n -> n.remove());
        e.children().forEach(elem -> removeComments(elem));
    }

    public static void main(String[] args) throws MalformedURLException, IOException {
        Document doc = Jsoup.parse(new URL("https://en.wikipedia.org/"), 500);
        // do not try this with JDK < 8
        String userHome = System.getProperty("user.home");
        PrintStream out = new PrintStream(new FileOutputStream(userHome + File.separator + "before.html"));
        out.print(doc.outerHtml());
        out.close();
        removeComments(doc);
        out = new PrintStream(new FileOutputStream(userHome + File.separator + "after.html"));
        out.print(doc.outerHtml());
        out.close();
    }
}
87element
Updated on June 04, 2022
Comments
-
87element almost 2 years
I would like to remove those tags with their content from source HTML.
-
dlamblin almost 7 years
If you can get a 6 year old version of Jsoup, it worked back then. Otherwise, if the API is updated, I welcome fixes to update this example. It looks like the childNodes List<Node> was made unmodifiable in some version.
-
Robert Hanson about 6 years
Using JSoup 1.11.3 and Groovy... the only change I needed to make to get it to work was to change childNodesSize() to childNodeSize().
-
dlamblin almost 5 years
@RobertHanson Thanks, that might have been a transcription error.