How to search for comments ("<!-- -->") using Jsoup?

Solution 1

Searching in Jsoup normally means calling Elements.select(selector), where the selector follows the syntax defined by the Jsoup selector API. Comments are technically not elements, which can be confusing, but they are still nodes, identified by the node name #comment.

Let's see how that might work:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class RemoveComments {
    public static void main(String... args) {
        String h = "<html><head></head><body>" +
          "<div><!-- foo --><p>bar<!-- baz --></div><!--qux--></body></html>";
        Document doc = Jsoup.parse(h);
        removeComments(doc);
        doc.html(System.out);
    }

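    // Walks the tree depth-first and removes every comment node.
    private static void removeComments(Node node) {
        for (int i = 0; i < node.childNodeSize();) {
            Node child = node.childNode(i);
            if (child.nodeName().equals("#comment")) {
                // After a removal the next sibling moves to index i, so do not advance.
                child.remove();
            } else {
                removeComments(child);
                i++;
            }
        }
    }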
}
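
If the goal is to find the comments rather than remove them, the same recursive walk can collect them instead. The following is only a minimal sketch, assuming Jsoup is on the classpath; the class name CommentFinder and the method findComments are made up for this illustration and are not part of Jsoup. It relies on Node.childNodes() and Comment.getData() and returns the text of each comment in document order.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

// Hypothetical helper for illustration; not part of the Jsoup API.
class CommentFinder {

    // Returns the text of every comment below root, in document order.
    static List<String> findComments(Node root) {
        List<String> found = new ArrayList<>();
        collect(root, found);
        return found;
    }

    // Depth-first walk: record comments, descend into everything else.
    private static void collect(Node node, List<String> found) {
        for (Node child : node.childNodes()) {
            if (child instanceof Comment) {
                found.add(((Comment) child).getData());
            } else {
                collect(child, found);
            }
        }
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><!-- foo --><p>bar<!-- baz --></div><!--qux-->");
        for (String comment : findComments(doc)) {
            System.out.println("[" + comment + "]");
        }
    }
}
```

Because nothing is removed during the walk, a plain for-each over childNodes() is enough here; the index bookkeeping is only needed when the tree is modified.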

Solution 2

With Jsoup 1.11+ (possibly older versions too) you can apply a filter:

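import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeFilter;

private void removeComments(Element article) {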
    article.filter(new NodeFilter() {
        @Override
        public FilterResult tail(Node node, int depth) {
            if (node instanceof Comment) {
                return FilterResult.REMOVE;
            }
            return FilterResult.CONTINUE;
        }

        @Override
        public FilterResult head(Node node, int depth) {
            if (node instanceof Comment) {
                return FilterResult.REMOVE;
            }
            return FilterResult.CONTINUE;
        }
    });
}

Solution 3

Based on @dlamblin's answer (https://stackoverflow.com/a/7541875/4712855), this code replaces each comment with the HTML it contains:

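import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Node;

// Replaces every comment with the markup it contains, i.e. "uncomments" it.
public static void getHtmlComments(Node node) {
    for (int i = 0; i < node.childNodeSize();) {
        Node child = node.childNode(i);
        if (child.nodeName().equals("#comment")) {
            Comment comment = (Comment) child;
            // Parse the comment's content as HTML and insert it after the comment.
            child.after(comment.getData());
            child.remove();
            // Do not advance: whatever followed the comment now sits at index i.
        } else {
            getHtmlComments(child);
            i++;
        }
    }
}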
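
Removing children while walking them by index is easy to get wrong: after a removal the next child slides into the current index, so advancing the index at that point skips it. The sketch below shows the difference with a plain ArrayList of node names standing in for a child list. It does not use Jsoup at all, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in demo on a list of node names; not Jsoup code.
class IndexLoopDemo {

    // Advances the index on every pass, even right after a removal.
    static List<String> removeAlwaysAdvancing(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        for (int i = 0; i < copy.size(); i++) {
            if (copy.get(i).equals("#comment")) {
                copy.remove(i); // the next entry slides into index i and is skipped
            }
        }
        return copy;
    }

    // Advances the index only when the current entry was kept.
    static List<String> removeAdvancingOnKeep(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        for (int i = 0; i < copy.size();) {
            if (copy.get(i).equals("#comment")) {
                copy.remove(i);
            } else {
                i++;
            }
        }
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("#comment", "#comment", "p");
        System.out.println(removeAlwaysAdvancing(names)); // [#comment, p]
        System.out.println(removeAdvancingOnKeep(names)); // [p]
    }
}
```

With two adjacent comments, the always-advancing loop leaves the second one behind, which is why a loop that removes nodes should increment the index only in the branch that keeps the node.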

Solution 4

This is a variation of the first example using a functional programming approach. The easiest way to find all comments that are immediate children of the current node is to use .filter() on a stream of .childNodes(). Recursing into the child elements then covers the rest of the tree.

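public void removeComments(Element e) {
    // Collect the comments first, then remove them once the stream has finished.
    e.childNodes().stream()
        .filter(n -> n.nodeName().equals("#comment"))
        .collect(Collectors.toList())
        .forEach(n -> n.remove());
    // Recurse into the child elements.
    e.children().forEach(elem -> removeComments(elem));
}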

Full example:

package demo;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Demo {

    public static void removeComments(Element e) {
        e.childNodes().stream()
            .filter(n -> n.nodeName().equals("#comment"))
            .collect(Collectors.toList())
            .forEach(n -> n.remove());
        e.children().forEach(elem -> removeComments(elem));
    }

    // Streams and lambdas require JDK 8 or later.
    public static void main(String[] args) throws MalformedURLException, IOException {
        // Fetch and parse the page; the second argument is the timeout in milliseconds.
        Document doc = Jsoup.parse(new URL("https://en.wikipedia.org/"), 500);

        // Save the page as fetched.
        String userHome = System.getProperty("user.home");
        PrintStream out = new PrintStream(new FileOutputStream(userHome + File.separator + "before.html"));
        out.print(doc.outerHtml());
        out.close();

        // Save it again with all comments removed.
        removeComments(doc);
        out = new PrintStream(new FileOutputStream(userHome + File.separator + "after.html"));
        out.print(doc.outerHtml());
        out.close();
    }

}

Author by

87element

Just me. SAP Stuff Java SE/EE/ME, Android C# .NET, Windows Phone 7

Updated on June 04, 2022

Comments

  • 87element
    87element almost 2 years

    I would like to remove those tags with their content from source HTML.

  • dlamblin
    dlamblin almost 7 years
    If you can get a 6-year-old version of Jsoup, it worked back then. Otherwise, if the API has been updated, I welcome fixes to update this example. It looks like the childNodes List<Node> was made unmodifiable in some version.
  • Robert Hanson
    Robert Hanson about 6 years
    Using JSoup 1.11.3 and Groovy... the only change I needed to make to get it to work was to change childNodesSize() to childNodeSize().
  • dlamblin
    dlamblin almost 5 years
    @RobertHanson Thanks, that might have been a transcription error.