How to change HTML tag content in Java?

20,424

Solution 1

Unless you are absolutely sure that the HTML will be valid and well formed, I'd strongly recommend using an HTML parser such as TagSoup, Jericho, NekoHTML or HTML Parser. The first two are especially good at parsing any kind of crap :)

For example, with HTML Parser (chosen here because the implementation is very simple), you can use a visitor and provide your own NodeVisitor:

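import org.htmlparser.Text;
import org.htmlparser.visitors.NodeVisitor;

public class MyNodeVisitor extends NodeVisitor {
    // Called for every text node encountered during the traversal.
    @Override
    public void visitStringNode(Text string) {
        // Replace the text only when it matches exactly.
        if (string.getText().equals("**text**")) {
            string.setText("**new text**");
        }
    }
}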

Then, create a Parser, parse the HTML string and visit the returned node list:

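// Uses org.htmlparser.Parser and org.htmlparser.util.NodeList;
// parsing and visiting can throw org.htmlparser.util.ParserException.
Parser parser = new Parser(htmlString);
NodeList nl = parser.parse(null);            // null filter: keep every node
nl.visitAllNodesWith(new MyNodeVisitor());
System.out.println(nl.toHtml());             // serialize the modified tree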

This is just one way to implement it, and it is pretty straightforward.

Solution 2

Provided that your HTML is well-formed XML (if it is not, you can use JTidy to tidy it up), you can parse it with a DOM or SAX parser. DOM is probably easier if your document is not huge.

Something like this will do the trick if your text is the only child of a node with id="id":

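Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
// Note: getElementById only matches attributes declared as type ID (e.g. by a DTD).
Element e = d.getElementById("id");
Node text = e.getFirstChild();
// process(...) stands for your own text transformation.
text.setNodeValue(process(text.getNodeValue()));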

You may save d afterwards to a file.
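
As a rough sketch of the whole round trip (parse, change the text, serialize), the snippet below uses only the JDK's DOM and Transformer APIs. The class name DomTextReplace, the method replaceText and its parameters are made up for illustration. It assumes well-formed input and that the target element starts with a text node. The element is found by scanning for a matching id attribute, because getElementById returns null when nothing declares the attribute as an ID.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class DomTextReplace {

    // Replaces the leading text of the first element whose id attribute equals `id`.
    static String replaceText(String xhtml, String id, String newText) throws Exception {
        Document d = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Scan all elements instead of relying on getElementById (no DTD here).
        Element target = null;
        NodeList all = d.getElementsByTagName("*");
        for (int i = 0; i < all.getLength() && target == null; i++) {
            Element e = (Element) all.item(i);
            if (id.equals(e.getAttribute("id"))) {
                target = e;
            }
        }
        if (target == null) {
            throw new IllegalArgumentException("No element with id " + id);
        }

        Node text = target.getFirstChild();
        if (text == null || text.getNodeType() != Node.TEXT_NODE) {
            throw new IllegalStateException("Element does not start with a text node");
        }
        text.setNodeValue(newText);

        // Serialize the modified DOM back to a string.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(d), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String in = "<html><head></head><body>"
                + "<div>text<div id=\"target\">**text**</div>text</div>"
                + "</body></html>";
        System.out.println(replaceText(in, "target", "**new text**"));
    }
}
```

To write to a file instead of a string, a StreamResult built from a java.io.File can be passed to transform.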

Author by

bugisoft

Updated on July 09, 2022

Comments

  • bugisoft
    bugisoft almost 2 years

    How can I change the HTML content of a tag in Java? For example:

    before:

    <html>
        <head>
        </head>
        <body>
            <div>text<div>**text**</div>text</div>
        </body>
    </html>
    

    after:

    <html>
        <head>
        </head>
        <body>
            <div>text<div>**new text**</div>text</div>
        </body>
    </html>
    

    I tried JTidy, but it doesn't support getTextContent. Is there any other solution?


    Thanks. I want to parse HTML that is not well formed. I tried TagSoup, but consider this code:

    <body>
    sometext <div>text</div>
    </body>
    

    I want to change "sometext" to "someAnotherText". When I call {bodyNode}.getTextContent(), it returns "sometext text". When I call setTextContent("someAnotherText" + {bodyNode}.getTextContent()) and serialize the structure, the result is <body>someAnotherText sometext text</body>, without the <div> tags. This is a problem for me.
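
    The <div> tags disappear because setTextContent replaces all children of a node with a single text node. One possible way around this, sketched here with the JDK's own DOM classes, is to edit only the text nodes that are direct children of the element. The names DirectTextEdit and replaceDirectText are hypothetical, and the example feeds in well-formed input. For HTML that is not well formed, a lenient parser would first have to produce the DOM tree, which this sketch does not cover.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class DirectTextEdit {

    // Replaces oldText with newText in the direct text children of the first <tag> element.
    static String replaceDirectText(String xml, String tag, String oldText, String newText)
            throws Exception {
        Document d = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        Node parent = d.getElementsByTagName(tag).item(0);
        if (parent == null) {
            throw new IllegalArgumentException("No <" + tag + "> element found");
        }

        // Touch only text nodes; child elements such as <div> stay in place.
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE) {
                n.setNodeValue(n.getNodeValue().replace(oldText, newText));
            }
        }

        // Serialize the modified DOM back to a string.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(d), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String in = "<body>\nsometext <div>text</div>\n</body>";
        System.out.println(replaceDirectText(in, "body", "sometext", "someAnotherText"));
    }
}
```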