How to get node contents from JDOM

16,838

Solution 1

content.getText() gives immediate text which is only useful fine with the leaf elements with text content.

Trick is to use org.jdom.output.XMLOutputter ( with text mode CompactFormat )

public static void main(String[] args) throws Exception {
    SAXBuilder builder = new SAXBuilder();
    String xmlFileName = "a.xml";
    Document doc = builder.build(xmlFileName);

    Element root = doc.getRootElement();
    Element overview = root.getChild("overview");
    Element content = overview.getChild("content");

    XMLOutputter outp = new XMLOutputter();

    outp.setFormat(Format.getCompactFormat());
    //outp.setFormat(Format.getRawFormat());
    //outp.setFormat(Format.getPrettyFormat());
    //outp.getFormat().setTextMode(Format.TextMode.PRESERVE);

    StringWriter sw = new StringWriter();
    outp.output(content.getContent(), sw);
    StringBuffer sb = sw.getBuffer();
    System.out.println(sb.toString());
}

Output

For more info click<a href="page.html">here</a><p>Learn more about the human body. Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>

Do explore other formatting options and modify above code to your need.

"Class to encapsulate XMLOutputter format options. Typical users can use the standard format configurations obtained by getRawFormat() (no whitespace changes), getPrettyFormat() (whitespace beautification), and getCompactFormat() (whitespace normalization). "

Solution 2

Well, maybe that's what you need:

import java.io.StringReader;

import org.custommonkey.xmlunit.XMLTestCase;
import org.custommonkey.xmlunit.XMLUnit;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;
import org.testng.annotations.Test;
import org.xml.sax.InputSource;

public class HowToGetNodeContentsJDOM extends XMLTestCase
{
    private static final String XML = "<root>\n" + 
            "  <program-title>Anatomy &amp; Physiology</program-title>\n" + 
            "  <overview>\n" + 
            "       <content>\n" + 
            "              For more info click <a href=\"page.html\">here</a>\n" + 
            "              <p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>\n" + 
            "       </content>\n" + 
            "  </overview>\n" + 
            "  <key-information>\n" + 
            "     <category>Health &amp; Human Services</category>\n" + 
            "  </key-information>\n" + 
            "</root>";
    private static final String EXPECTED = "For more info click <a href=\"page.html\">here</a>\n" + 
            "<p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>";

    @Test
    public void test() throws Exception
    {
        XMLUnit.setIgnoreWhitespace(true);
        Document document = new SAXBuilder().build(new InputSource(new StringReader(XML)));
        List<Content> content = document.getRootElement().getChild("overview").getChild("content").getContent();
        String out = new XMLOutputter().outputString(content);
        assertXMLEqual("<root>" + EXPECTED + "</root>", "<root>" + out + "</root>");
    }
}

Output:

PASSED: test on instance null(HowToGetNodeContentsJDOM)

===============================================
    Default test
    Tests run: 1, Failures: 0, Skips: 0
===============================================

I am using JDom with generics: http://www.junlu.com/list/25/883674.html

Edit: Actually that's not that much different from Prashant Bhate's answer. Maybe you need to tell us what you are missing...

Solution 3

You could try using method getValue() for the closest approximation, but what this does is concatenate all text within the element and descendants together. This won't give you the <p> tag in any form. If that tag is in your XML like you've shown, it has become part of the XML markup. It'd need to be included as &lt;p&gt; or embedded in a CDATA section to be treated as text.

Alternatively, if you know all elements that either may or may not appear in your XML, you could apply an XSLT transformation that turns stuff which isn't intended as markup into plain text.

Solution 4

If you're also generating the XML file you should be able to encapsulate your html data in <![CDATA[]]> so that it isn't parsed by the XML parser.

Share:
16,838
jeph perro
Author by

jeph perro

programmer/analyst/web developer

Updated on June 18, 2022

Comments

  • jeph perro
    jeph perro almost 2 years

    I'm writing an application in java using import org.jdom.*;

    My XML is valid,but sometimes it contains HTML tags. For example, something like this:

      <program-title>Anatomy &amp; Physiology</program-title>
      <overview>
           <content>
                  For more info click <a href="page.html">here</a>
                  <p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>
           </content>
      </overview>
      <key-information>
         <category>Health &amp; Human Services</category>
    

    So my problem is with the < p > tags inside the overview.content node.

    I was hoping that this code would work :

            Element overview = sds.getChild("overview");
            Element content = overview.getChild("content");
    
            System.out.println(content.getText());
    

    but it returns blank.

    How do I return all the text ( nested tags and all ) from the overview.content node ?

    Thanks

  • G_H
    G_H over 12 years
    And then just manually turn Element type nodes into textual representation... Probably simpler than what I had in mind.
  • jeph perro
    jeph perro over 12 years
    No, unfortunately I don't generate the XML, I just have to consume it.
  • james.garriss
    james.garriss almost 11 years
    Perfect answer for those who don't need the element names in mixed content. Thank you!