Convert HTML to XML using Java

22,830

Solution 1

Try JTidy.

JTidy is a Java port of HTML Tidy. It can clean up malformed and faulty HTML and write the result as well-formed XHTML.

Solution 2

If you want to parse HTML rather than convert it to XML, you can use an HTML parser such as jsoup or HTML Parser:

http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/
http://htmlparser.sourceforge.net/javadoc/doc-files/using.html

I hope it helps.

Solution 3

HTML is not the same as XML unless it is conforming XHTML or HTML5 in XML mode.

I suggest using an HTML parser to read the HTML and then either transform it to XML or process it directly.
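To make the difference concrete, here is a minimal sketch that uses only the JDK's built-in XML parser to test whether a piece of markup is well-formed XML. The class name `WellFormedCheck`, the method `isWellFormedXml`, and the sample strings are made up for illustration; the inputs deliberately carry no DOCTYPE, so no DTD lookup is involved.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Ordinary HTML: <br> and <p> are never closed.
        String html = "<html><body><p>Hello<br>world</body></html>";
        // XHTML-style markup: every element is closed.
        String xhtml = "<html><body><p>Hello<br/>world</p></body></html>";

        System.out.println(isWellFormedXml(html));  // false
        System.out.println(isWellFormedXml(xhtml)); // true
    }
}
```

Ordinary HTML with an unclosed `<br>` fails the check, which is why an HTML-aware parser has to do the reading before any XML tooling can take over.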

Author by

suresh

Updated on October 22, 2020

Comments

  • suresh
    suresh over 3 years

    Can anyone suggest the best approach for converting HTML to XML using Java? Is there an API available for that? The HTML might also contain JavaScript code.

    I have tried the code below:

    import java.io.BufferedReader;
    import java.io.FileOutputStream;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.OutputStream;
    import org.jdom.Document;
    import org.jdom.JDOMException;
    import org.jdom.input.SAXBuilder;
    import org.jdom.output.Format;
    import org.jdom.output.XMLOutputter;
    
    class HTML2XML {
        public static void main(String[] args) throws JDOMException, IOException {
            // TagSoup is a SAX parser that tolerates malformed HTML; validation is off.
            SAXBuilder saxBuilder = new SAXBuilder(
                    "org.ccil.cowan.tagsoup.Parser", false);
    
            // Declare the output encoding so the XML prolog matches the bytes written.
            XMLOutputter outputter = new XMLOutputter();
            Format newFormat = outputter.getFormat();
            newFormat.setEncoding("iso-8859-2");
            outputter.setFormat(newFormat);
    
            // try-with-resources closes both files even on failure. The original
            // finally block called close() on streams that were never opened, so a
            // NullPointerException skipped the close of the output writer.
            try (BufferedReader brInHtml = new BufferedReader(
                         new FileReader("D:\\Second.html"));
                 OutputStream osOutXml = new FileOutputStream("D:\\Second.xml")) {
                Document jdomDocument = saxBuilder.build(brInHtml);
    
                outputter.output(jdomDocument, System.out);
                // Writing to an OutputStream lets the outputter apply the declared
                // encoding; a FileWriter would use the platform default instead.
                outputter.output(jdomDocument, osOutXml);
                System.out.flush();
            }
        }
    }
    

    But it's not working as expected.

    • GolezTrol
      GolezTrol over 10 years
      Do you mean XHTML? And what about this JavaScript code, what do you want to do with that?
    • suresh
      suresh over 10 years
      I have to convert a normal HTML file to XML.
    • GolezTrol
      GolezTrol over 10 years
      Do you need to convert them to XHTML? XHTML is an XML representation of HTML. 'Just' XML can be anything.
    • Clyde Lobo
      Clyde Lobo over 10 years
      Have you tried jtidy.sourceforge.net?
    • GolezTrol
      GolezTrol over 10 years
      Otherwise you can just embed the entire HTML document into a single XML element, as proven in this question. That is probably not what you want, but we need more info.
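The last comment's idea of embedding the whole HTML document in a single XML element can be sketched with the JDK's DOM and transformer APIs alone. This is only an illustration under assumptions: the class `HtmlEmbedder`, the methods `embed` and `extract`, and the element name `html-content` are hypothetical, and the HTML is assumed to contain only characters that are legal in XML 1.0.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class HtmlEmbedder {

    /** Stores raw HTML as the text content of one wrapper element. */
    static String embed(String html) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element wrapper = doc.createElement("html-content"); // hypothetical name
            wrapper.setTextContent(html); // markup characters are escaped on output
            doc.appendChild(wrapper);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads the original HTML back out of the wrapper document. */
    static String extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Malformed HTML with inline JavaScript; none of it needs to be valid XML.
        String html = "<p>Tom & Jerry<br><script>if (a < b) go();</script>";
        String xml = embed(html);
        System.out.println(xml);
        System.out.println(extract(xml).equals(html)); // true
    }
}
```

The result is well-formed XML no matter how broken the HTML is, but it carries the page as opaque text: none of the HTML structure becomes XML elements, so this only fits cases where the XML is a transport envelope.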