How to merge >1000 xml files into one using Java

Solution 1

You might also consider using StAX. Here's code that would do what you want:

import java.io.File;
import java.io.FileWriter;
import java.io.Writer;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.stream.StreamSource;

public class XMLConcat {
    public static void main(String[] args) throws Throwable {
        File dir = new File("/tmp/rootFiles");
        File[] rootFiles = dir.listFiles();

        Writer outputWriter = new FileWriter("/tmp/mergedFile.xml");
        XMLOutputFactory xmlOutFactory = XMLOutputFactory.newFactory();
        XMLEventWriter xmlEventWriter = xmlOutFactory.createXMLEventWriter(outputWriter);
        XMLEventFactory xmlEventFactory = XMLEventFactory.newFactory();

        xmlEventWriter.add(xmlEventFactory.createStartDocument());
        xmlEventWriter.add(xmlEventFactory.createStartElement("", null, "rootSet"));

        XMLInputFactory xmlInFactory = XMLInputFactory.newFactory();
        for (File rootFile : rootFiles) {
            XMLEventReader xmlEventReader = xmlInFactory.createXMLEventReader(new StreamSource(rootFile));
            XMLEvent event = xmlEventReader.nextEvent();
            // Skip ahead in the input to the opening document element
            while (event.getEventType() != XMLEvent.START_ELEMENT) {
                event = xmlEventReader.nextEvent();
            }

            do {
                xmlEventWriter.add(event);
                event = xmlEventReader.nextEvent();
            } while (event.getEventType() != XMLEvent.END_DOCUMENT);
            xmlEventReader.close();
        }

        xmlEventWriter.add(xmlEventFactory.createEndElement("", null, "rootSet"));
        xmlEventWriter.add(xmlEventFactory.createEndDocument());

        xmlEventWriter.close();
        outputWriter.close();
    }
}

One minor caveat: StAX events do not record whether an element was written as an empty tag, so <foo/> may come out as <foo></foo>. The two forms are equivalent to an XML parser.

Solution 2

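Just do it without any XML parsing, since the task doesn't seem to require actually parsing the XML. Note that this only gives well-formed output if the input files carry no XML declaration, because a declaration would otherwise end up in the middle of the merged file.

For efficiency, do something like this: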

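import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

File dir = new File("/tmp/rootFiles");
String[] files = dir.list();
if (files == null) {
    System.out.println("No roots to merge!");
} else {
    try (FileChannel output = new FileOutputStream("output").getChannel()) {
        // Write the opening wrapper element
        ByteBuffer buff = ByteBuffer.allocate(32);
        buff.put("<rootSet>\n".getBytes(StandardCharsets.UTF_8));
        buff.flip();
        output.write(buff);
        buff.clear();
        for (String file : files) {
            try (FileChannel in = new FileInputStream(new File(dir, file)).getChannel()) {
                // transferTo may move fewer bytes than requested, so loop until the whole file is copied
                long size = in.size();
                for (long pos = 0; pos < size; ) {
                    pos += in.transferTo(pos, size - pos, output);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        // Write the closing wrapper element
        buff.put("</rootSet>\n".getBytes(StandardCharsets.UTF_8));
        buff.flip();
        output.write(buff);
    } catch (IOException e) {
        e.printStackTrace();
    }
}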

Solution 3

DOM needs to keep the whole document in memory. If you don't need to do any special operation on your tags, I would simply use an InputStream and read all the files. If you do need to process the content, use SAX.
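
The plain-stream idea could look roughly like the sketch below. It assumes the input files contain no XML declaration or DOCTYPE and that they share the output's encoding (UTF-8 here); the class and method names (StreamConcat, merge) are made up for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class StreamConcat {
    // Hypothetical helper: copies the raw bytes of every regular file in inputDir
    // into outputFile, wrapped in a <rootSet> element. No XML parsing takes place.
    public static void merge(Path inputDir, Path outputFile) throws IOException {
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(outputFile));
             DirectoryStream<Path> files = Files.newDirectoryStream(inputDir)) {
            out.write("<rootSet>\n".getBytes(StandardCharsets.UTF_8));
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    // Streams the file to the output; only a small buffer is held in memory
                    Files.copy(file, out);
                    out.write('\n');
                }
            }
            out.write("</rootSet>\n".getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: StreamConcat <inputDir> <outputFile>");
            return;
        }
        merge(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The output file should live outside the input directory, otherwise it could be picked up as one of the inputs.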

Solution 4

DOM does consume a lot of memory. You have, IMHO, the following alternatives.

The best one is to use SAX. With SAX only a very small amount of memory is used, because essentially a single element at a time is travelling from input to output, so the memory footprint is extremely low. However, SAX is not that simple to use; compared to DOM it is a bit counterintuitive.

Try StAX. I haven't tried it myself, but it's a kind of SAX on steroids that is easier to implement and use: instead of just receiving SAX events you don't control, you actually "ask the source" to stream you the elements you want. It sits in the middle between DOM and SAX, with a memory footprint similar to SAX but a friendlier paradigm.

SAX, StAX and DOM all matter if you want to correctly preserve or declare namespaces and other XML oddities.

However, if you just need a quick and dirty way, which will probably be namespace compliant as well, use plain old strings and writers.

Start by writing the declaration and the root element of your "big" document to the FileWriter. Then load each file, using DOM if you like. Select the elements you want to end up in the "big" file, serialize them back to a string, and send them to the writer. The writer will flush to disk without using an enormous amount of memory, and DOM will load only one document per iteration. Unless you also have very big files on the input side, or plan to run it on a cellphone, you should not have a lot of memory problems. If DOM serializes it correctly, it should preserve namespace declarations and the like, and the code will be just a few lines longer than the one you posted.
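
A minimal sketch of that last approach, assuming each input file is small enough to parse on its own and that UTF-8 output is acceptable. The names DomPerFileMerge and merge are hypothetical, and an identity Transformer is used here as one way to serialize a DOM element; other serializers would work too.

```java
import java.io.File;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;

class DomPerFileMerge {
    // Hypothetical helper: parses one file at a time with DOM and streams
    // its document element to the writer, so only one document is in memory.
    public static void merge(File inputDir, File outputFile) throws Exception {
        File[] files = inputDir.listFiles();
        if (files == null) {
            throw new IllegalArgumentException("Not a readable directory: " + inputDir);
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        // Identity transformer used purely as a serializer; the per-file
        // XML declaration is suppressed so it does not appear mid-document.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        try (Writer out = Files.newBufferedWriter(outputFile.toPath(), StandardCharsets.UTF_8)) {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<rootSet>\n");
            for (File file : files) {
                if (!file.isFile()) {
                    continue;
                }
                Document doc = builder.parse(file);
                serializer.transform(new DOMSource(doc.getDocumentElement()), new StreamResult(out));
                out.write("\n");
            }
            out.write("</rootSet>\n");
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: DomPerFileMerge <inputDir> <outputFile>");
            return;
        }
        merge(new File(args[0]), new File(args[1]));
    }
}
```

Because each parsed Document goes out of scope at the end of its loop iteration, memory use should track the largest single input file rather than the total.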

Solution 5

For this kind of work I would suggest not using DOM; reading the file content and taking a substring is simpler and sufficient.

I'm thinking of something like this:

String rootContent = document.substring(document.indexOf("<root>"), document.lastIndexOf("</root>")+7);

Then, to avoid too much memory consumption, write to the main file after every XML extraction, with a BufferedWriter for example. For better performance you can also use java.nio.
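
Put together, the substring idea might be sketched as follows. It assumes every file is UTF-8, fits in memory on its own, and uses a literal <root> start tag with no attributes; SubstringMerge, extractRoot and merge are illustrative names, not part of any library.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class SubstringMerge {
    private static final String OPEN = "<root>";
    private static final String CLOSE = "</root>";

    // Returns the text from the first "<root>" through the last "</root>",
    // or null when the document does not contain both tags in that order.
    public static String extractRoot(String document) {
        int start = document.indexOf(OPEN);
        int end = document.lastIndexOf(CLOSE);
        if (start < 0 || end < start) {
            return null;
        }
        return document.substring(start, end + CLOSE.length());
    }

    // Hypothetical helper: writes each extracted fragment to the output as
    // soon as it is available, so only one input file is held in memory.
    public static void merge(Path inputDir, Path outputFile) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8);
             DirectoryStream<Path> files = Files.newDirectoryStream(inputDir)) {
            out.write("<rootSet>\n");
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                String document = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                String rootContent = extractRoot(document);
                if (rootContent != null) {
                    out.write(rootContent);
                    out.write("\n");
                }
            }
            out.write("</rootSet>\n");
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SubstringMerge <inputDir> <outputFile>");
            return;
        }
        merge(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Files without a matching <root>...</root> pair are silently skipped in this sketch; logging them may be preferable in practice.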

Author by

Andra

Updated on June 15, 2022

Comments

  • Andra
    Andra almost 2 years

    I am trying to merge many XML files into one. I have successfully done that with DOM, but this solution is limited to a few files. When I run it on many files (>1000) I get a java.lang.OutOfMemoryError.

    What I want to achieve is this: I have the following files

    file 1:

    <root>
    ....
    </root>
    

    file 2:

    <root>
    ......
    </root>
    

    file n:

    <root>
    ....
    </root>
    

    resulting in this output:

    <rootSet>
    <root>
    ....
    </root>
    <root>
    ....
    </root>
    <root>
    ....
    </root>
    </rootSet>
    

    This is my current implementation:

        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
        Document doc = docBuilder.newDocument();
        Element rootSetElement = doc.createElement("rootSet");
        Node rootSetNode = doc.appendChild(rootSetElement);
        Element creationElement = doc.createElement("creationDate");
        rootSetNode.appendChild(creationElement);
        creationElement.setTextContent(dateString);
        File dir = new File("/tmp/rootFiles");
        String[] files = dir.list();
        if (files == null) {
            System.out.println("No roots to merge!");
        } else {
            Document rootDocument;
            for (int i = 0; i < files.length; i++) {
                File filename = new File(dir, files[i]);
                rootDocument = docBuilder.parse(filename);
                // Deep-copy the <root> element of the parsed file into the merged document
                Node tempDoc = doc.importNode(rootDocument.getElementsByTagName("root").item(0), true);
                rootSetNode.appendChild(tempDoc);
            }
        }
    

    I have experimented a lot with XSLT and SAX, but I seem to keep missing something. Any help would be highly appreciated.