Splitting XML file into multiple at given tags

Solution 1

Use Python ElementTree.

Create a file, e.g. xmlsplitter.py, and add the code below (where file.xml is your XML file, assuming every row has a unique NAME element).

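import xml.etree.ElementTree as ET

# Stream the file; an 'end' event fires once an element is fully parsed.
context = ET.iterparse('file.xml', events=('end', ))
for event, elem in context:
    if elem.tag == 'row':
        title = elem.find('NAME').text
        filename = title + ".xml"
        # The file is opened in binary mode and ET.tostring() returns bytes,
        # so the declaration must be a bytes literal too.
        with open(filename, 'wb') as f:
            f.write(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem))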

Run it with

python xmlsplitter.py

Or if the names are not unique:

import xml.etree.ElementTree as ET

context = ET.iterparse('file.xml', events=('end', ))
index = 0  # running row number, used as the output file name
for event, elem in context:
    if elem.tag == 'row':
        index += 1
        filename = str(index) + ".xml"
        # Binary mode: write bytes only.
        with open(filename, 'wb') as f:
            f.write(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem))
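
If you would rather keep the NAME in the file name even when it repeats, one option is to add a counter only to the repeats. This is a minimal sketch: the helper name unique_filenames and the _2, _3 suffix scheme are my own assumptions, not part of the code above.

```python
from collections import Counter


def unique_filenames(names, extension=".xml"):
    """Return one file name per input name, suffixing repeats with _2, _3, ...

    Hypothetical helper for illustration only.
    """
    seen = Counter()
    result = []
    for name in names:
        seen[name] += 1
        # The first occurrence keeps the plain name; later ones get a counter.
        suffix = "" if seen[name] == 1 else "_{}".format(seen[name])
        result.append(name + suffix + extension)
    return result


print(unique_filenames(["Doe", "Mustermann", "Doe"]))
```

Note that a generated name such as Doe_2.xml could still clash with a row whose NAME really is Doe_2, so plain numbering is the safer choice when the names are arbitrary.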

Solution 2

XMLStarlet (http://xmlstar.sourceforge.net/docs.php) is an excellent command-line tool that can do a lot with XML, although it is not a Python solution.

Suppose you have a file 1.xml with the data above and need to split it into separate files named NNN.xml, one for each /root/row element, where NNN is the value of that row's NAME element.

Run this in the shell:

    $ for ((i=1; i<=`xmlstarlet sel -t -v 'count(/root/row)' 1.xml`; i++)); do \
          NAME=$(xmlstarlet sel -t -m '/root/row[position()='$i']' -v './NAME' 1.xml); \
          echo '<?xml version="1.0" encoding="UTF-8"?><root>' > "$NAME.xml"; \
          xmlstarlet sel -t -m '/root/row[position()='$i']' -c . -n 1.xml >> "$NAME.xml"; \
          echo '</root>' >> "$NAME.xml"; \
       done

Now you have a set of XML files such as Doe.xml.

Solution 3

This is the code that works perfectly for me.

import xml.etree.ElementTree as ET

context = ET.iterparse('filename.xml', events=('end', ))
for event, elem in context:
    if elem.tag == 'row':
        title = elem.find('NAME').text
        filename = title + ".xml"
        # The file is opened in binary mode, so every write must be bytes.
        with open(filename, 'wb') as f:
            f.write(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(b"<root>\n")
            f.write(ET.tostring(elem))
            f.write(b"</root>")
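
A variant of the same idea, as a sketch only: instead of writing the root tags by hand, wrap each row in a new root element and let ElementTree write the declaration. The function name split_rows, its parameters, and the numbered output file names are assumptions for illustration.

```python
import os
import xml.etree.ElementTree as ET


def split_rows(source, out_dir, row_tag="row", root_tag="root"):
    """Write every row element of `source` to its own file, wrapped in a root.

    Hypothetical helper; returns the list of paths it wrote.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    index = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != row_tag:
            continue
        index += 1
        # Build a fresh wrapper so each output file is a complete document.
        wrapper = ET.Element(root_tag)
        wrapper.append(elem)
        path = os.path.join(out_dir, "{}.xml".format(index))
        ET.ElementTree(wrapper).write(path, encoding="UTF-8", xml_declaration=True)
        written.append(path)
        # Drop the row's children once written to keep memory use low.
        elem.clear()
    return written
```

Calling split_rows('file.xml', 'out') would then produce out/1.xml, out/2.xml and so on, each containing one row inside a root element.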
Author: Roger Sánchez

Updated on May 08, 2020

Comments

  • Roger Sánchez
    Roger Sánchez almost 4 years

    I want to split an XML file into multiple files. My workstation is limited to Eclipse Mars with Xalan 2.7.1.

    I can also use Python, but I have never used it before.

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <row>
            <NAME>Doe</NAME>
            <FIRSTNAME>Jon</FIRSTNAME>
            <GENDER>M</GENDER>
        </row>
        <row>
            <NAME>Mustermann</NAME>
            <FIRSTNAME>Max</FIRSTNAME>
            <GENDER>M</GENDER>
        </row>
    </root>
    

    How can I transform it so that each output file looks like this?

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <row>
            <NAME>Doe</NAME>
            <FIRSTNAME>Jon</FIRSTNAME>
            <GENDER>M</GENDER>
        </row>
    </root>
    

    I need the data of every "row" in its own file, with the header. The data above is just an example. Most "row" entries have 16 attributes, but the number varies.

  • Roger Sánchez
    Roger Sánchez about 8 years
    Thank you Dan-Dev, I edited your code a bit and added the "root" tag. One more question: how can I append .xml to the output file names?
  • Dan-Dev
    Dan-Dev about 8 years
    I edited it a minute ago; it now reads: filename = format(title + ".xml"). That should append the .xml extension to your files if you run it again with the edited code.
  • Roger Sánchez
    Roger Sánchez about 8 years
    OK, just one more problem. Some NAME values occur more than once. Is it possible to number the output files instead, starting with e.g. 1.xml?
  • Dan-Dev
    Dan-Dev about 8 years
    Edited: I added the code after "Or if the names are not unique:".
  • Rafał Pydyniak
    Rafał Pydyniak about 7 years
    It looks like in Python 3 you need to write the strings as bytes, like this: f.write(b"<root>\n"). Note the letter b before "<root>\n".
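
To illustrate the last comment, here is a small sketch of why the b prefix matters in Python 3: ET.tostring() returns bytes by default, and only returns text when asked for encoding="unicode". The variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

row = ET.fromstring("<row><NAME>Doe</NAME></row>")

# Default call: bytes, which suits a file opened with 'wb'.
as_bytes = ET.tostring(row)

# encoding="unicode": a str, which suits a file opened with 'w'.
as_text = ET.tostring(row, encoding="unicode")

print(type(as_bytes).__name__, type(as_text).__name__)
```

So either open the file with 'wb' and write bytes everywhere, or open it with 'w' and write text everywhere; mixing the two is what raises the TypeError.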