Convert XML to dictionary in Python using lxml


Solution 1

Personally, I like xmltodict. You can install it with pip: pip install xmltodict

Note that older versions of xmltodict return OrderedDict objects rather than plain dicts (recent releases on Python 3.7+ return plain dicts). Example usage:

import xmltodict as xd

# Open in binary mode: on Python 3 the underlying expat parser expects bytes from read()
with open('test.xml', 'rb') as f:
    d = xd.parse(f)
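
If plain dicts are preferred over the OrderedDict objects mentioned above, one option is a JSON round trip. The following is a minimal sketch that uses only the standard library; the hand-built OrderedDict is a stand-in for a parse result, and its keys and the helper name to_plain_dict are made up for illustration. It assumes every value is JSON-serializable (strings, None, nested mappings and lists).

```python
import json
from collections import OrderedDict

# Hand-built stand-in for a nested OrderedDict (hypothetical keys, not real parser output)
parsed = OrderedDict([
    ("status", OrderedDict([
        ("section1", OrderedDict([("@field1", "data1"), ("@field2", "data2")])),
    ])),
])


def to_plain_dict(mapping):
    """Convert nested OrderedDicts to plain dicts via a JSON round trip."""
    return json.loads(json.dumps(mapping))


plain = to_plain_dict(parsed)
print(plain)
```

The round trip preserves key order and nesting; only the container type changes.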

Solution 2

I found a solution in this gist: https://gist.github.com/jacobian/795571

def elem2dict(node):
    """
    Convert an lxml.etree node tree into a dict.
    """
    result = {}

    for element in node.iterchildren():
        # Remove namespace prefix
        key = element.tag.split('}')[1] if '}' in element.tag else element.tag

        # Process element as tree element if the inner XML contains non-whitespace content
        if element.text and element.text.strip():
            value = element.text
        else:
            value = elem2dict(element)
        if key in result:
            # Repeated tag: collect the values in a list
            if isinstance(result[key], list):
                result[key].append(value)
            else:
                # Wrap the first value directly; it may be a str, which has no .copy()
                result[key] = [result[key], value]
        else:
            result[key] = value
    return result
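
The function above reads only element text, so for the XML in the question (where all the data sits in attributes) it would return empty dicts for both sections. A possible attribute-aware variant is sketched below. The names strip_ns and elem_value are hypothetical, the fallback to xml.etree.ElementTree is an assumption so that the sketch also runs where lxml is not installed, and text is deliberately ignored on elements that also carry attributes or children.

```python
try:
    from lxml import etree  # preferred, as in the answer above
except ImportError:  # assumption: the stdlib API is sufficient for this sketch
    import xml.etree.ElementTree as etree

XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<status xmlns:mystatus="http://localhost/mystatus">
<section1
    mystatus:field1="data1"
    mystatus:field2="data2" />
<section2
    mystatus:lineA="outputA"
    mystatus:lineB="outputB" />
</status>
"""


def strip_ns(name):
    """Drop a leading '{namespace}' from a tag or attribute name."""
    return name.split('}', 1)[1] if '}' in name else name


def elem_value(node):
    """Return attributes and children of node as a dict, or its text if it has neither."""
    value = {strip_ns(k): v for k, v in node.attrib.items()}
    for child in node:
        if not isinstance(child.tag, str):
            continue  # lxml yields comments and processing instructions too; skip them
        key = strip_ns(child.tag)
        child_value = elem_value(child)
        if key in value:
            # Repeated key: collect the values in a list
            if not isinstance(value[key], list):
                value[key] = [value[key]]
            value[key].append(child_value)
        else:
            value[key] = child_value
    text = (node.text or '').strip()
    if text and not value:
        return text
    return value


root = etree.fromstring(XML)
result = (strip_ns(root.tag), elem_value(root))
print(result)
```

For the question's XML this yields a ('status', {...}) tuple with one attribute dict per section. The xmlns:mystatus declaration on the root is a namespace declaration rather than an ordinary attribute, so it does not appear in the result.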
Author: proximous

Updated on September 15, 2022

Comments

  • proximous about 1 year

    There seem to be lots of solutions on StackOverflow for converting XML to a Python dictionary, but none of them generate the output I'm looking for. I have the following XML:

    <?xml version="1.0" encoding="UTF-8"?>
    <status xmlns:mystatus="http://localhost/mystatus">
    <section1
        mystatus:field1="data1"
        mystatus:field2="data2" />
    <section2
        mystatus:lineA="outputA"
        mystatus:lineB="outputB" />
    </status>
    

    lxml has an elegantly simple solution for converting XML to a dictionary:

    def recursive_dict(element):
        return element.tag, dict(map(recursive_dict, element)) or element.text
    

    Unfortunately, I get:

    ('status', {'section2': None, 'section1': None})
    

    instead of:

    ('status', {'section1':
                           {'field1':'data1','field2':'data2'},
                'section2':
                           {'lineA':'outputA','lineB':'outputB'}
                })
    

    I can't figure out how to get my desired output without greatly complicating the recursive_dict() function.

    I'm not tied to lxml, and I'm also fine with a different organization of the dictionary, as long as it gives me all the info in the xml. Thanks!

    • GreenAsJade about 9 years
      Out of curiosity, why do you expect to get the attributes of sections, but not the attributes of the status? What magic lets the library know that's what you want? And ... is there some reason why the contents of sections are attributes and not elements?
    • GreenAsJade about 9 years
      It sounds like you have a solution, but I just want to note that the desired output you showed does not show ALL the information captured. It shows the attributes of sections captured, but not the attributes of status.
  • proximous about 9 years
    Although I'd prefer not to install anything extra, this is very simple and looks like it will work, so I'll give it a try. Thanks!
  • proximous about 9 years
    This works great! I prefer to reformat it with d=ast.literal_eval(json.dumps(d)) after the parse, but the default output preserves everything perfectly for me. Thanks!
  • AlexanderLedovsky over 7 years
    Be careful with xmltodict when working with large XML files. xmltodict uses the XML parser from Python's standard library, and it becomes very slow when the XML is larger than 1 GB. Use lxml instead.
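
For large files like those in the last comment, one common approach is to stream the document instead of building the whole tree at once. The sketch below shows that idea with iterparse. The helper name iter_section_attrs and the small in-memory sample are hypothetical, the fallback to xml.etree.ElementTree is an assumption for environments without lxml, and no claim is made here about actual speed on gigabyte-sized input.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: the stdlib iterparse behaves the same for this sketch
    import xml.etree.ElementTree as etree


def iter_section_attrs(source, tag):
    """Yield the attribute dict of each element named tag, one element at a time."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Copy the attributes first, because clear() discards them
            yield dict(elem.attrib)
            elem.clear()  # drop the element's content once it has been handled


# Small in-memory sample standing in for a large file opened in binary mode
data = io.BytesIO(
    b"<status><section1 a='1'/><section2 b='2'/><section1 a='3'/></status>"
)
rows = list(iter_section_attrs(data, "section1"))
print(rows)
```

Each matching element is converted and cleared as soon as its end tag is seen, so the caller can process records one by one rather than holding a full dictionary of the document.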