Python create tree from a JSON file

10,251

Solution 1

In your create_tree_from_JSON you never pass on the tree during recursion. Yet you try to return it.

def create_tree_from_JSON(json, parent=None):
    if not parent:
        tree = Tree()  # tree is only created for root node
        ...
    else:
        parent = parent  # tree is not created here
    ...
    return tree  # tree is always returned

Either pass on the tree during recursion, or separate the root step from the others:

def create_tree_from_JSON(json):  # root case
    tree = Tree()
    node_0 = TreeNode("ROOT")
    tree.root = node_0
    parent = node_0
    _walk_tree(json, parent)

def _walk_tree(json, parent):  # recursive case
    for key in json:
        if isinstance(json[key], dict):
            head = TreeNode(key)
            _walk_tree(json[key], head)
        else:
            node = TreeNode(key)
            node.add_child(TreeNode(json[key]))
        parent.add_child(node)

Note that what you are doing can be solved much easier using plain dicts. Your class is effectively just wrapping a custom interface around dict to begin with.

def treeify(data) -> dict:
    if isinstance(data, dict):  # already have keys, just recurse
       return {key: treeify(children) for key, children in data.items()}
    elif isinstance(data, list):  # make keys from indices
       return {idx: treeify(children) for idx, children in enumerate(data, start=1)}
    else:  # leave node, no recursion
       return data

You can feed any decoded json data to this.

>>> treeify(json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] })
{'name': 'John', 'items': {1: {'item_name': 'lettuce', 'price': 2.65, 'units': 'no'}, 2: {'item_name': 'ketchup', 'price': 1.51, 'units': 'litres'}}}

To get the desired pretty-printed output, you can walk this structure with a stack of current keys. A generator is appropriate to create each query line on the fly:

def format_query(tree, stack=('content_json',)) -> str:
    if isinstance(tree, dict):  # build stack of keys
        for key, child in tree.items():
            yield from format_query(child, stack + (key,))
    else:  # print complete stack, discarding leaf data in tree
       *keys, field = stack
       path = ' -> '.join(
           str(key) if isinstance(key, int) else "'%s'" % key
           for key in keys
       )
       yield path + " ->> '%s' as %s" % (field, field)

Given your second example, this allows you to get a list of query lines:

>>> list(format_query(treeify({ "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] })))
["'content_json' ->> 'name' as name",
 "'content_json' -> 'items' -> 1 ->> 'item_name' as item_name",
 "'content_json' -> 'items' -> 1 ->> 'price' as price",
 "'content_json' -> 'items' -> 1 ->> 'units' as units",
 "'content_json' -> 'items' -> 2 ->> 'item_name' as item_name",
 "'content_json' -> 'items' -> 2 ->> 'price' as price",
 "'content_json' -> 'items' -> 2 ->> 'units' as units"]

Solution 2

You can use recursion:

def format_query(d):
  if all(not isinstance(i, tuple) for i in d):
    return 'select\n{}\nfrom table_with_json'.format(',\n'.join('\tcontent_json {}'.format("->> '{}' as {}".format(i[0], i[0]) if len(i) == 1 else "-> {} ->> '{}' as {}".format(' -> '.join("'{}'".format(j) for j in i[:-1]), i[-1], i[-1])) for i in d))
  return '\n\n'.join(format_query([c for b in i for c in b]) for i in d)

def get_dict(d, c = []):
  for a, b in d.items():
     if not isinstance(b, (dict, list)):
       yield c+[a]
     elif isinstance(b, dict):
       yield from to_query(b, c+[a])

def to_query(d, q = []):
  if not any(isinstance(i, list) for i in d.values()):
     yield from get_dict(d, c=q)
  else:
     _c = list(get_dict(d))
     for a, b in d.items():
       if isinstance(b, list):
         for i, j in enumerate(b, 1):
            yield (_c, list(get_dict(j, [a, i])))

Now, to format:

json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }
print(format_query(list(to_query(json_file))))

Output:

select
      content_json ->> 'name' as name,
      content_json -> 'items' -> '1' ->> 'item_name' as item_name,
      content_json -> 'items' -> '1' ->> 'price' as price,
      content_json -> 'items' -> '1' ->> 'units' as units
from table_with_json

select
     content_json ->> 'name' as name,
     content_json -> 'items' -> '2' ->> 'item_name' as item_name,
     content_json -> 'items' -> '2' ->> 'price' as price,
     content_json -> 'items' -> '2' ->> 'units' as units
from table_with_json
Share:
10,251
balkon16
Author by

balkon16

R and Python user.

Updated on June 05, 2022

Comments

  • balkon16
    balkon16 almost 2 years

    Let's say that we have the following JSON file. For the sake of the example it's emulated by a string. The string is the input and a Tree object should be the output. I'll be using the graphical notation of a tree to present the output.

    I've found the following classes to handle tree concept in Python:

    class TreeNode(object):
        def __init__(self, data):
            self.data = data
            self.children = []
    
        def add_child(self, obj):
            self.children.append(obj)
    
        def __str__(self, level=0):
            ret = "\t"*level+repr(self.data)+"\n"
            for child in self.children:
                ret += child.__str__(level+1)
            return ret
    
        def __repr__(self):
            return '<tree node representation>'
    
    class Tree:
        def __init__(self):
            self.root = TreeNode('ROOT')
    
        def __str__(self):
            return self.root.__str__()
    

    The input file can be of different complexity:

    Simple case

    Input:

    json_file = '{"item1": "end1", "item2": "end2"}'

    Output:

    "ROOT"
        item1
            end1
        item2
            end2
    

    Embedded case

    Input:

    json_file = {"item1": "end1", "item2": {"item3": "end3"}}

    Output:

    "ROOT"
        item1
            end1
        item2
            item3
                end3
    

    Array case

    Input:

    json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }

    Output:

    "ROOT"
        name
            John
        items
            1
                item_name
                    lettuce
                price
                    2.65
                units
                    no
            2   
                item_name
                    ketchup
                price
                    1.51
                units
                    litres
    

    Please note that each item in an array is described with an integer (starting at 1).

    So far I've managed to come up with the following function that solves the problem for the simple case. In terms of the embedded case I know that I must use recursion but so far I get UnboundLocalError: local variable 'tree' referenced before assignment.

    def create_tree_from_JSON(json, parent=None):
        if not parent:
            tree = Tree()
            node_0 = TreeNode("ROOT")
            tree.root = node_0
            parent = node_0
        else:
            parent = parent
    
        for key in json:
            if isinstance(json[key], dict):
                head = TreeNode(key)
                create_tree_from_JSON(json[key], head)
            else:
                node = TreeNode(key)
                node.add_child(TreeNode(json[key]))
                parent.add_child(node)
    
        return tree
    

    Problem's background

    You may wonder why would I need to change a JSON object into a tree. As you may know PostgreSQL provides a way to handle JSON fields in the database. Given a JSON object I can get the value of any field by using -> and ->> notation. Here and here more about the subject. I will be creating new tables based on the fields' names and values. Unfortunately the JSON objects vary to such an extent that I cannot write the .sql code manually - I must find a way to do it automatically.

    Let's assume that I want to create a table based on the embedded case. I need to get the following .sql code:

    select 
        content_json ->> 'item1' as end1,
        content_json -> 'item_2' ->> 'item_3' as end3
    from table_with_json
    

    Substitute content_json for "ROOT" and you can see that each line in SQL code is simply a depth-first traversal from "ROOT" to a leaf (move from the last node to leaf is always annotated with ->>).

    EDIT: In order to make the question more clear I'm adding the target .sql query for the array case. I would like there to be as many queries as there are elements in the array:

    select
        content_json ->> 'name' as name,
        content_json -> 'items' -> 1 -> 'item_name' as item_name,
        content_json -> 'items' -> 1 -> 'price' as price,
        content_json -> 'items' -> 1 -> 'units' as units
    from table_with_json
    
    select
        content_json ->> 'name' as name,
        content_json -> 'items' -> 2 ->> 'item_name' as item_name,
        content_json -> 'items' -> 2 ->> 'price' as price,
        content_json -> 'items' -> 2 ->> 'units' as units
    from table_with_json
    

    Solution so far (07.05.2019)

    I'm testing the current solution for the moment:

    from collections import OrderedDict
    
    def treeify(data) -> dict:
        if isinstance(data, dict):  # already have keys, just recurse
            return OrderedDict((key, treeify(children)) for key, children in data.items())
        elif isinstance(data, list):  # make keys from indices
            return OrderedDict((idx, treeify(children)) for idx, children in enumerate(data, start=1))
        else:  # leave node, no recursion
            return data
    
    def format_query(tree, stack=('content_json',)) -> str:
        if isinstance(tree, dict):  # build stack of keys
            for key, child in tree.items():
                yield from format_query(child, stack + (key,))
        else:  # print complete stack, discarding leaf data in tree
            *keys, field = stack
            path = ' -> '.join(
                str(key) if isinstance(key, int) else "'%s'" % key
                for key in keys
            )
            yield path + " ->> '%s' as %s" % (field, field)
    
    def create_select_query(lines_list):
        query = "select\n"
        for line_number in range(len(lines_list)):
            if "_class" in lines_list[line_number]:
                # ignore '_class' fields
                continue
            query += "\t" + lines_list[line_number]
            if line_number == len(lines_list)-1:
                query += "\n"
            else:
                query += ",\n"
        query += "from table_with_json"
        return query
    

    I'm currently working on a JSON like this:

    stack_nested_example = {"_class":"value_to_be_ignored","first_key":{"second_key":{"user_id":"123456","company_id":"9876","question":{"subject":"some_subject","case_type":"urgent","from_date":{"year":2011,"month":11,"day":11},"to_date":{"year":2012,"month":12,"day":12}},"third_key":[{"role":"driver","weather":"great"},{"role":"father","weather":"rainy"}]}}}
    

    In the output I get the only constant element is the order of lines treated with array logic. Order of other lines differs. The output I would like to get is the one that takes into account order of the keys:

    select
            'content_json' -> 'first_key' -> 'second_key' ->> 'user_id' as user_id,
            'content_json' -> 'first_key' -> 'second_key' ->> 'company_id' as company_id,
            'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'subject' as subject,
            'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'case_type' as case_type,
            'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'year' as year,
            'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'month' as month,
            'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'day' as day,
            'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'year' as year,
            'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'month' as month,
            'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'day' as day,
            'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'role' as role,
            'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'weather' as weather,
            'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'role' as role,
            'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'weather' as weather
    from table_with_json
    
  • balkon16
    balkon16 about 5 years
    Thanks for the comment. Your answer works for printing out the tree but as I've pointed in the question I need to generate a tree and I'm using graphical (with tabs) representation to present output I'd like to obtain :)
  • Ajax1234
    Ajax1234 about 5 years
    @balkon16 Opps, thank you for pointing that out. What would your SQL query look like for the input with the list?
  • Ajax1234
    Ajax1234 about 5 years
    @balkon16 Thank you. In the inputs with no arrays, the values of the final key are included in the result i.e content_json -> 'item_1' ->> 'end1' as end1,. However, with the array, you only keep the key, content_json ->> 'name' as name,, and not provide the value, content_json -> 'name' ->> 'John' as John. Can you clarify this for me? Thank you.
  • balkon16
    balkon16 about 5 years
    this is my mistake. The .sql query for the embedded case (no arrays) follows the same logic as the array case.
  • balkon16
    balkon16 about 5 years
    Thank you for your input. Please note that I've changed the logic in the embedded case. The last item in the chain is the last key preceded by ->>.
  • Ajax1234
    Ajax1234 about 5 years
    @balkon16 Thanks for your patience :) Please see my recent edit.
  • balkon16
    balkon16 about 5 years
    I get the NameError: name 'keys' is not defined while trying to replicate your answer
  • balkon16
    balkon16 about 5 years
    I get SyntaxError: invalid syntax and the following location is indicated by the Python parser: ...'\tcontent_json {}'.format(f"->> '{i[0]}' as {i[0]}" (the last quotation mark)
  • MisterMiyagi
    MisterMiyagi about 5 years
    @balkon16 My bad, format_query should work now.
  • balkon16
    balkon16 about 5 years
    Thanks for the edit. It turns out that 'content_json' must be unquoted (content_json). I decided to go with a rather unelegant solution: path = ' -> '.join(str(key) if isinstance(key, int) else "'%s'" % key for key in keys).replace("'content_json'", "content_json") (added replace(... part. Do you happen to know a more elegant solution?
  • balkon16
    balkon16 about 5 years
    Another question: is a there a way to change the current implementation that uses dict into one that uses collections.OrderedDict so that the order of the fields in JSON file is preserved?
  • MisterMiyagi
    MisterMiyagi about 5 years
    @balkon16 If you are using a sufficiently recent version of python, dict is insertion-ordered. Practically all current PyPy version should do, and CPython from 3.6 onwards as well.
  • balkon16
    balkon16 about 5 years
    I'm aiming at my solution working for Python 3.5.2, which doesn't support insertion-ordered dicts
  • MisterMiyagi
    MisterMiyagi about 5 years
    @balkon16 You can convert the dict comprehensions to generator expressions inside an ordered dict. Basically replace {a: b ...} by OrderedDict((a, b) ...). For example OrderedDict((key, treeify(children)) for key, children in data.items()) for the first dict comprehension.
  • Ajax1234
    Ajax1234 about 5 years
    @balkon16 That is the original code used f-strings, which are only available in Python3.7. Please try now, as I replaced the f-strings with general string formatting.
  • balkon16
    balkon16 almost 5 years
    Hmmm that's weird. Now I get the IndexError: tuple index out of range
  • balkon16
    balkon16 almost 5 years
    even though I've added the OrderedDict() in both of the comprehensions but still end up with the solution (please see the edit from 07.05.2019) which doesn't keep the order of items
  • MisterMiyagi
    MisterMiyagi almost 5 years
    @balkon16 If you load your JSON to a dict, you loose order there already. You must load the JSON using OrderedDict to represent objects with ordering: stackoverflow.com/questions/6921699/…
  • Ajax1234
    Ajax1234 almost 5 years
    @balkon16 Strange. Please post the input that raises the error.
  • balkon16
    balkon16 almost 5 years
    It's the "Now, to format:" part of your answer: json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] } and then print(format_query(list(to_query(json_file))))
  • Ajax1234
    Ajax1234 almost 5 years
    @balkon16 Ah, the change from f-strings to str.format broke part of the formatting. It works for me now. Sorry for the delay!