Python create tree from a JSON file
Solution 1
In your create_tree_from_JSON
you never pass on the tree during recursion. Yet you try to return it.
def create_tree_from_JSON(json, parent=None):
if not parent:
tree = Tree() # tree is only created for root node
...
else:
parent = parent # tree is not created here
...
return tree # tree is always returned
Either pass on the tree
during recursion, or separate the root step from the others:
def create_tree_from_JSON(json): # root case
tree = Tree()
node_0 = TreeNode("ROOT")
tree.root = node_0
parent = node_0
_walk_tree(json, parent)
def _walk_tree(json, parent): # recursive case
for key in json:
if isinstance(json[key], dict):
head = TreeNode(key)
_walk_tree(json[key], head)
else:
node = TreeNode(key)
node.add_child(TreeNode(json[key]))
parent.add_child(node)
Note that what you are doing can be solved much easier using plain dict
s. Your class is effectively just wrapping a custom interface around dict
to begin with.
def treeify(data) -> dict:
if isinstance(data, dict): # already have keys, just recurse
return {key: treeify(children) for key, children in data.items()}
elif isinstance(data, list): # make keys from indices
return {idx: treeify(children) for idx, children in enumerate(data, start=1)}
else: # leave node, no recursion
return data
You can feed any decoded json data to this.
>>> treeify(json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] })
{'name': 'John', 'items': {1: {'item_name': 'lettuce', 'price': 2.65, 'units': 'no'}, 2: {'item_name': 'ketchup', 'price': 1.51, 'units': 'litres'}}}
To get the desired pretty-printed output, you can walk this structure with a stack of current keys. A generator is appropriate to create each query line on the fly:
def format_query(tree, stack=('content_json',)) -> str:
if isinstance(tree, dict): # build stack of keys
for key, child in tree.items():
yield from format_query(child, stack + (key,))
else: # print complete stack, discarding leaf data in tree
*keys, field = stack
path = ' -> '.join(
str(key) if isinstance(key, int) else "'%s'" % key
for key in keys
)
yield path + " ->> '%s' as %s" % (field, field)
Given your second example, this allows you to get a list of query lines:
>>> list(format_query(treeify({ "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] })))
["'content_json' ->> 'name' as name",
"'content_json' -> 'items' -> 1 ->> 'item_name' as item_name",
"'content_json' -> 'items' -> 1 ->> 'price' as price",
"'content_json' -> 'items' -> 1 ->> 'units' as units",
"'content_json' -> 'items' -> 2 ->> 'item_name' as item_name",
"'content_json' -> 'items' -> 2 ->> 'price' as price",
"'content_json' -> 'items' -> 2 ->> 'units' as units"]
Solution 2
You can use recursion:
def format_query(d):
if all(not isinstance(i, tuple) for i in d):
return 'select\n{}\nfrom table_with_json'.format(',\n'.join('\tcontent_json {}'.format("->> '{}' as {}".format(i[0], i[0]) if len(i) == 1 else "-> {} ->> '{}' as {}".format(' -> '.join("'{}'".format(j) for j in i[:-1]), i[-1], i[-1])) for i in d))
return '\n\n'.join(format_query([c for b in i for c in b]) for i in d)
def get_dict(d, c = []):
for a, b in d.items():
if not isinstance(b, (dict, list)):
yield c+[a]
elif isinstance(b, dict):
yield from to_query(b, c+[a])
def to_query(d, q = []):
if not any(isinstance(i, list) for i in d.values()):
yield from get_dict(d, c=q)
else:
_c = list(get_dict(d))
for a, b in d.items():
if isinstance(b, list):
for i, j in enumerate(b, 1):
yield (_c, list(get_dict(j, [a, i])))
Now, to format:
json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }
print(format_query(list(to_query(json_file))))
Output:
select
content_json ->> 'name' as name,
content_json -> 'items' -> '1' ->> 'item_name' as item_name,
content_json -> 'items' -> '1' ->> 'price' as price,
content_json -> 'items' -> '1' ->> 'units' as units
from table_with_json
select
content_json ->> 'name' as name,
content_json -> 'items' -> '2' ->> 'item_name' as item_name,
content_json -> 'items' -> '2' ->> 'price' as price,
content_json -> 'items' -> '2' ->> 'units' as units
from table_with_json
Comments
-
balkon16 almost 2 years
Let's say that we have the following JSON file. For the sake of the example it's emulated by a string. The string is the input and a
Tree
object should be the output. I'll be using the graphical notation of a tree to present the output.I've found the following classes to handle tree concept in Python:
class TreeNode(object): def __init__(self, data): self.data = data self.children = [] def add_child(self, obj): self.children.append(obj) def __str__(self, level=0): ret = "\t"*level+repr(self.data)+"\n" for child in self.children: ret += child.__str__(level+1) return ret def __repr__(self): return '<tree node representation>' class Tree: def __init__(self): self.root = TreeNode('ROOT') def __str__(self): return self.root.__str__()
The input file can be of different complexity:
Simple case
Input:
json_file = '{"item1": "end1", "item2": "end2"}'
Output:
"ROOT" item1 end1 item2 end2
Embedded case
Input:
json_file = {"item1": "end1", "item2": {"item3": "end3"}}
Output:
"ROOT" item1 end1 item2 item3 end3
Array case
Input:
json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }
Output:
"ROOT" name John items 1 item_name lettuce price 2.65 units no 2 item_name ketchup price 1.51 units litres
Please note that each item in an array is described with an integer (starting at 1).
So far I've managed to come up with the following function that solves the problem for the simple case. In terms of the embedded case I know that I must use recursion but so far I get
UnboundLocalError: local variable 'tree' referenced before assignment
.def create_tree_from_JSON(json, parent=None): if not parent: tree = Tree() node_0 = TreeNode("ROOT") tree.root = node_0 parent = node_0 else: parent = parent for key in json: if isinstance(json[key], dict): head = TreeNode(key) create_tree_from_JSON(json[key], head) else: node = TreeNode(key) node.add_child(TreeNode(json[key])) parent.add_child(node) return tree
Problem's background
You may wonder why would I need to change a JSON object into a tree. As you may know PostgreSQL provides a way to handle JSON fields in the database. Given a JSON object I can get the value of any field by using
->
and->>
notation. Here and here more about the subject. I will be creating new tables based on the fields' names and values. Unfortunately the JSON objects vary to such an extent that I cannot write the.sql
code manually - I must find a way to do it automatically.Let's assume that I want to create a table based on the embedded case. I need to get the following
.sql
code:select content_json ->> 'item1' as end1, content_json -> 'item_2' ->> 'item_3' as end3 from table_with_json
Substitute
content_json
for"ROOT"
and you can see that each line in SQL code is simply a depth-first traversal from "ROOT" to a leaf (move from the last node to leaf is always annotated with->>
).EDIT: In order to make the question more clear I'm adding the target
.sql
query for the array case. I would like there to be as many queries as there are elements in the array:select content_json ->> 'name' as name, content_json -> 'items' -> 1 -> 'item_name' as item_name, content_json -> 'items' -> 1 -> 'price' as price, content_json -> 'items' -> 1 -> 'units' as units from table_with_json select content_json ->> 'name' as name, content_json -> 'items' -> 2 ->> 'item_name' as item_name, content_json -> 'items' -> 2 ->> 'price' as price, content_json -> 'items' -> 2 ->> 'units' as units from table_with_json
Solution so far (07.05.2019)
I'm testing the current solution for the moment:
from collections import OrderedDict def treeify(data) -> dict: if isinstance(data, dict): # already have keys, just recurse return OrderedDict((key, treeify(children)) for key, children in data.items()) elif isinstance(data, list): # make keys from indices return OrderedDict((idx, treeify(children)) for idx, children in enumerate(data, start=1)) else: # leave node, no recursion return data def format_query(tree, stack=('content_json',)) -> str: if isinstance(tree, dict): # build stack of keys for key, child in tree.items(): yield from format_query(child, stack + (key,)) else: # print complete stack, discarding leaf data in tree *keys, field = stack path = ' -> '.join( str(key) if isinstance(key, int) else "'%s'" % key for key in keys ) yield path + " ->> '%s' as %s" % (field, field) def create_select_query(lines_list): query = "select\n" for line_number in range(len(lines_list)): if "_class" in lines_list[line_number]: # ignore '_class' fields continue query += "\t" + lines_list[line_number] if line_number == len(lines_list)-1: query += "\n" else: query += ",\n" query += "from table_with_json" return query
I'm currently working on a JSON like this:
stack_nested_example = {"_class":"value_to_be_ignored","first_key":{"second_key":{"user_id":"123456","company_id":"9876","question":{"subject":"some_subject","case_type":"urgent","from_date":{"year":2011,"month":11,"day":11},"to_date":{"year":2012,"month":12,"day":12}},"third_key":[{"role":"driver","weather":"great"},{"role":"father","weather":"rainy"}]}}}
In the output I get the only constant element is the order of lines treated with array logic. Order of other lines differs. The output I would like to get is the one that takes into account order of the keys:
select 'content_json' -> 'first_key' -> 'second_key' ->> 'user_id' as user_id, 'content_json' -> 'first_key' -> 'second_key' ->> 'company_id' as company_id, 'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'subject' as subject, 'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'case_type' as case_type, 'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'year' as year, 'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'month' as month, 'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'day' as day, 'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'year' as year, 'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'month' as month, 'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'day' as day, 'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'role' as role, 'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'weather' as weather, 'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'role' as role, 'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'weather' as weather from table_with_json
-
balkon16 about 5 yearsThanks for the comment. Your answer works for printing out the tree but as I've pointed in the question I need to generate a tree and I'm using graphical (with tabs) representation to present output I'd like to obtain :)
-
Ajax1234 about 5 years@balkon16 Opps, thank you for pointing that out. What would your SQL query look like for the input with the list?
-
Ajax1234 about 5 years@balkon16 Thank you. In the inputs with no arrays, the values of the final key are included in the result i.e
content_json -> 'item_1' ->> 'end1' as end1,
. However, with the array, you only keep the key,content_json ->> 'name' as name,
, and not provide the value,content_json -> 'name' ->> 'John' as John
. Can you clarify this for me? Thank you. -
balkon16 about 5 yearsthis is my mistake. The
.sql
query for the embedded case (no arrays) follows the same logic as the array case. -
balkon16 about 5 yearsThank you for your input. Please note that I've changed the logic in the embedded case. The last item in the chain is the last key preceded by
->>
. -
Ajax1234 about 5 years@balkon16 Thanks for your patience :) Please see my recent edit.
-
balkon16 about 5 yearsI get the
NameError: name 'keys' is not defined
while trying to replicate your answer -
balkon16 about 5 yearsI get
SyntaxError: invalid syntax
and the following location is indicated by the Python parser:...'\tcontent_json {}'.format(f"->> '{i[0]}' as {i[0]}"
(the last quotation mark) -
MisterMiyagi about 5 years@balkon16 My bad,
format_query
should work now. -
balkon16 about 5 yearsThanks for the edit. It turns out that
'content_json'
must be unquoted (content_json
). I decided to go with a rather unelegant solution:path = ' -> '.join(str(key) if isinstance(key, int) else "'%s'" % key for key in keys).replace("'content_json'", "content_json")
(addedreplace(...
part. Do you happen to know a more elegant solution? -
balkon16 about 5 yearsAnother question: is a there a way to change the current implementation that uses
dict
into one that usescollections.OrderedDict
so that the order of the fields in JSON file is preserved? -
MisterMiyagi about 5 years@balkon16 If you are using a sufficiently recent version of python,
dict
is insertion-ordered. Practically all current PyPy version should do, and CPython from 3.6 onwards as well. -
balkon16 about 5 yearsI'm aiming at my solution working for Python 3.5.2, which doesn't support insertion-ordered
dict
s -
MisterMiyagi about 5 years@balkon16 You can convert the dict comprehensions to generator expressions inside an ordered dict. Basically replace
{a: b ...}
byOrderedDict((a, b) ...)
. For exampleOrderedDict((key, treeify(children)) for key, children in data.items())
for the first dict comprehension. -
Ajax1234 about 5 years@balkon16 That is the original code used
f-strings
, which are only available in Python3.7. Please try now, as I replaced thef-strings
with general string formatting. -
balkon16 almost 5 yearsHmmm that's weird. Now I get the
IndexError: tuple index out of range
-
balkon16 almost 5 yearseven though I've added the
OrderedDict()
in both of the comprehensions but still end up with the solution (please see the edit from 07.05.2019) which doesn't keep the order of items -
MisterMiyagi almost 5 years@balkon16 If you load your JSON to a
dict
, you loose order there already. You must load the JSON using OrderedDict to represent objects with ordering: stackoverflow.com/questions/6921699/… -
Ajax1234 almost 5 years@balkon16 Strange. Please post the input that raises the error.
-
balkon16 almost 5 yearsIt's the "Now, to format:" part of your answer:
json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }
and thenprint(format_query(list(to_query(json_file))))
-
Ajax1234 almost 5 years@balkon16 Ah, the change from
f-string
s tostr.format
broke part of the formatting. It works for me now. Sorry for the delay!