fastest way to create JSON to reflect a tree structure in Python / Django using mptt


Solution 1

I suspect by far the biggest slowdown is that this will do 1 database query per node. The json rendering is trivial in comparison to the hundreds of round-trips to your database.

You should cache the children on each node so that those queries can be done all at once. django-mptt has a cache_tree_children() function you can do this with.

import json
from mptt.templatetags.mptt_tags import cache_tree_children

def recursive_node_to_dict(node):
    result = {
        'id': node.pk,
        'name': node.name,
    }
    children = [recursive_node_to_dict(c) for c in node.get_children()]
    if children:
        result['children'] = children
    return result

root_nodes = cache_tree_children(Node.objects.all())
dicts = []
for n in root_nodes:
    dicts.append(recursive_node_to_dict(n))

print(json.dumps(dicts, indent=4))

Custom json encoding, while it might provide a slight speedup in some scenarios, is something I'd highly discourage, as it will be a lot of code, and it's something that's easy to get very wrong.

Solution 2

Your updated version looks like it would have very little overhead. I think it would be slightly more efficient (and more readable, too!) to use a list comprehension:

def serializable_object(node):
    "Recurse into tree to build a serializable object"
    obj = {
        'name': node.name,
        'children': [serializable_object(ch) for ch in node.get_children()]
    }
    return obj

Besides that, all you can do is profile it to find the bottlenecks. Write some standalone code that loads and serializes your 300 nodes and then run it with

python -m profile serialize_benchmark.py

(or -m cProfile if that works better).
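
Such a serialize_benchmark.py could look roughly like this (a sketch; the settings module and import paths are placeholders to adjust for your project):

import os
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')  # placeholder

import django
if hasattr(django, 'setup'):
    django.setup()  # needed on newer Django versions

import json
from myapp.models import Node  # placeholder app/model path

def serializable_object(node):
    return {
        'name': node.name,
        'children': [serializable_object(ch) for ch in node.get_children()],
    }

root = Node.objects.filter(parent=None)[0]  # assumes at least one root node
print(json.dumps(serializable_object(root)))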

I can see 3 different potential bottlenecks:

  • DB access (.get_children() and .name) -- I'm not sure exactly what's going on under the hood, but I've had code like this that does a DB query for each node, adding tremendous overhead. If that's your problem, you can probably configure this to do an "eager load" using select_related or something similar.
  • function call overhead (e.g. serializable_object itself) -- Just make sure ncalls for serializable_object looks like a reasonable number. If I understand your description, it should be in the neighborhood of 300.
  • serializing at the end (json.dumps(nodeInstance)) -- Not a likely culprit since you said it's only 300 nodes, but if you do see this taking up a lot of execution time, make sure you have the compiled speedups for JSON working properly (see the quick check just below this list).
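
A quick way to check whether the compiled C speedups for the json module are in use:

import json

# both of these are None when only the pure-Python fallback is available
print(json.encoder.c_make_encoder is not None)
print(json.decoder.c_scanstring is not None)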

If you can't tell much from profiling it, make a stripped-down version that, say, recursively calls node.name and node.get_children() but doesn't store the results in a data structure, and see how that compares.
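
For instance, the stripped-down version could be as simple as this:

def touch_tree(node):
    # hit the same attributes and queries as the serializer,
    # but don't build any data structure
    node.name
    for child in node.get_children():
        touch_tree(child)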


Update: There are 2192 calls to execute_sql in solution 3 and 2192 in solution 5, so I think excessive DB queries are the problem and that select_related did nothing the way it's used above. Looking at django-mptt issue #88 (Allow select_related in model methods) suggests that you're using it more-or-less right, but I have my doubts, and get_children vs. get_descendants might make a huge difference.
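
As a side note, a quick way to count the queries a given variant issues (assuming DEBUG=True so that Django records them) is django.db.connection.queries:

from django.db import connection, reset_queries

reset_queries()
serializable_object(nodeInstance)   # or any of the other variants
print(len(connection.queries))      # number of SQL statements issued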

There's also a ton of time being taken up by copy.deepcopy, which is puzzling because you're not directly calling it, and I don't see it called from the MPTT code. What's tree.py?

If you're doing a lot of work with profiling, I'd highly recommend RunSnakeRun, a really slick tool that lets you view your profile data in a convenient grid form and make sense of it more quickly.
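
For example (assuming RunSnakeRun is installed, which provides the runsnake command), dump the profile to a file and open it:

python -m cProfile -o serialize.prof serialize_benchmark.py
runsnake serialize.prof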

Anyway, here's one more attempt at streamlining the DB side of things:

def serializable_object(node):
    # map pk -> already-built dict so children can be grafted onto their parent
    obj_cache = {}
    root_obj = {'name': node.get_wbs_code(), 'wbsCode': node.get_wbs_code(),
            'id': node.pk, 'level': node.level, 'position': node.position,
            'children': []}
    obj_cache[node.pk] = root_obj
    # don't know if the following .select_related() does anything...
    for descendant in node.get_descendants().select_related():
        # get_descendants supposedly traverses in "tree order", which I think
        # means the parent obj will always be created already
        parent_obj = obj_cache[descendant.parent.pk]    # hope parent is cached
        descendant_obj = {'name': descendant.get_wbs_code(),
            'wbsCode': descendant.get_wbs_code(), 'id': descendant.pk,
            'level': descendant.level, 'position': descendant.position,
            'children': []}
        parent_obj['children'].append(descendant_obj)
        obj_cache[descendant.pk] = descendant_obj
    return root_obj

Note this is no longer recursive. It proceeds iteratively through nodes, theoretically after their parents have been visited, and it's all using one big call to MPTTModel.get_descendants(), so hopefully that's well-optimized and caches .parent, etc. (or maybe there's a more direct way to get at the parent key?). It creates each obj with no children initially, then "grafts" all the values to their parents afterwards.
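
For completeness, a minimal usage sketch (assuming root is the root Node instance, as elsewhere in the question):

import json
jsonTree = json.dumps(serializable_object(root), indent=4)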



Author: Thomas Kremmel

Updated on June 17, 2022

Comments

  • Thomas Kremmel about 2 years

    What's the fastest way in Python (Django) to create JSON based upon a Django queryset? Note that parsing it in the template as proposed here is not an option.

    The background is that I created a method which loops over all nodes in a tree, but it is already terribly slow when converting about 300 nodes. The first (and probably the worst) idea that came to my mind was to create the JSON somehow "manually". See the code below.

    #! Solution 1 !!#
    from django.utils.encoding import smart_str, smart_unicode

    def quoteStr(input):
        return "\"" + smart_str(smart_unicode(input)) + "\""

    def createJSONTreeDump(user, node, root=False, lastChild=False, indent=""):
        #open tag for object
        json = str("\n" + indent + "{" +
                      quoteStr("name") + ": " + quoteStr(node.name) + ",\n" +
                      quoteStr("id") + ": " + quoteStr(node.pk) + ",\n"
                   )

        childrenTag = "children"
        children = node.get_children()
        if children.count() > 0:
            #create children array opening tag
            json += str(indent + quoteStr(childrenTag) + ": [")
            for idx, child in enumerate(children):
                if (idx + 1) == children.count():
                    #recursive call for the last child
                    json += createJSONTreeDump(user, child, False, True, indent)
                else:
                    #recursive call
                    json += createJSONTreeDump(user, child, False, False, indent)
            #add children closing tag
            json += "]\n"

        #closing tag for object
        if lastChild == False:
            #more children following, add ","
            json += indent + "},\n"
        else:
            #last child, do not add ","
            json += indent + "}\n"
        return json
    

    The tree structure to be rendered is a tree built with mptt, where the call .get_children() returns all children of a node.

    The model is as simple as this, with mptt taking care of everything else.

    class Node(MPTTModel, ExtraManager):
        """
        Representation of a single node
        """ 
        name = models.CharField(max_length=200)
        parent = TreeForeignKey('self', null=True, blank=True, related_name='%(app_label)s_%(class)s_children')
    

    The expected JSON result is embedded in the template like this: var root = {{ jsonTree|safe }}
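
    For illustration, the expected nested structure looks roughly like this (hypothetical values):

    {
        "name": "root node",
        "id": "1",
        "children": [
            {"name": "child a", "id": "2"},
            {"name": "child b", "id": "3"}
        ]
    }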

    Edit: Based upon this answer I created the following code (definitely the better code), but it feels only slightly faster.

    Solution 2:

    def serializable_object(node):
        "Recurse into tree to build a serializable object"
        obj = {'name': node.name, 'id': node.pk, 'children': []}
        for child in node.get_children():
            obj['children'].append(serializable_object(child))
        return obj
    
    import json
    jsonTree = json.dumps(serializable_object(nodeInstance))
    

    Solution 3:

    def serializable_object_List_Comprehension(node):
        "Recurse into tree to build a serializable object"
        obj = {
            'name': node.name,
            'id': node.pk,
            # note: this recurses via serializable_object (solution 2), so the
            # list comprehension is only used at the top level
            'children': [serializable_object(ch) for ch in node.get_children()]
        }
        return obj
    

    Solution 4:

    def recursive_node_to_dict(node):
        result = {
            'name': node.name, 'id': node.pk
        }
        # the trailing comma below turns the list into a one-element tuple,
        # which is what produces the double [[ ]] discussed further down
        children = [recursive_node_to_dict(c) for c in node.get_children()],
        if children is not None:   # a list/tuple is never None, so this is always True
            result['children'] = children
        return result

    from mptt.templatetags.mptt_tags import cache_tree_children
    root_nodes = cache_tree_children(root.get_descendants())
    dicts = []
    for n in root_nodes:
        # note: this always serializes root_nodes[0] instead of n, and
        # json.dumps runs once per iteration instead of once after the loop
        dicts.append(recursive_node_to_dict(root_nodes[0]))
        jsonTree = json.dumps(dicts, indent=4)
    

    Solution 5 (makes use of select_related to prefetch; I am not sure whether it is used correctly):

    def serializable_object_select_related(node):
        "Recurse into tree to build a serializable object, make use of select_related"
        obj = {'name': node.get_wbs_code(), 'wbsCode': node.get_wbs_code(),
               'id': node.pk, 'level': node.level, 'position': node.position,
               'children': []}
        # note: the recursion goes through serializable_object (solution 2), so
        # select_related() is only applied at this first level
        for child in node.get_children().select_related():
            obj['children'].append(serializable_object(child))
        return obj
    

    Solution 6 (improved solution 4, using caching of child nodes):

    def recursive_node_to_dict(node):
        return {
            'name': node.name, 'id': node.pk,
             # Notice the use of node._cached_children instead of node.get_children()
            'children' : [recursive_node_to_dict(c) for c in node._cached_children]
        }
    

    Called via:

    from mptt.templatetags.mptt_tags import cache_tree_children
    subTrees = cache_tree_children(root.get_descendants(include_self=True))
    subTreeDicts = []
    for subTree in subTrees:
        subTree = recursive_node_to_dict(subTree)
        subTreeDicts.append(subTree)
    jsonTree = json.dumps(subTreeDicts, indent=4)
    #optional clean-up: remove the [ ] at the beginning and the end; this is needed for D3.js
    jsonTree = jsonTree[1:-1]
    

    Below you can see the profiling results, created using cProfile as suggested by MuMind, setting up a Django view to start the stand-alone method profileJSON(), which in turn calls the different solutions to create the JSON output.

    def startProfileJSON(request):
        print("startProfileJSON")
        import cProfile
        cProfile.runctx('profileJSON()', globals=globals(), locals=locals())
        print("endProfileJSON")
    

    Results:

    Solution 1: 3350347 function calls (3130372 primitive calls) in 4.969 seconds (details)

    Solution 2: 2533705 function calls (2354516 primitive calls) in 3.630 seconds (details)

    Solution 3: 2533621 function calls (2354441 primitive calls) in 3.684 seconds (details)

    Solution 4: 2812725 function calls (2466028 primitive calls) in 3.840 seconds (details)

    Solution 5: 2536504 function calls (2357256 primitive calls) in 3.779 seconds (details)

    Solution 6 (Improved solution 4): 2593122 function calls (2299165 primitive calls) in 3.663 seconds (details)

    Discussion:

    Solution 1: own encoding implementation; a bad idea.

    Solution 2 + 3: currently the fastest, but still painfully slow

    Solution 4: looks promising with its caching of children, but performs similarly and currently produces invalid JSON, as the children are wrapped in a double []:

    "children": [[]] instead of "children": []
    

    Solution 5: the use of select_related does not make a difference, though it is probably used in the wrong way, as a node always has a ForeignKey to its parent and we are traversing from root to children.

    Update: Solution 6: it looks like the cleanest solution to me, using caching of child nodes, but it only performs similarly to solutions 2 + 3, which is strange to me.

    Does anybody have more ideas for performance improvements?

    • Samy Vilar almost 12 years
    • Thomas Kremmel almost 12 years
      Thanks for the hint. I went through it but could not get my head around it regarding the recursive serialization. But your hint made me search for serialization + tree + django, and the first hit is this question, which looks like a solution to what I'm looking for: stackoverflow.com/questions/5597136/… Will give it a try!
    • Mu Mind almost 12 years
      Looks like your main bottleneck in the first version of the code above will be repeatedly appending to a string, which is known to be a very inefficient operation since it has to keep reallocating contiguous blocks of memory. Appending strings to a list and doing ''.join(pieces) at the end is going to perform much better (a small sketch of this pattern follows after these comments).
    • Thomas Kremmel almost 12 years
      Thanks for the hint. Will try it tomorrow. Time to rest.
    • Winston Ewert almost 12 years
      Of all the nodes in your database, how many do you actually dump to JSON using this method?
    • Thomas Kremmel almost 12 years
      For testing purposes currently 300, but in production it could be up to 1,000 or more.
  • Thomas Kremmel almost 12 years
    If I understand you right, this is what I'm doing in solution 2. See the update to my question above.
  • Thomas Kremmel almost 12 years
    Thanks! I checked the speedups and they seem to be there, using Python 2.7: from _json import scanstring as c_scanstring.
  • Thomas Kremmel almost 12 years
    Thanks. Could you check solution 4 in my code and propose how to prevent double lists for children? The children are currently packed into two [] instead of only one.
  • craigds almost 12 years
    Done. It had a trailing comma which Python happily turned into a tuple :s
  • Thomas Kremmel almost 12 years
    Thanks. I used your code, as it is the most elegant, and rewrote it a bit. See solution 6 above. One important thing is to make use of node._cached_children instead of node.get_children() to make use of the caching done by cache_tree_children. Anyhow, the code does not perform better than solution 2, which is strange to me as your argument about hitting the DB only once is convincing. Any idea why it does not perform better?
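
Mu Mind's string-building advice above, sketched for reference (a minimal illustration, not part of any of the solutions):

pieces = []
for node_name in ['a', 'b', 'c']:
    pieces.append(node_name)    # appending to a list is cheap
result = ''.join(pieces)        # join once at the end instead of repeated +=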