Create a Pandas DataFrame from deeply nested JSON

python json pandas

15,973

I think organizing your data in way that yields repeating column names is only going to create headaches for you later on down the road. A better approach IMHO is to create a column for each of pivots, interval_id, and p_value. This will make extremely easy to query your data after loading it into pandas.

Also, your JSON has some errors in it. I ran it through this to find the errors.

jq helps here

import sh
jq = sh.jq.bake('-M')  # disable colorizing
json_data = "from above"
rule = """[{pivots: .intervals[].pivots, 
            interval_id: .intervals[].series[].interval_id,
            p_value: .intervals[].series[].p_value}]"""
out = jq(rule, _in=json_data).stdout
res = pd.DataFrame(json.loads(out))

This will yield output similar to

    interval_id       p_value      pivots
32            2  2.867501e-06  Jane Smith
33            2  1.000000e+00  Jane Smith
34            2  1.116279e-08  Jane Smith
35            2  2.867501e-06  Jane Smith
36            0  1.000000e+00   Bob Smith
37            0  1.116279e-08   Bob Smith
38            0  2.867501e-06   Bob Smith
39            0  1.000000e+00   Bob Smith
40            0  1.116279e-08   Bob Smith
41            0  2.867501e-06   Bob Smith
42            1  1.000000e+00   Bob Smith
43            1  1.116279e-08   Bob Smith

Adapted from this comment

Of course, you can always call res.drop_duplicates() to remove the duplicate rows. This gives

In [175]: res.drop_duplicates()
Out[175]:
    interval_id       p_value      pivots
0             0  1.000000e+00  Jane Smith
1             0  1.116279e-08  Jane Smith
2             0  2.867501e-06  Jane Smith
6             1  1.000000e+00  Jane Smith
7             1  1.116279e-08  Jane Smith
8             1  2.867501e-06  Jane Smith
12            2  1.000000e+00  Jane Smith
13            2  1.116279e-08  Jane Smith
14            2  2.867501e-06  Jane Smith
36            0  1.000000e+00   Bob Smith
37            0  1.116279e-08   Bob Smith
38            0  2.867501e-06   Bob Smith
42            1  1.000000e+00   Bob Smith
43            1  1.116279e-08   Bob Smith
44            1  2.867501e-06   Bob Smith
48            2  1.000000e+00   Bob Smith
49            2  1.116279e-08   Bob Smith
50            2  2.867501e-06   Bob Smith

[18 rows x 3 columns]

15,973

idclark

Updated on September 15, 2022

Comments

idclark over 1 year

I'm trying to create a single Pandas DataFrame object from a deeply nested JSON string.

The JSON schema is:

{"intervals": [
{
pivots: "Jane Smith",
"series": [
    {
        "interval_id": 0,
        "p_value": 1
       },
     {
         "interval_id": 1,
         "p_value": 1.1162791357932633e-8
     },
   {
        "interval_id": 2,
        "p_value": 0.0000028675012051504467
     }
    ],
   },
  {

"pivots": "Bob Smith",
  "series": [
       {
            "interval_id": 0,
            "p_value": 1
           },
         {
             "interval_id": 1,
            "p_value": 1.1162791357932633e-8
         },
       {
            "interval_id": 2,
            "p_value": 0.0000028675012051504467
         }
       ]
     }
    ]
 }

Desired Outcome I need to flatten this to produce a table:

Actor Interval_id Interval_id Interval_id ... 
Jane Smith      1         1.1162        0.00000 ... 
Bob Smith       1         1.1162        0.00000 ...

The first column is the Pivots values, and the remaining columns are the values of the keys interval_id and p_value stored in the list series.

So far i've got

import requests as r
import pandas as pd
actor_data = r.get("url/to/data").json['data']['intervals']
df = pd.DataFrame(actor_data)

actor_data is a list where the length is equal to the number of individuals ie pivots.values(). The df object simply returns

<bound method DataFrame.describe of  pivots             Series
0           Jane Smith  [{u'p_value': 1.0, u'interval_id': 0}, {u'p_va...
1           Bob Smith  [{u'p_value': 1.0, u'interval_id': 0}, {u'p_va...
.
.
.

How can I iterate through that series list to get to the dict values and create N distinct columns? Should I try to create a DataFrame for the series list, reshape it,and then do a column bind with the actor names?

UPDATE:

pvalue_list = [i['p_value'] for i in json_data['series']]

this gives me a list of lists. Now I need to figure out how to add each list as a row in a DataFrame.

value_list = []
for i in pvalue_list:
    pvs = [j['p_value'] for j in i]
    value_list = value_list.append(pvs)
return value_list

This returns a NoneType

Solution

def get_hypthesis_data():
    raw_data = r.get("/url/to/data").json()['data']
    actor_dict = {}
    for actor_series in raw_data['intervals']:
        actor = actor_series['pivots']
        p_values = []
        for interval in actor_series['series']:
            p_values.append(interval['p_value'])
        actor_dict[actor] = p_values
    return pd.DataFrame(actor_dict).T

This returns the correct DataFrame. I transposed it so the individuals were rows and not columns.

idclark about 10 years

wow good call on using jq! I feel like i'm getting closer. If I change the jq rule to rule = """[.intervals[].series]""" that will allow me to build a DataFrame with the correct N x K dimensions. The problem is that each cell is a dict eg {u'p_value': 1, u'interval_id': 0} instead of just the value of p_value.
Phillip Cloud about 10 years

That shouldn't be a problem for DataFrame.
idclark about 10 years

Is there a way to change the values of the cells so that they contain only the value of p_value and not the entire dictionary?
Phillip Cloud about 10 years

I'm not sure what you mean. Can you give me an example of what you're talking about?
idclark about 10 years

sure. I'm trying to build a DataFrame where the columns are time intervals and the rows are individuals. within each cell should be a single number (p_value) eg 0.5. Currently I have the correct dimensions; 22 rows (1 for each person) and 27 columns (1 for each time-period + 1 for person's name). The problem: for row Jane Smith at time_period 1 the value of the cell is {u'p_value: 0.034, u'interval_id: 1}. I'm trying to get the value of the cell to be 0.034 instead of the entire dict, for all time periods and individuals. Thanks for the feedback!
Anton Tarasenko almost 9 years

I'm trying to repeat this for a json file but can't properly set _in parameter for jq(). Is it possible to feed in a json file to sh?
mSSM almost 9 years

@Anton: Not sure what has changed since early 2014, but as far as I can tell jq does not accept a string as a command line argument. Instead, you can create a cat = sh.cat, and then pipe it into jq: jq(cat(_in=json_data), rule).stdout. See amoffat.github.io/sh/#piping
mSSM almost 9 years

The resulting stdout from jq has lots of \n in there and is actually a bytestring, but calling .decode('utf8') on it yields a str, which can then be easily passed to json.loads(), which in turn happily gives you a new dict.