Create a Pandas DataFrame from deeply nested JSON
I think organizing your data in a way that yields repeating column names is only going to create headaches for you later on down the road. A better approach, IMHO, is to create a column for each of pivots, interval_id, and p_value. This will make it extremely easy to query your data after loading it into pandas.
Also, your JSON has some errors in it; I ran it through a validator to find them. jq helps here:
import sh
import json
import pandas as pd

jq = sh.jq.bake('-M')  # -M disables colorized output
json_data = "from above"  # the JSON string shown in the question
rule = """[{pivots: .intervals[].pivots,
            interval_id: .intervals[].series[].interval_id,
            p_value: .intervals[].series[].p_value}]"""
out = jq(rule, _in=json_data).stdout
res = pd.DataFrame(json.loads(out))
This will yield output similar to
interval_id p_value pivots
32 2 2.867501e-06 Jane Smith
33 2 1.000000e+00 Jane Smith
34 2 1.116279e-08 Jane Smith
35 2 2.867501e-06 Jane Smith
36 0 1.000000e+00 Bob Smith
37 0 1.116279e-08 Bob Smith
38 0 2.867501e-06 Bob Smith
39 0 1.000000e+00 Bob Smith
40 0 1.116279e-08 Bob Smith
41 0 2.867501e-06 Bob Smith
42 1 1.000000e+00 Bob Smith
43 1 1.116279e-08 Bob Smith
Adapted from this comment
Of course, you can always call res.drop_duplicates() to remove the duplicate rows (the rule above iterates .intervals[] independently for each field, which produces a cross product, hence the duplicates). This gives
In [175]: res.drop_duplicates()
Out[175]:
interval_id p_value pivots
0 0 1.000000e+00 Jane Smith
1 0 1.116279e-08 Jane Smith
2 0 2.867501e-06 Jane Smith
6 1 1.000000e+00 Jane Smith
7 1 1.116279e-08 Jane Smith
8 1 2.867501e-06 Jane Smith
12 2 1.000000e+00 Jane Smith
13 2 1.116279e-08 Jane Smith
14 2 2.867501e-06 Jane Smith
36 0 1.000000e+00 Bob Smith
37 0 1.116279e-08 Bob Smith
38 0 2.867501e-06 Bob Smith
42 1 1.000000e+00 Bob Smith
43 1 1.116279e-08 Bob Smith
44 1 2.867501e-06 Bob Smith
48 2 1.000000e+00 Bob Smith
49 2 1.116279e-08 Bob Smith
50 2 2.867501e-06 Bob Smith
[18 rows x 3 columns]
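If you would rather not shell out to jq at all, the same flat table can be built with pandas alone via json_normalize. A minimal sketch, assuming the question's JSON with its syntax errors fixed (the "pivots" key quoted, the trailing comma removed):

```python
import json
import pandas as pd

# The question's JSON with its errors fixed (quoted "pivots", no trailing comma)
json_data = """{"intervals": [
    {"pivots": "Jane Smith",
     "series": [{"interval_id": 0, "p_value": 1},
                {"interval_id": 1, "p_value": 1.1162791357932633e-8},
                {"interval_id": 2, "p_value": 0.0000028675012051504467}]},
    {"pivots": "Bob Smith",
     "series": [{"interval_id": 0, "p_value": 1},
                {"interval_id": 1, "p_value": 1.1162791357932633e-8},
                {"interval_id": 2, "p_value": 0.0000028675012051504467}]}
]}"""

data = json.loads(json_data)
# record_path descends into each interval's "series" list;
# meta copies the parent's "pivots" value onto every resulting row.
res = pd.json_normalize(data['intervals'], record_path='series', meta='pivots')
```

This yields one row per (pivots, interval_id) pair with no cross product, so no drop_duplicates() call is needed.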
idclark
Updated on September 15, 2022

Comments
-
idclark over 1 year
I'm trying to create a single Pandas DataFrame object from a deeply nested JSON string.
The JSON schema is:
{"intervals": [
    { pivots: "Jane Smith",
      "series": [
        { "interval_id": 0, "p_value": 1 },
        { "interval_id": 1, "p_value": 1.1162791357932633e-8 },
        { "interval_id": 2, "p_value": 0.0000028675012051504467 }
      ],
    },
    { "pivots": "Bob Smith",
      "series": [
        { "interval_id": 0, "p_value": 1 },
        { "interval_id": 1, "p_value": 1.1162791357932633e-8 },
        { "interval_id": 2, "p_value": 0.0000028675012051504467 }
      ]
    }
] }
Desired Outcome
I need to flatten this to produce a table:

Actor       Interval_id  Interval_id  Interval_id  ...
Jane Smith  1            1.1162       0.00000      ...
Bob Smith   1            1.1162       0.00000      ...
The first column is the pivots values, and the remaining columns are the values of the keys interval_id and p_value stored in the list series. So far I've got:
import requests as r
import pandas as pd

actor_data = r.get("url/to/data").json()['data']['intervals']
df = pd.DataFrame(actor_data)
actor_data is a list whose length equals the number of individuals, i.e. pivots.values(). The df object simply returns:

<bound method DataFrame.describe of        pivots                                             Series
0  Jane Smith  [{u'p_value': 1.0, u'interval_id': 0}, {u'p_va...
1   Bob Smith  [{u'p_value': 1.0, u'interval_id': 0}, {u'p_va...
. . .
How can I iterate through that series list to get to the dict values and create N distinct columns? Should I try to create a DataFrame for the series list, reshape it, and then do a column bind with the actor names?

UPDATE:
pvalue_list = [i['p_value'] for i in json_data['series']]
this gives me a list of lists. Now I need to figure out how to add each list as a row in a DataFrame.
value_list = []
for i in pvalue_list:
    pvs = [j['p_value'] for j in i]
    value_list = value_list.append(pvs)
return value_list
This returns a NoneType
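The None comes from list.append, which mutates its list in place and returns None, so the reassignment (and the stray return outside any function) throws the data away. A corrected sketch, using a hypothetical pvalue_list shaped like the question's data:

```python
# Hypothetical list-of-lists of dicts, mirroring the question's structure
pvalue_list = [[{'p_value': 1.0, 'interval_id': 0},
                {'p_value': 1.1162791357932633e-8, 'interval_id': 1}],
               [{'p_value': 1.0, 'interval_id': 0},
                {'p_value': 1.1162791357932633e-8, 'interval_id': 1}]]

value_list = []
for i in pvalue_list:
    pvs = [j['p_value'] for j in i]
    value_list.append(pvs)  # append in place; don't reassign its None return
```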
Solution
def get_hypothesis_data():
    raw_data = r.get("/url/to/data").json()['data']
    actor_dict = {}
    for actor_series in raw_data['intervals']:
        actor = actor_series['pivots']
        p_values = []
        for interval in actor_series['series']:
            p_values.append(interval['p_value'])
        actor_dict[actor] = p_values
    return pd.DataFrame(actor_dict).T
This returns the correct DataFrame. I transposed it so the individuals were rows and not columns.
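Starting from a long-format table like the one in the answer above, the same wide layout (one row per person, one column per interval) can also be reached with pivot. A sketch over hypothetical deduplicated data:

```python
import pandas as pd

# Hypothetical long-format table like the deduplicated res above
res = pd.DataFrame({
    'pivots': ['Jane Smith'] * 3 + ['Bob Smith'] * 3,
    'interval_id': [0, 1, 2, 0, 1, 2],
    'p_value': [1.0, 1.1162791357932633e-8, 2.8675012051504467e-6] * 2,
})

# Rows become individuals, columns become intervals
wide = res.pivot(index='pivots', columns='interval_id', values='p_value')
```

pivot requires unique (index, columns) pairs, which is another reason to deduplicate first.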
-
idclark about 10 years
wow, good call on using jq! I feel like I'm getting closer. If I change the jq rule to rule = """[.intervals[].series]""", that will allow me to build a DataFrame with the correct N x K dimensions. The problem is that each cell is a dict, e.g. {u'p_value': 1, u'interval_id': 0}, instead of just the value of p_value.
-
Phillip Cloud about 10 years
That shouldn't be a problem for DataFrame.
-
idclark about 10 years
Is there a way to change the values of the cells so that they contain only the value of p_value and not the entire dictionary?
-
Phillip Cloud about 10 years
I'm not sure what you mean. Can you give me an example of what you're talking about?
-
idclark about 10 years
Sure. I'm trying to build a DataFrame where the columns are time intervals and the rows are individuals. Within each cell should be a single number (p_value), e.g. 0.5. Currently I have the correct dimensions: 22 rows (1 for each person) and 27 columns (1 for each time period + 1 for the person's name). The problem: for row Jane Smith at time_period 1, the value of the cell is {u'p_value': 0.034, u'interval_id': 1}. I'm trying to get the value of the cell to be 0.034 instead of the entire dict, for all time periods and individuals. Thanks for the feedback!
-
Anton Tarasenko almost 9 years
I'm trying to repeat this for a json file but can't properly set the _in parameter for jq(). Is it possible to feed a json file to sh?
-
mSSM almost 9 years
@Anton: Not sure what has changed since early 2014, but as far as I can tell jq does not accept a string as a command-line argument. Instead, you can create a cat = sh.cat and then pipe it into jq: jq(cat(_in=json_data), rule).stdout. See amoffat.github.io/sh/#piping
-
mSSM almost 9 years
The resulting stdout from jq has lots of \n in there and is actually a bytestring, but calling .decode('utf8') on it yields a str, which can then be easily passed to json.loads(), which in turn happily gives you a new dict.
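The decode step from that last comment, as a minimal sketch with a hypothetical bytestring standing in for jq's real stdout:

```python
import json

# Hypothetical raw output standing in for jq(...).stdout
raw = b'[{"pivots": "Jane Smith", "interval_id": 0, "p_value": 1.0}]\n'

text = raw.decode('utf8')   # bytes -> str; embedded newlines are fine
records = json.loads(text)  # str -> list of dicts
```

On Python 3.6+, json.loads also accepts bytes directly, so the explicit decode is optional.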