How to flatten a pandas dataframe with some columns as json?

python json pandas dataframe flatten

61,136

Solution 1

Here's a solution using json_normalize() again by using a custom function to get the data in the correct format understood by json_normalize function.

import ast
from pandas.io.json import json_normalize

def only_dict(d):
    '''
    Convert json string representation of dictionary to a python dict
    '''
    return ast.literal_eval(d)

def list_of_dicts(ld):
    '''
    Create a mapping of the tuples formed after 
    converting json strings of list to a python list   
    '''
    return dict([(list(d.values())[1], list(d.values())[0]) for d in ast.literal_eval(ld)])

A = json_normalize(df['columnA'].apply(only_dict).tolist()).add_prefix('columnA.')
B = json_normalize(df['columnB'].apply(list_of_dicts).tolist()).add_prefix('columnB.pos.')

Finally, join the DFs on the common index to get:

df[['id', 'name']].join([A, B])

EDIT:- As per the comment by @MartijnPieters, the recommended way of decoding the json strings would be to use json.loads() which is much faster when compared to using ast.literal_eval() if you know that the data source is JSON.

Solution 2

The quickest seems to be:

import pandas as pd
import json

json_struct = json.loads(df.to_json(orient="records"))    
df_flat = pd.io.json.json_normalize(json_struct) #use pd.io.json

Solution 3

TL;DR Copy-paste the following function and use it like this: flatten_nested_json_df(df)

This is the most general function I could come up with:

def flatten_nested_json_df(df):

    df = df.reset_index()

    print(f"original shape: {df.shape}")
    print(f"original columns: {df.columns}")


    # search for columns to explode/flatten
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()

    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    print(f"lists: {list_columns}, dicts: {dict_columns}")
    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            print(f"flattening: {col}")
            # explode dictionaries horizontally, adding new columns
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns) # inplace

        for col in list_columns:
            print(f"exploding: {col}")
            # explode lists vertically, adding new columns
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        # check if there are still dict o list fields to flatten
        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()

        print(f"lists: {list_columns}, dicts: {dict_columns}")

    print(f"final shape: {df.shape}")
    print(f"final columns: {df.columns}")
    return df

It takes a dataframe that may have nested lists and/or dicts in its columns, and recursively explodes/flattens those columns.

It uses pandas' pd.json_normalize to explode the dictionaries (creating new columns), and pandas' explode to explode the lists (creating new rows).

Simple to use:

# Test
df = pd.DataFrame(
    columns=['id','name','columnA','columnB'],
    data=[
        [1,'John',{"dist": "600", "time": "0:12.10"},[{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]],
        [2,'Mike',{"dist": "600"},[{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "total", "value": "800"}]]
    ])

flatten_nested_json_df(df)

It's not the most efficient thing on earth, and it has the side effect of resetting your dataframe's index, but it gets the job done. Feel free to tweak it.

Solution 4

create a custom function to flatten columnB then use pd.concat

def flatten(js):
    return pd.DataFrame(js).set_index('pos').squeeze()

pd.concat([df.drop(['columnA', 'columnB'], axis=1),
           df.columnA.apply(pd.Series),
           df.columnB.apply(flatten)], axis=1)

View more solutions

61,136

Author by

sfactor

Dreamer, Analyst, Engineer, Programmer, Photographer.

Updated on July 17, 2022

Comments

sfactor almost 2 years

I have a dataframe df that loads data from a database. Most of the columns are json strings while some are even list of jsons. For example:

id     name     columnA                               columnB
1     John     {"dist": "600", "time": "0:12.10"}    [{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]
2     Mike     {"dist": "600"}                       [{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "total", "value": "800"}]
...

As you can see, not all the rows have the same number of elements in the json strings for a column.

What I need to do is keep the normal columns like id and name as it is and flatten the json columns like so:

id    name   columnA.dist   columnA.time   columnB.pos.1st   columnB.pos.2nd   columnB.pos.3rd     columnB.pos.total
1     John   600            0:12.10        500               300               200                 1000 
2     Mark   600            NaN            500               300               Nan                 800

I have tried using json_normalize like so:

from pandas.io.json import json_normalize
json_normalize(df)

But there seems to be some problems with keyerror. What is the correct way of doing this?

sfactor over 7 years

Great thanks for the answer ! one thing though, is the returned lists on the list_of_dicts (list(d.values())[0], list(d.values())[1]), and not the other way round? Otherwise this worked perfect for me.
Nickil Maveli over 7 years

As you would know that dictionaries do not preserve the order while performing iteration, the values present in the dict were appearing in the order opposite to that of yours and hence was the need to use the slicing notation differently compared to yours. If it's appearing in the same order as you've mentioned, go ahead with it or you can even make use of an Ordered Dict to preserve the order if you want to.
Martijn Pieters over 6 years

Why the (slow!) ast.literal_eval() call when you should be using json.loads()? The latter handles correct JSON data, the former only Python syntax, which differs materially when it comes to booleans, nulls and unicode data outside of the BMP.
Nickil Maveli over 6 years

@MartijnPieters: Thanks for the comment. I've updated my post.
Martijn Pieters over 6 years

Not only is it faster, it'll also avoid ValueError exceptions when true, false or null values are involved. JSON is not Python.
Durga Gaddam about 6 years

@NickilMaveli how do I use this if I have nan value in column A, I am getting an error malformed node or string: nan when trying to use the only_dict function, but when I remove the 'nan' values from the column A it is fine. Thank you
E. Zeytinci over 4 years

If your data contains null, you can update the only_dict method to: return ast.literal_eval(d) if pd.notnull(d) else {} Otherwise, it returns ValueError: malformed node or string: nan
TheTiGuR about 4 years

This was definitely the simplest method and the one that worked for me. Only caveat is your nested objects will end up with long names (data.level1.level2.level3 ...etc)
Atlas7 over 3 years

This is definitely my chosen answer - works perfectly and very concise solution. Thanks!
Serge de Gosson de Varennes almost 3 years

This is BY FAR the best solution I have seen in a long time! Good job!
Cameron Stewart over 2 years

Hi there, this is helpful, but doesnt seem to save the new dataframe
Michele Piccolini over 2 years

@CameronStewart save where?
Atharv Thakur almost 2 years

It gives me error msg.format(req_len=len(left.columns), given_len=len(right)) ValueError: Unable to coerce to Series, length must be 44: given 1
Srivatsan almost 2 years

This is the best answer!