Flatten double nested JSON

12,404

Solution 1

This is one way to do it. Should give you some ideas.

df = pd.concat(
    [
        pd.concat([pd.Series(m) for m in t['members']], axis=1) for t in data['teams']
    ], keys=[t['teamname'] for t in data['teams']]
)

                                     0                         1
1 email          [email protected]     [email protected]
  firstname                       John                      Jane
  lastname                         Doe                       Doe
  mobile                                            916-555-7890
  orgname                         Anon                      Anon
  phone                   916-555-1234              916-555-4321
2 email      [email protected]  [email protected]
  firstname                     Mickey                     Minny
  lastname                       Moose                     Moose
  mobile                  916-555-1111                          
  orgname                      Moosers                   Moosers
  phone                   916-555-0000              916-555-2222

To get a nice table with team name and members as rows, all attributes in columns:

df.index.levels[0].name = 'teamname'
df.columns.name = 'member'

df.T.stack(0).swaplevel(0, 1).sort_index()

enter image description here

To get team name and member as actual columns, just reset the index.

df.index.levels[0].name = 'teamname'
df.columns.name = 'member'

df.T.stack(0).swaplevel(0, 1).sort_index().reset_index()

enter image description here

The whole thing

import json
import pandas as pd

json_text = """{
"teams": [
  {
    "teamname": "1",
    "members": [
      {
        "firstname": "John", 
        "lastname": "Doe",
        "orgname": "Anon",
        "phone": "916-555-1234",
        "mobile": "",
        "email": "[email protected]"
      },
      {
        "firstname": "Jane",
        "lastname": "Doe",
        "orgname": "Anon",
        "phone": "916-555-4321",
        "mobile": "916-555-7890",
        "email": "[email protected]"
      }
    ]
  },
  {
    "teamname": "2",
    "members": [
      {
        "firstname": "Mickey",
        "lastname": "Moose",
        "orgname": "Moosers",
        "phone": "916-555-0000",
        "mobile": "916-555-1111",
        "email": "[email protected]"
      },
      {
        "firstname": "Minny",
        "lastname": "Moose",
        "orgname": "Moosers",
        "phone": "916-555-2222",
        "mobile": "",
        "email": "[email protected]"
      }
    ]
  }       
]
}"""


data = json.loads(json_text)

df = pd.concat(
    [
        pd.concat([pd.Series(m) for m in t['members']], axis=1) for t in data['teams']
    ], keys=[t['teamname'] for t in data['teams']]
)

df.index.levels[0].name = 'teamname'
df.columns.name = 'member'

df.T.stack(0).swaplevel(0, 1).sort_index().reset_index()

Solution 2

Use pandas.io.json.json_normalize

json_normalize(data,record_path=['teams','members'],meta=[['teams','teamname']])

output:
         email                firstname lastname mobile      orgname    phone       teams.teamname
0   [email protected]       John    Doe                   Anon      916-555-1234    1
1   [email protected]       Jane    Doe     916-555-7890  Anon      916-555-4321    1
2   [email protected]   Mickey  Moose   916-555-1111  Moosers   916-555-0000    2
3   [email protected]    Minny   Moose                 Moosers   916-555-2222    2


Explanation

from pandas.io.json import json_normalize
import pandas as pd

I've only learned how to use the json_normalize function recently so my explanation might not be right.

Start with what I'm calling 'Layer 0'

json_normalize(data)

output:
     teams
0   [{'teamname': '1', 'members': [{'firstname': '...

There is 1 Column and 1 Row. Everything is inside the 'team' column.

Look into what I'm calling 'Layer 1' by using record_path=

json_normalize(data,record_path='teams')

output:
     members                                          teamname
0   [{'firstname': 'John', 'lastname': 'Doe', 'org...    1
1   [{'firstname': 'Mickey', 'lastname': 'Moose', ...    2

In Layer 1 we have have flattened 'teamname' but there is more inside 'members'.

Look into Layer 2 with record_path=. The notation is unintuitive at first. I now remember it by ['layer','deeperlayer'] where the result is layer.deeperlayer.

json_normalize(data,record_path=['teams','members'])

output:
           email              firstname lastname   mobile     orgname   phone
0   [email protected]      John        Doe                  Anon    916-555-1234
1   [email protected]       Jane        Doe   916-555-7890  Anon    916-555-4321
2   [email protected]   Mickey     Moose   916-555-1111 Moosers 916-555-0000
3   [email protected]    Minny       Moose               Moosers 916-555-2222

Excuse my output, I don't know how to make tables in a response.

Finally we add in Layer 1 columns using meta=

json_normalize(data,record_path=['teams','members'],meta=[['teams','teamname']])

output:
         email                firstname lastname mobile      orgname    phone       teams.teamname
0   [email protected]       John    Doe                   Anon      916-555-1234    1
1   [email protected]       Jane    Doe     916-555-7890  Anon      916-555-4321    1
2   [email protected]   Mickey  Moose   916-555-1111  Moosers   916-555-0000    2
3   [email protected]    Minny   Moose                 Moosers   916-555-2222    2

Notice how we needed a list of lists for meta=[[]] to reference Layer 1. If there was a column we want from Layer 0 and Layer 1 we could do this:

json_normalize(data,record_path=['layer1','layer2'],meta=['layer0',['layer0','layer1']])

The result of the json_normalize is a pandas dataframe.

Share:
12,404
spaine
Author by

spaine

GIS specialist and python programmer at Dept. of Fish & Wildlife.

Updated on July 16, 2022

Comments

  • spaine
    spaine almost 2 years

    I am trying to flatten a JSON file that looks like this:

    {
    "teams": [
      {
        "teamname": "1",
        "members": [
          {
            "firstname": "John", 
            "lastname": "Doe",
            "orgname": "Anon",
            "phone": "916-555-1234",
            "mobile": "",
            "email": "[email protected]"
          },
          {
            "firstname": "Jane",
            "lastname": "Doe",
            "orgname": "Anon",
            "phone": "916-555-4321",
            "mobile": "916-555-7890",
            "email": "[email protected]"
          }
        ]
      },
      {
        "teamname": "2",
        "members": [
          {
            "firstname": "Mickey",
            "lastname": "Moose",
            "orgname": "Moosers",
            "phone": "916-555-0000",
            "mobile": "916-555-1111",
            "email": "[email protected]"
          },
          {
            "firstname": "Minny",
            "lastname": "Moose",
            "orgname": "Moosers",
            "phone": "916-555-2222",
            "mobile": "",
            "email": "[email protected]"
          }
        ]
      }       
    ]
    

    }

    I wish to export this to an excel table. My current code is this:

    from pandas.io.json import json_normalize
    import json
    import pandas as pd
    
    inputFile = 'E:\\teams.json'
    outputFile = 'E:\\teams.xlsx'
    
    f = open(inputFile)
    data = json.load(f)
    f.close()
    
    df = pd.DataFrame(data)
    
    result1 = json_normalize(data, 'teams' )
    print result1
    

    results in this output:

    members                                              teamname
    0  [{u'firstname': u'John', u'phone': u'916-555-...        1
    1  [{u'firstname': u'Mickey', u'phone': u'916-555-...      2
    

    There are 2 members's data nested within each row. I would like to have an output table that displays all 4 members' data plus their associated teamname.

  • spaine
    spaine almost 8 years
    Thank you, but what I am trying to achieve is a table that has names, email, etc as column headings with the team number as an additional column heading. That way the four people are represented by four flattened records.
  • piRSquared
    piRSquared almost 8 years
    This was intended to show you the sort of techniques that can be used to accomplish your goal. To get what you want, you now need to pivot. I'll update my answer to reflect this.
  • spaine
    spaine almost 8 years
    The example you've created looks perfect. When I run this the following error gets returned at the df.index.levels[0].name = 'teamname' line: AttributeError: 'Int64Index' object has no attribute 'levels'
  • piRSquared
    piRSquared almost 8 years
    That's my bad. I edited the post. I forgot to assign the pd.concat to the dataframe df
  • spaine
    spaine almost 8 years
    This is very helpful to me - I'm going to have to do some research to understand exactly how your code works. May I ask how you produced the data in table form that you posted above?
  • piRSquared
    piRSquared almost 8 years
    This solution is very specific to the json structure you provided. The html table is output while using jupyter-notebook. You can access the same html via df.to_html().
  • spaine
    spaine almost 8 years
    Hmm, when I include either or both of the last 2 scripts to turn it into a table, it remains as you show after pd.concat...
  • piRSquared
    piRSquared almost 8 years
    Ok, I was basing my code on an assumption that your data was the same as mine. I'll update the code again with the exact thing I ran.
  • spaine
    spaine almost 8 years
    I was using the same data, just loading from a .json file. Even if I copy the entire text of the code into a python editor, printing the result displays what you show after pd.concat...
  • piRSquared
    piRSquared almost 8 years
    Are you looking at df or df.T.stack(0).swaplevel(0, 1).sort_index().reset_index()? The latter doesn't reassign to df. So if you look at df after you've run df.T.stack(0).swaplevel(0, 1).sort_index().reset_index(), df will still look like it did after the pd.concat
  • spaine
    spaine almost 8 years
    That was it, I thought df had been reassigned. I assigned df.T.stack(0).swaplevel(0, 1).sort_index().reset_index() as an object and the export to Excel worked perfectly. Thank you very much.
  • Datacrawler
    Datacrawler almost 5 years
    @piRSquared Amazing solution. Hopefully, I can make it work when, for example, the phone has many entries e.g. "phone": [{"home": "1111", "mobile": "2222"},{"home": "3333", "mobile": "44444"}