Pandas json_normalize produces confusing `KeyError` message?

Solution 1

In this case, I think you'd just use this:

In [57]: json_normalize(data[0]['events'])
Out[57]: 
  group  schedule.ID schedule.date schedule.location.building  \
0     A          815    2015-08-27                        BDC   
1     A          816    2015-08-27                        BDC   

   schedule.location.floor  
0                        5  
1                        5  

The meta paths ([['schedule','date']...]) are for specifying data at the same level of nesting as your records, i.e. at the same level as 'events'. Here 'schedule' lives inside each event rather than next to 'events', so looking it up on the outer dict raises KeyError: 'schedule'. It doesn't look like json_normalize handles dicts with nested lists particularly well, so you may need to do some manual reshaping if your actual data is much more complicated.
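For contrast, here is a minimal sketch of a case where meta does apply. The top-level "site" field is a hypothetical addition to the question's data (it is not in the original JSON), and the call assumes a pandas version that exposes json_normalize as pd.json_normalize (1.0 or later):

```python
import pandas as pd

# Hypothetical variant of the question's data: "site" sits next to "events",
# so it is a valid meta field. "schedule" sits inside each event, so it is not.
data = {
    "site": "HQ",
    "events": [
        {"schedule": {"date": "2015-08-27",
                      "location": {"building": "BDC", "floor": 5},
                      "ID": 815},
         "group": "A"},
        {"schedule": {"date": "2015-08-27",
                      "location": {"building": "BDC", "floor": 5},
                      "ID": 816},
         "group": "A"},
    ],
}

# record_path picks the list to expand into rows; meta copies fields from
# the enclosing dict onto every row. Nested dicts inside each record are
# flattened into dotted column names without being listed anywhere.
df = pd.json_normalize(data, record_path="events", meta=["site"])
print(df)
```

The nested schedule fields show up as dotted columns on their own, which is why listing them under meta is unnecessary in the question's case.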

Solution 2

I got the KeyError when the structure of the JSON was not consistent. That is, when one of the nested structures was missing from a record, I got a KeyError.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.json.json_normalize.html

In the examples on the pandas documentation site, if you remove the nested tag (counties) from one of the records, you will get a KeyError. To circumvent this, you have to either ignore the missing tag or consider only the records that have the nested column/tag populated with data.
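Both workarounds can be done in plain Python before the data reaches pandas. The following is a rough sketch: fill_missing is a hypothetical helper name, and the sample records are a cut-down imitation of the documentation example, not the real data:

```python
def fill_missing(records, key, default):
    """Return a new list where every record has `key`, using `default` if absent."""
    return [rec if key in rec else {**rec, key: default} for rec in records]


# Illustrative records: the second one lacks the nested "counties" list.
records = [
    {"state": "Florida", "counties": [{"name": "Dade", "population": 12345}]},
    {"state": "Ohio"},
]

# Option 1: give the missing tag an empty default so every record has the key.
patched = fill_missing(records, "counties", [])

# Option 2: keep only the records whose nested tag is present and non-empty.
kept = [rec for rec in records if rec.get("counties")]
```

Either list should then be safe to pass on, e.g. json_normalize(patched, 'counties', ['state']). As far as I can tell, a record with an empty list simply contributes no rows.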

Solution 3

I had this same problem! This thread helped, especially parachute py's answer.

I found a solution using:

df.dropna(subset = *column(s) with nested data*)

then saving the resultant df as a new json. Load the new json and now you'll be able to flatten the nested columns.

There's probably a more efficient way to get around this, but my solution works.

edit: forgot to mention, I tried using the *errors = 'ignore'* arg in json_normalize() and it didn't help.
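A possible shortcut that skips the save-and-reload step, sketched under assumptions: the column name "info" and the sample rows are made up for illustration, and pd.json_normalize assumes pandas 1.0 or later. This is one alternative, not necessarily what the answer above did:

```python
import pandas as pd

# Made-up frame: the "info" column holds nested dicts, one row has none.
df = pd.DataFrame([
    {"id": 1, "info": {"city": "Oslo", "zip": "0150"}},
    {"id": 2, "info": None},
    {"id": 3, "info": {"city": "Bergen", "zip": "5003"}},
])

# Drop the rows where the nested column is missing.
clean = df.dropna(subset=["info"])

# Flatten the remaining dicts directly, without writing a new JSON file.
flat = pd.json_normalize(clean["info"].tolist())
flat.index = clean.index  # realign with the surviving rows before joining

result = clean.drop(columns=["info"]).join(flat)
print(result)
```

Resetting flat.index matters here: json_normalize returns a fresh 0..n-1 index, while clean keeps the original row labels, so a join without realignment would mismatch rows.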

Author by

themachinist

Updated on July 25, 2022

Comments

  • themachinist
    themachinist almost 2 years

    I'm trying to convert a nested JSON to a Pandas dataframe. I've been using json_normalize with success until I came across a certain JSON. I've made a smaller version of it to recreate the problem.

    from pandas.io.json import json_normalize
    
    json=[{"events": [{"schedule": {"date": "2015-08-27",
                                    "location": {"building": "BDC", "floor": 5},
                                    "ID": 815},
                       "group": "A"},
                      {"schedule": {"date": "2015-08-27",
                                    "location": {"building": "BDC", "floor": 5},
                                    "ID": 816},
                       "group": "A"}]}]
    

    I then run:

    json_normalize(json[0],'events',[['schedule','date'],['schedule','location','building'],['schedule','location','floor']])
    

    Expecting to see something like this:

    ID      group   schedule.date   schedule.location.building schedule.location.floor  
    '815'   'A'     '2015-08-27'            'BDC'                       5
    '816'   'A'     '2015-08-27'            'BDC'                       5
    

    But instead I get this error:

    In [2]: json_normalize(json[0],'events',[['schedule','date'],['schedule','location','building'],['schedule','location','floor']])
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-2-b588a9e3ef1d> in <module>()
    ----> 1 json_normalize(json[0],'events',[['schedule','date'],['schedule','location','building'],['schedule','location','floor']])
    
    /Users/logan/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/io/json.pyc in json_normalize(data, record_path, meta, meta_prefix, record_prefix)
        739                 records.extend(recs)
        740
    --> 741     _recursive_extract(data, record_path, {}, level=0)
        742
        743     result = DataFrame(records)
    
    /Users/logan/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/io/json.pyc in _recursive_extract(data, path, seen_meta, level)
        734                         meta_val = seen_meta[key]
        735                     else:
    --> 736                         meta_val = _pull_field(obj, val[level:])
        737                     meta_vals[key].append(meta_val)
        738
    
    /Users/logan/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/io/json.pyc in _pull_field(js, spec)
        674         if isinstance(spec, list):
        675             for field in spec:
    --> 676                 result = result[field]
        677         else:
        678             result = result[spec]
    
    KeyError: 'schedule'
    
  • devanathan
    devanathan over 7 years
    Is there any way to get floor as the column name instead of schedule.location.floor?
  • Arthur Zangiev
    Arthur Zangiev about 7 years
    You can always rename columns with .rename(columns={'schedule.location.floor': 'floor'})