Convert CSV to JSON (in specific format) using Python

15,921

Since you are using pandas already, I tried to get as much mileage as I could out of dataframe methods. I also ended up wandering fairly far afield from your implementation. I think the key here, though, is don't try to get too clever with your list and/or dictionary comprehensions. You can very easily confuse yourself and everyone who reads your code.

import pandas as pd
from itertools import groupby 
from collections import OrderedDict
import json    

df = pd.read_csv('SampleCsvFile.csv', dtype={
            "zipcode" : str,
            "date" : str,
            "state" : str,
            "val1" : str,
            "val2" : str,
            "val3" : str,
            "val4" : str,
            "val5" : str
        })

results = []

for (zipcode, state), bag in df.groupby(["zipcode", "state"]):
    contents_df = bag.drop(["zipcode", "state"], axis=1)
    subset = [OrderedDict(row) for i,row in contents_df.iterrows()]
    results.append(OrderedDict([("zipcode", zipcode),
                                ("state", state),
                                ("subset", subset)]))

print json.dumps(results[0], indent=4)
#with open('ExpectedJsonFile.json', 'w') as outfile:
#    outfile.write(json.dumps(results[0], indent=4))

The simplest way to have all the json datatypes written as strings, and to retain their original formatting, was to force read_csv to parse them as strings. If, however, you need to do any numerical manipulation on the values before writing out the json, you will have to allow read_csv to parse them numerically and coerce them into the proper string format before converting to json.

Share:
15,921
Steve
Author by

Steve

Updated on June 05, 2022

Comments

  • Steve
    Steve almost 2 years

    I would like to convert a csv file into a json file using python 2.7. Down below is the python code I tried but it is not giving me expected result. Also, I would like to know if there is any simplified version than mine. Any help is appreciated.

    Here is my csv file (SampleCsvFile.csv):

    zipcode,date,state,val1,val2,val3,val4,val5
    95110,2015-05-01,CA,50,30.00,5.00,3.00,3
    95110,2015-06-01,CA,67,31.00,5.00,3.00,4
    95110,2015-07-01,CA,97,32.00,5.00,3.00,6
    

    Here is the expected json file (ExpectedJsonFile.json):

    {
            "zipcode": "95110", 
            "state": "CA", 
            "subset": [
                {
                    "date": "2015-05-01",
                    "val1": "50", 
                    "val2": "30.00", 
                    "val3": "5.00", 
                    "val4": "3.00", 
                    "val5": "3"
                }, 
                {
                    "date": "2015-06-01", 
                    "val1": "67", 
                    "val2": "31.00", 
                    "val3": "5.00", 
                    "val4": "3.00", 
                    "val5": "4"
                }, 
                {
                    "date": "2015-07-01", 
                    "val1": "97", 
                    "val2": "32.00", 
                    "val3": "5.00", 
                    "val4": "3.00", 
                    "val5": "6"
                }
            ]
    
    }
    

    Here's the python code I tried:

    import pandas as pd
    from itertools import groupby 
    import json    
    
    df = pd.read_csv('SampleCsvFile.csv')
    
    names = df.columns.values.tolist()
    data = df.values
    
    master_list2 = [ (d["zipcode"], d["state"], d) for d in [dict(zip(names, d)) for d in data] ]
    intermediate2 = [(k, [x[2] for x in list(v)]) for k,v in groupby(master_list2, lambda t: (t[0],t[1]) )]
    nested_json2 = [dict(zip(names,(k[0][0], k[0][1], k[1]))) for k in [(i[0], i[1]) for i in intermediate2]]
    
    #print json.dumps(nested_json2, indent=4)
    with open('ExpectedJsonFile.json', 'w') as outfile:
         outfile.write(json.dumps(nested_json2, indent=4))
    
  • Steve
    Steve over 8 years
    When I try to execute your code, I get the type error: 95110 is not JSON serializable. Also, the output doesn't have double quotes to numerical values and truncates the decimal portion. Any idea on how to resolve it?
  • Steve
    Steve over 8 years
    Here's the output of "print result" (without the json.dumps): {'subset': [{'val3': 5.0, 'date': '2015-05-01', 'val2': 30.0, 'val1': 50L, 'val5': 3L, 'val4': 3.0}, {'val3': 5.0, 'date': '2015-06-01', 'val2': 31.0, 'val1': 67L, 'val5': 4L, 'val4': 3.0}, {'val3': 5.0, 'date': '2015-07-01', 'val2': 32.0, 'val1': 97L, 'val5': 6L, 'val4': 3.0}], 'state': 'CA', 'zipcode': 95110}
  • Aniket Schneider
    Aniket Schneider over 8 years
    I adjusted the answer to force all the datatypes to be read as strings, which will resolve the quoting and numerical formatting issues. I'm not sure about the type error you're experiencing, as I can't replicate that, but since the datatype of the parsed zipcode is different now there is a chance that will be resolved as well.
  • Steve
    Steve over 8 years
    New code works like a charm. Thanks a lot Aniket, I really appreciate it.
  • Steve
    Steve over 8 years
    one question though, is there a way to change/set the order of elements? For example, I would like to see state and zipcode on the top and subset in the bottom. Right now, the order seems to be random.
  • Aniket Schneider
    Aniket Schneider over 8 years
    Updated the answer to force ordering. Note that json objects are considered unordered collections, so anything that parses json should not care about the ordering. However, for human readability, you can use the python OrderedDict to set the order you want.
  • Steve
    Steve over 8 years
    Understood. Thanks again!
  • Aniket Schneider
    Aniket Schneider over 8 years
    Don't forget to accept the answer if it has solved your problem! :)
  • Steve
    Steve over 8 years
    just did. :) Thanks for the remainder.
  • DUDANF
    DUDANF almost 5 years
    One question! I'm hoping someone still replies. So you know how you've created a subset? Basically this is the result of my subset: ``` "Serial_numbers": [ { "For Serial": "0018F", "Rev Serial": "0018R" } ] ``` I want this to simply be ``` "Serial Numbers": [ { "0018F -0018R" : {"ADD SOME HARDCODED VALUES"} } ] ``` Anyone know how I can use OrderedDict and bag to do this?