Split JSON file in equal/smaller parts with Python

Use an iteration grouper; the itertools module recipes list includes the following:

from itertools import izip_longest  # itertools.zip_longest on Python 3

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
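
The recipe works by repeating a single iterator n times, so each output tuple draws the next n items from the same underlying stream, padding the last tuple with the fill value. For illustration:

    >>> list(grouper('ABCDEFG', 3, 'x'))
    [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]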

This lets you iterate over your tweets in groups of 5000:

for i, group in enumerate(grouper(input_tweets, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)
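
Note that izip_longest pads the final group with the fill value (None by default) whenever the total count is not an exact multiple of 5000, so the last batch will contain trailing null entries. A minimal sketch that drops the padding before writing:

    for i, group in enumerate(grouper(input_tweets, 5000)):
        batch = [tweet for tweet in group if tweet is not None]
        with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
            json.dump(batch, outputfile)
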
Author: Tom

Updated on August 07, 2022

Comments

  • Tom over 1 year

    I am currently working on a project where I use Sentiment Analysis for Twitter Posts. I am classifying the Tweets with Sentiment140. With the tool I can classify up to 1,000,000 Tweets per day and I have collected around 750,000 Tweets. So that should be fine. The only problem is that I can send a max of 15,000 Tweets to the JSON Bulk Classification at once.

    My whole code is set up and running. The only problem is that my JSON file now contains all 750,000 Tweets.

    Therefore my question: What is the best way to split the JSON into smaller files with the same structure? I would prefer to do this in Python.

    I have thought about iterating through the file. But how do I specify in the code that it should create a new file after, for example, 5,000 elements?

    I would love to get some hints on what the most reasonable approach is. Thank you!

    EDIT: This is the code that I have at the moment.

    import json
    from itertools import izip_longest
    
    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return izip_longest(fillvalue=fillvalue, *args)
    
    # Open JSON file
    values = open('Tweets.json').read()
    #print values
    
    # Adjust formatting of JSON file
    values = values.replace('\n', '')    # do your cleanup here
    #print values
    
    v = values.encode('utf-8')
    #print v
    
    # Load JSON file
    v = json.loads(v)
    print type(v)
    
    for i, group in enumerate(grouper(v, 5000)):
        with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
            json.dump(list(group), outputfile)
    

    The output gives:

    ["data", null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, ...]
    

    in a file called: "outputbatch_0.json"

    EDIT 2: This is the structure of the JSON.

    {
        "data": [
            {
                "text": "So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...",
                "id": "1"
            },
            {
                "text": "RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht\u2026",
                "id": "2"
            },
            {
                "text": "@Iknowimbetter naw if its weak which I dont think it will be im not gonna want to buy and up buying Kanye or even Kelly Rowland album lol",
                "id": "3"
            }
        ]
    }
    
  • Tom almost 11 years
    Looks really interesting. But I am struggling with the different parts of the code. I guess I read the JSON file first with "values = open('Tweets.json').read()". Can you elaborate a bit on the different parameters? Thanks!
  • Tom almost 11 years
    OK, thanks! What are the grouper and input_tweets parameters?
  • Martijn Pieters almost 11 years
    grouper() is the function I define above; input_tweets is your sequence of 750,000 tweets.
  • Tom almost 11 years
    Ah ok - sorry. Do I need to define the izip_longest function as well or should it be included in the module? My code says: "NameError: global name 'izip_longest' is not defined."
  • Martijn Pieters almost 11 years
    Sorry, the included recipe comes from the itertools module, and izip_longest is a function you import from that module. I have now included an explicit import in the sample above.
  • Tom almost 11 years
    Sorry for my bad questions, but this code is really complicated for me. I have updated the question to show you my current results. The output is not really meaningful...
  • Martijn Pieters almost 11 years
    You have only one object in input_tweets in that case. The null values are None values returned by the grouper to pad the list up to 5000 elements.
  • Tom almost 11 years
    I have just included the structure of the JSON file. It has roughly 750,000 elements. By "objects", do you mean the content of the JSON?
  • Martijn Pieters almost 11 years
    Use just the data element; loop over jsoncontainer['data'] instead. That is your list of tweets.
  • Tom almost 11 years
    When I delete the "{"data": }" wrapper it works. But I need this part at the beginning of every file, as it is required by the API.
  • Martijn Pieters almost 11 years
    Then create a new dictionary and save that: json.dump({'data': list(group)}, outputfile). (A complete sketch combining these steps follows this thread.)
  • Tom almost 11 years
    This is it. Thank you so much, and sorry for all the trouble!
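
Putting the thread's conclusion together, here is a minimal end-to-end sketch (Python 2, to match the code above; on Python 3, itertools.zip_longest replaces izip_longest). The file name Tweets.json, the batch size of 5,000, and the {"data": [...]} wrapper all come from the question:

    import json
    from itertools import izip_longest  # itertools.zip_longest on Python 3

    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        args = [iter(iterable)] * n
        return izip_longest(fillvalue=fillvalue, *args)

    # Load the file once and pull out the list of tweets under "data"
    with open('Tweets.json') as inputfile:
        tweets = json.load(inputfile)['data']

    # Write batches of 5000 tweets, dropping the None padding from the last
    # batch and restoring the {"data": [...]} wrapper the API expects
    for i, group in enumerate(grouper(tweets, 5000)):
        batch = [tweet for tweet in group if tweet is not None]
        with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
            json.dump({'data': batch}, outputfile)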