Split JSON file in equal/smaller parts with Python

Use an iteration grouper; the itertools module recipes list includes the following:

from itertools import izip_longest  # itertools.zip_longest on Python 3

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
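
The recipe works by repeating a single iterator n times, so each output tuple draws the next n items from the same underlying stream, padding the last tuple with the fill value. For illustration:

    >>> list(grouper('ABCDEFG', 3, 'x'))
    [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]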

This lets you iterate over your tweets in groups of 5000:

for i, group in enumerate(grouper(input_tweets, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)
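
Note that izip_longest pads the final group with the fill value (None by default) whenever the total count is not an exact multiple of 5000, so the last batch will contain trailing null entries. A minimal sketch that drops the padding before writing:

    for i, group in enumerate(grouper(input_tweets, 5000)):
        batch = [tweet for tweet in group if tweet is not None]
        with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
            json.dump(batch, outputfile)
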
Author: Tom

Updated on August 07, 2022

Comments

  • Tom over 1 year

    I am currently working on a project where I use Sentiment Analysis for Twitter Posts. I am classifying the Tweets with Sentiment140. With the tool I can classify up to 1,000,000 Tweets per day and I have collected around 750,000 Tweets. So that should be fine. The only problem is that I can send a max of 15,000 Tweets to the JSON Bulk Classification at once.

    My whole code is set up and running. The only problem is that my JSON file now contains all 750,000 Tweets.

    Therefore my question: What is the best way to split the JSON into smaller files with the same structure? I would prefer to do this in Python.

    I have thought about iterating through the file. But how do I specify in the code that it should create a new file after, for example, 5,000 elements?

    I would love to get some hints on what the most reasonable approach is. Thank you!

    EDIT: This is the code that I have at the moment.

    import json
    from itertools import izip_longest
    
    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return izip_longest(fillvalue=fillvalue, *args)
    
    # Open JSON file
    values = open('Tweets.json').read()
    #print values
    
    # Adjust formatting of JSON file
    values = values.replace('\n', '')    # do your cleanup here
    #print values
    
    v = values.encode('utf-8')
    #print v
    
    # Load JSON file
    v = json.loads(v)
    print type(v)
    
    for i, group in enumerate(grouper(v, 5000)):
        with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
            json.dump(list(group), outputfile)
    

    The output gives:

    ["data", null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, ...]
    

    in a file called: "outputbatch_0.json"

    EDIT 2: This is the structure of the JSON.

    {
        "data": [
            {
                "text": "So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...",
                "id": "1"
            },
            {
                "text": "RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht\u2026",
                "id": "2"
            },
            {
                "text": "@Iknowimbetter naw if its weak which I dont think it will be im not gonna want to buy and up buying Kanye or even Kelly Rowland album lol",
                "id": "3"
            }
        ]
    }
    
  • Tom almost 11 years
    Looks really interesting. But I am struggling with the different parts of the code. I guess I read the JSON file first with "values = open('Tweets.json').read()". Can you elaborate a bit on the different parameters? Thanks!
  • Tom almost 11 years
    OK, thanks! What are the grouper and input_tweets parameters?
  • Martijn Pieters almost 11 years
    grouper() is the function I define above; input_tweets is your sequence of 750,000 tweets.
  • Tom almost 11 years
    Ah ok - sorry. Do I need to define the izip_longest function as well or should it be included in the module? My code says: "NameError: global name 'izip_longest' is not defined."
  • Martijn Pieters almost 11 years
    Sorry, the included recipe comes from the itertools module, and izip_longest is a function you import from that module. I have now included an explicit import in the sample above.
  • Tom almost 11 years
    Sorry for my bad questions, but this code is really complicated for me. I have updated the question to show you my current results. The output is not really meaningful...
  • Martijn Pieters almost 11 years
    You have only one object in input_tweets in that case. The null values are None values returned by the grouper to pad the list up to 5000 elements.
  • Tom almost 11 years
    I have just included the structure of the JSON file. It has roughly 750,000 elements. By "objects", do you mean the content of the JSON?
  • Martijn Pieters almost 11 years
    Use just the data element; loop over jsoncontainer['data'] instead. That is your list of tweets.
  • Tom almost 11 years
    When I delete the "{"data": }" wrapper it works. But I need this part at the beginning of every file, as it is required by the API.
  • Martijn Pieters almost 11 years
    Then create a new dictionary and save that: json.dump({'data': list(group)}, outputfile). (A complete sketch combining these steps follows this thread.)
  • Tom almost 11 years
    This is it. Thank you so much, and sorry for all the trouble!
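
Putting the thread's conclusion together, here is a minimal end-to-end sketch (Python 2, to match the code above; on Python 3, itertools.zip_longest replaces izip_longest). The file name Tweets.json, the batch size of 5,000, and the {"data": [...]} wrapper all come from the question:

    import json
    from itertools import izip_longest  # itertools.zip_longest on Python 3

    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        args = [iter(iterable)] * n
        return izip_longest(fillvalue=fillvalue, *args)

    # Load the file once and pull out the list of tweets under "data"
    with open('Tweets.json') as inputfile:
        tweets = json.load(inputfile)['data']

    # Write batches of 5000 tweets, dropping the None padding from the last
    # batch and restoring the {"data": [...]} wrapper the API expects
    for i, group in enumerate(grouper(tweets, 5000)):
        batch = [tweet for tweet in group if tweet is not None]
        with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
            json.dump({'data': batch}, outputfile)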