Split JSON file in equal/smaller parts with Python
Use an iteration grouper; the itertools module's recipes section includes the following:
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
This lets you iterate over your tweets in groups of 5000:
for i, group in enumerate(grouper(input_tweets, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)
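If you are on Python 3, note that izip_longest was renamed to itertools.zip_longest; the recipe is otherwise unchanged. A minimal runnable sketch of the same approach, using a toy in-memory tweet list and a chunk size of 3 in place of the real 750,000-tweet file and chunk size of 5000:

```python
from itertools import zip_longest  # Python 3 name for izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Toy stand-in for the real tweet list; chunk size 3 instead of 5000.
input_tweets = [{"text": "tweet %d" % i, "id": str(i)} for i in range(8)]

batches = []
for i, group in enumerate(grouper(input_tweets, 3)):
    batches.append(list(group))

# The final batch is padded with the fill value (None) up to the chunk size.
print(len(batches))   # -> 3
print(batches[-1])    # -> [{'text': 'tweet 6', 'id': '6'}, {'text': 'tweet 7', 'id': '7'}, None]
```

Each batch can then be written out with json.dump as shown above.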
Tom
Updated on August 07, 2022

Comments:
-
Tom over 1 year
I am currently working on a project where I use Sentiment Analysis for Twitter Posts. I am classifying the Tweets with Sentiment140. With the tool I can classify up to 1,000,000 Tweets per day and I have collected around 750,000 Tweets. So that should be fine. The only problem is that I can send a max of 15,000 Tweets to the JSON Bulk Classification at once.
My whole code is set up and running. The only problem is that my JSON file now contains all 750,000 Tweets.
Therefore my question: What is the best way to split the JSON into smaller files with the same structure? I would prefer to do this in Python.
I have thought about iterating through the file. But how do I specify in the code that it should create a new file after, for example, 5,000 elements?
I would love to get some hints on what the most reasonable approach is. Thank you!
EDIT: This is the code that I have at the moment.
import itertools
import json
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

# Open JSON file
values = open('Tweets.json').read()
#print values

# Adjust formatting of JSON file
values = values.replace('\n', '')  # do your cleanup here
#print values
v = values.encode('utf-8')
#print v

# Load JSON file
v = json.loads(v)
print type(v)

for i, group in enumerate(grouper(v, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)
The output gives:
["data", null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, ...]
in a file called: "outputbatch_0.json"
EDIT 2: This is the structure of the JSON.
{
  "data": [
    {
      "text": "So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...",
      "id": "1"
    },
    {
      "text": "RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht\u00e2\u20ac\u00a6",
      "id": "2"
    },
    {
      "text": "@Iknowimbetter naw if its weak which I dont think it will be im not gonna want to buy and up buying Kanye or even Kelly Rowland album lol",
      "id": "3"
    }
  ]
}
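Given this structure, the list to chunk is the value under the "data" key, not the top-level object; iterating over the top-level dict yields only its key, which is what produces the puzzling output shown above. A quick check, using a hypothetical inline sample that mirrors this layout (the real file would be read with json.load):

```python
import json

# Hypothetical two-tweet sample mirroring the structure above.
raw = '{"data": [{"text": "first tweet", "id": "1"}, {"text": "second tweet", "id": "2"}]}'

container = json.loads(raw)
tweets = container['data']  # this list, not the outer dict, is what grouper() should chunk

print(type(container))  # -> <class 'dict'>; iterating over it yields only the key "data"
print(len(tweets))      # -> 2
```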
-
Tom almost 11 years: Looks really interesting, but I am struggling with the different parts of the code. I guess I read the JSON file first with "values = open('Tweets.json').read()". Can you elaborate a bit on the different parameters? Thanks
-
Tom almost 11 years: OK, thanks! What are the grouper and input_tweets parameters?
-
Martijn Pieters almost 11 years: grouper() is the function I named above; input_tweets is your sequence of 750,000 tweets.
-
Tom almost 11 years: Ah ok, sorry. Do I need to define the izip_longest function as well, or should it be included in the module? My code says: "NameError: global name 'izip_longest' is not defined."
-
Martijn Pieters almost 11 years: Sorry, the included recipe comes from the itertools module, and izip_longest is a function you import from that module. I have now included an explicit import in the sample above.
-
Tom almost 11 years: Sorry for my bad questions, but this code is really complicated for me. I have updated the question to show you my current results. The output is not really meaningful...
-
Martijn Pieters almost 11 years: You have only one object in input_tweets in that case. The null values are None values returned by the grouper to pad the list up to 5000 elements.
-
Tom almost 11 years: I have just included the structure of the JSON file. It has around 750,000 elements. By "objects", do you mean the content of the JSON?
-
Martijn Pieters almost 11 years: Use just the data element; loop over jsoncontainer['data'] instead. That is your list of tweets.
-
Tom almost 11 years: When I delete the "{"data": }" it works. But I need this part at the beginning of every file, as it is required by the API.
-
Martijn Pieters almost 11 years: Then create a new dictionary and save that: json.dump({'data': list(group)}, outputfile)
-
Tom almost 11 years: This is it. Thank you so much, and sorry for the inconvenience.
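Putting the thread's fixes together: chunk the list under "data", drop the None padding that grouper adds to the final short group, and wrap each batch back in a {"data": ...} dictionary as the API requires. A sketch, assuming Python 3 (where izip_longest is zip_longest) and an illustrative chunk size of 3:

```python
import json
from itertools import zip_longest  # Python 3 name for izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def split_tweets(container, chunk_size):
    """Yield {'data': [...]} batches of at most chunk_size tweets."""
    for group in grouper(container['data'], chunk_size):
        # Strip the None padding grouper adds to the final, short batch.
        yield {'data': [tweet for tweet in group if tweet is not None]}

# Toy stand-in for the real 750,000-element container.
container = {'data': [{'text': 'tweet %d' % i, 'id': str(i)} for i in range(7)]}

for i, batch in enumerate(split_tweets(container, 3)):
    # In the real script, each batch would go to 'outputbatch_{}.json'.format(i).
    payload = json.dumps(batch)
    print(i, len(batch['data']))
```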