Python: make a list generator JSON serializable
Solution 1
You should derive from list and override the __iter__ method.
import json

def gen():
    yield 20
    yield 30
    yield 40

class StreamArray(list):
    def __iter__(self):
        return gen()

    # report a non-zero length so the pure-Python encoder
    # does not short-circuit to "[]" (see the demonstration below)
    def __len__(self):
        return 1

a = [1, 2, 3]
b = StreamArray()

print(json.dumps([1, a, b]))
The result is [1, [1, 2, 3], [20, 30, 40]].
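Why the __len__ trick is needed: json.dump(...) to a file and JSONEncoder().iterencode(...) use the pure-Python encoder, which checks the list's length before iterating, so with the inherited length of 0 the object would be serialized as []. A minimal demonstration, reusing gen() from above (NoLenStreamArray is a made-up name for illustration):

import json

class NoLenStreamArray(list):
    # same as StreamArray, but without the __len__ override
    def __iter__(self):
        return gen()

print("".join(json.JSONEncoder().iterencode(NoLenStreamArray())))  # -> []
print("".join(json.JSONEncoder().iterencode(StreamArray())))       # -> [20, 30, 40]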
Solution 2
As of simplejson 3.8.0, you can use the iterable_as_array option to make any iterable serializable into an array:
# Since simplejson is backwards compatible, you should feel free to import
# it as `json`
import simplejson as json
json.dumps((i*i for i in range(10)), iterable_as_array=True)
The result is [0, 1, 4, 9, 16, 25, 36, 49, 64, 81].
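The same keyword is also accepted by simplejson's dump(), so a large iterable can be written straight to a file without building the whole string in memory (a sketch; "out.json" is just an example path):

with open("out.json", "w") as f:
    json.dump((i * i for i in range(10)), f, iterable_as_array=True)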
Solution 3
This universal solution is useful also for really huge data: when the result string cannot fit easily in memory, it can still be easily written to a stream from a JSON iterator. (This is better than the simplejson approach, which helps, but not much.)

Tested with Python 2.7, 3.0, 3.3, 3.6 and 3.10.0a7. It is two times faster than simplejson, has a small memory footprint, and comes with unit tests.
import itertools

class SerializableGenerator(list):
    """Generator that is serializable by JSON"""

    def __init__(self, iterable):
        tmp_body = iter(iterable)
        try:
            self._head = iter([next(tmp_body)])
            self.append(tmp_body)
        except StopIteration:
            self._head = []

    def __iter__(self):
        return itertools.chain(self._head, *self[:1])
Normal usage (little memory is needed for the input, but the whole output string is still built in memory):
>>> import json
>>> json.dumps(SerializableGenerator(iter([1, 2])))
'[1, 2]'
>>> json.dumps(SerializableGenerator(iter([])))
'[]'
For really huge data it can be used as a generator of JSON chunks in Python 3, still using very little memory:
>>> iter_json = json.JSONEncoder().iterencode(SerializableGenerator(iter(range(1000000))))
>>> for chunk in iter_json:
...     stream.write(chunk)

# or a naive example:
>>> tuple(iter_json)
('[1', ', 2', ... ', 1000000', ']')
The class is consumed by a normal JSONEncoder().encode(...) (used internally by json.dumps(...)), or by an explicit JSONEncoder().iterencode(...) to get a generator of JSON chunks instead. (The function iter() in the examples is not necessary for it to work, only to demonstrate a non-trivial input that has no known length.)
Test:
import unittest
import json
# from ?your_module? import SerializableGenerator

class Test(unittest.TestCase):

    def combined_dump_assert(self, iterable, expect):
        self.assertEqual(json.dumps(SerializableGenerator(iter(iterable))), expect)

    def combined_iterencode_assert(self, iterable, expect):
        encoder = json.JSONEncoder().iterencode
        self.assertEqual(tuple(encoder(SerializableGenerator(iter(iterable)))), expect)

    def test_dump_data(self):
        self.combined_dump_assert(iter([1, "a"]), '[1, "a"]')

    def test_dump_empty(self):
        self.combined_dump_assert(iter([]), '[]')

    def test_iterencode_data(self):
        self.combined_iterencode_assert(iter([1, "a"]), ('[1', ', "a"', ']'))

    def test_iterencode_empty(self):
        self.combined_iterencode_assert(iter([]), ('[]',))

    def test_that_all_data_are_consumed(self):
        gen = SerializableGenerator(iter([1, 2]))
        list(gen)
        self.assertEqual(list(gen), [])
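To run the suite directly, the usual unittest entry point can be appended to the test module:

if __name__ == "__main__":
    unittest.main()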
This solution is inspired by three older answers: Vadim Pushtaev (some problem with an empty iterable), user1158559 (unnecessarily complicated), and Claude (in another question, also complicated).
Important differences from these solutions are:

- Important methods such as __len__, __bool__ and others are inherited consistently from a meaningfully initialized list class.
- The first item of the input is evaluated immediately by __init__ (not lazily triggered by many other methods), so the list class can know at once whether the iterator is empty or not: a non-empty list contains one item (the generator), while the list is empty if the iterator is empty.
- The correct implementation of length for an empty iterator is important for the JSONEncoder.iterencode(...) method.
- All other methods give a meaningful output, e.g. __repr__:
>>> SerializableGenerator((x for x in range(3)))
[<generator object <genexpr> at 0x........>]
An advantage of this solution is that a standard JSON serializer can be used. If nested generators should be supported, then the solution with simplejson is probably the best, and it also has a similar variant with iterencode(...).
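A sketch of that simplejson variant (assuming simplejson >= 3.8.0, whose JSONEncoder constructor accepts the same iterable_as_array keyword):

import simplejson

encoder = simplejson.JSONEncoder(iterable_as_array=True)
for chunk in encoder.iterencode(i * i for i in range(5)):
    print(chunk, end="")  # the chunks join to [0, 1, 4, 9, 16]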
Stub *.pyi for strong typing:
from typing import Any, Iterable, Iterator

class SerializableGenerator(list):
    def __init__(self, iterable: Iterable[Any]) -> None: ...
    def __iter__(self) -> Iterator: ...
Solution 4
Based on the accepted answer, here is the StreamArray I eventually went for. It contains two lies:

- The suggestion that self.__tail__ might be immutable.
- That len(StreamArray(some_gen)) is either 0 or 1.
class StreamArray(list):

    def __init__(self, gen):
        self.gen = gen

    def destructure(self):
        try:
            return self.__head__, self.__tail__, self.__len__
        except AttributeError:
            try:
                self.__head__ = self.gen.__next__()
                self.__tail__ = self.gen
                self.__len__ = 1  # A lie
            except StopIteration:
                self.__head__ = None
                self.__tail__ = []
                self.__len__ = 0
            return self.__head__, self.__tail__, self.__len__

    def rebuilt_gen(self):
        def rebuilt_gen_inner():
            head, tail, len_ = self.destructure()
            if len_ > 0:
                yield head
            for elem in tail:
                yield elem
        try:
            return self.__rebuilt_gen__
        except AttributeError:
            self.__rebuilt_gen__ = rebuilt_gen_inner()
            return self.__rebuilt_gen__

    def __iter__(self):
        return self.rebuilt_gen()

    def __next__(self):
        return self.rebuilt_gen()

    def __len__(self):
        return self.destructure()[2]
Single use only!
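A quick usage sketch (the payload is arbitrary; some_gen is a made-up helper):

import json

def some_gen():
    yield from range(3)

print(json.dumps(StreamArray(some_gen())))  # -> [0, 1, 2]
# "Single use only": the wrapped generator is consumed by the first pass,
# so serializing the same StreamArray instance again will not reproduce the data.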
Solution 5
I was getting this error in a map-reduce task with mrjob. It got resolved after handling the iterator properly. If you do not properly consume the iterator of values yielded by the mapper, you will get this error.
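What "handling the iterator properly" looks like in practice: a minimal mrjob sketch (the word-count job is a made-up example). The values argument of a reducer is an iterator and must be consumed, not yielded as-is:

from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer(self, key, values):
        # Consume the iterator here; yielding `values` itself would hand
        # the JSON protocol an unserializable generator and raise this error.
        yield key, sum(values)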
Sebastian Wagner
Updated on August 05, 2022

Comments

- Sebastian Wagner, almost 2 years ago:
How can I concatenate a list of JSON files into one huge JSON array? I have 5000 files and 550,000 list items.

My first try was to use jq, but it looks like jq -s is not optimized for a large input.

jq -s -r '[.[][]]' *.js

This command works, but takes way too long to complete, and I really would like to solve this with Python.
Here is my current code:
def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        json.dump(listGenerator(), f)
I'm getting:
TypeError: <generator object listGenerator at 0x7f94dc2eb3c0> is not JSON serializable
Any attempt to load all the files into RAM will trigger the OOM killer of Linux. Do you have any ideas?
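For reference, Solution 2 above applies directly to this code with a small change (a sketch, assuming simplejson is installed):

import simplejson as json

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        # iterable_as_array lets simplejson stream the generator to disk
        json.dump(listGenerator(), f, iterable_as_array=True)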