Python: make a list generator JSON serializable


Solution 1

You should derive from list and override the __iter__ method.

import json

def gen():
    yield 20
    yield 30
    yield 40

class StreamArray(list):
    def __iter__(self):
        return gen()

    # Report a non-zero length so the encoder does not short-circuit
    # the (apparently empty) list to "[]"
    def __len__(self):
        return 1

a = [1,2,3]
b = StreamArray()

print(json.dumps([1,a,b]))

Result is [1, [1, 2, 3], [20, 30, 40]].
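
If you want to wrap an arbitrary generator rather than the hard-coded gen() above, a small variation of the same idea takes the iterable in the constructor. This is only a sketch (the class and attribute names are illustrative), and like any generator it can be serialized only once:

import json

class StreamArray(list):
    """Serialize an arbitrary iterable as a JSON array."""

    def __init__(self, iterable):
        self._iterable = iterable

    def __iter__(self):
        return iter(self._iterable)

    def __len__(self):
        # Pretend to be non-empty so the pure-Python encoder does not
        # short-circuit the (apparently empty) list to "[]".
        return 1

print(json.dumps({"squares": StreamArray(x * x for x in range(5))}))
# {"squares": [0, 1, 4, 9, 16]}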

Solution 2

As of simplejson 3.8.0, you can use the iterable_as_array option to make any iterable serializable into an array:

# Since simplejson is backwards compatible, you should feel free to import
# it as `json`
import simplejson as json
json.dumps((i*i for i in range(10)), iterable_as_array=True)

The result is [0, 1, 4, 9, 16, 25, 36, 49, 64, 81].
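
The same option should also work when streaming straight to a file with simplejson's dump, which writes the encoder's chunks to the file object as they are produced (a sketch; the rows() generator is just a stand-in, and I assume dump accepts the same keyword as dumps, as it does in recent simplejson releases):

import simplejson as json

def rows():
    # Stand-in for something that yields records lazily
    for i in range(1000000):
        yield {"id": i}

with open("out.json", "w") as f:
    json.dump(rows(), f, iterable_as_array=True)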

Solution 3

This universal solution is also useful for really huge data, where the result string could not fit in memory but can still easily be written to a stream from a JSON iterator. (This is better than simplejson, which can help, but not by much.) Tested with Python 2.7, 3.0, 3.3, 3.6 and 3.10.0a7. About twice as fast as simplejson, with a small memory footprint and unit tests included.

import itertools

class SerializableGenerator(list):
    """Generator that is serializable by JSON"""

    def __init__(self, iterable):
        tmp_body = iter(iterable)
        try:
            # Consume exactly one item: an empty input leaves the list
            # empty (len 0), a non-empty input makes it len 1.
            self._head = iter([next(tmp_body)])
            self.append(tmp_body)
        except StopIteration:
            self._head = []

    def __iter__(self):
        # Chain the already-consumed first item back in front of the rest.
        return itertools.chain(self._head, *self[:1])

Normal usage (needs little memory for the input, but still builds the whole output string in memory):

>>> json.dumps(SerializableGenerator(iter([1, 2])))
'[1, 2]'
>>> json.dumps(SerializableGenerator(iter([])))
'[]'

For really huge data it can be used as a generator of JSON chunks in Python 3 and still uses very little memory:

>>> iter_json = json.JSONEncoder().iterencode(SerializableGenerator(iter(range(1000000))))
>>> for chunk in iter_json:
...     stream.write(chunk)
# or a naive example
>>> tuple(iter_json)
('[0', ', 1', ... ', 999999', ']')

The class works with the normal JSONEncoder().encode(...) (used internally by json.dumps(...)) or with an explicit JSONEncoder().iterencode(...) to get a generator of JSON chunks instead.

(The function iter() in the examples is not necessary for it to work; it only demonstrates a non-trivial input that has no known length.)


Test:

import unittest
import json
# from ?your_module? import SerializableGenerator 


class Test(unittest.TestCase):

    def combined_dump_assert(self, iterable, expect):
        self.assertEqual(json.dumps(SerializableGenerator(iter(iterable))), expect)

    def combined_iterencode_assert(self, iterable, expect):
        encoder = json.JSONEncoder().iterencode
        self.assertEqual(tuple(encoder(SerializableGenerator(iter(iterable)))), expect)

    def test_dump_data(self):
        self.combined_dump_assert(iter([1, "a"]), '[1, "a"]')

    def test_dump_empty(self):
        self.combined_dump_assert(iter([]), '[]')

    def test_iterencode_data(self):
        self.combined_iterencode_assert(iter([1, "a"]), ('[1', ', "a"', ']'))

    def test_iterencode_empty(self):
        self.combined_iterencode_assert(iter([]), ('[]',))

    def test_that_all_data_are_consumed(self):
        gen = SerializableGenerator(iter([1, 2]))
        list(gen)
        self.assertEqual(list(gen), [])

This solution is inspired by three older answers: one by Vadim Pushtaev (which has a problem with an empty iterable), one by user1158559 (unnecessarily complicated) and one by Claude (in another question, also complicated).

Important differences from these solutions are:

  • The important methods __len__, __bool__ and others are inherited consistently from the list class, which is meaningfully initialized.
  • The first item of the input is evaluated immediately by __init__ (not lazily triggered by many other methods), so the list class knows at once whether the iterator is empty or not: a non-empty list contains one item (the generator), while the list stays empty if the iterator is empty.
  • A correct length for an empty iterator is important for the JSONEncoder.iterencode(...) method.
  • All other methods give a meaningful output, e.g. __repr__:
   >>> SerializableGenerator((x for x in range(3)))
   [<generator object <genexpr> at 0x........>]

An advantage of this solution is that a standard JSON serializer can be used. If nested generators need to be supported, then the solution with simplejson is probably the best; it also has a similar variant with iterencode(...).
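
For example, with simplejson every iterable at any depth is turned into an array, so nesting needs no extra code (a small illustrative check; the expected output is my assumption):

import simplejson

nested = (range(n) for n in range(4))
print(simplejson.dumps(nested, iterable_as_array=True))
# Expected: [[], [0], [0, 1], [0, 1, 2]]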


Stub *.pyi for strong typing:

from typing import Any, Iterable, Iterator

class SerializableGenerator(list):
    def __init__(self, iterable: Iterable[Any]) -> None: ...
    def __iter__(self) -> Iterator: ...

Solution 4

Based on the accepted answer, here is the StreamArray I eventually went for. It contains two lies:

  1. The suggestion that self.__tail__ might be immutable
  2. len(StreamArray(some_gen)) is either 0 or 1


class StreamArray(list):

    def __init__(self, gen):
        self.gen = gen

    def destructure(self):
        # Pull one element lazily to learn whether the generator is empty,
        # then cache (head, tail, len) for subsequent calls.
        try:
            return self.__head__, self.__tail__, self.__len__
        except AttributeError:
            try:
                self.__head__ = self.gen.__next__()
                self.__tail__ = self.gen
                self.__len__ = 1 # A lie
            except StopIteration:
                self.__head__ = None
                self.__tail__ = []
                self.__len__ = 0
            return self.__head__, self.__tail__, self.__len__

    def rebuilt_gen(self):
        def rebuilt_gen_inner():
            head, tail, len_ = self.destructure()
            if len_ > 0:
                yield head
            for elem in tail:
                yield elem
        try:
            return self.__rebuilt_gen__
        except AttributeError:
            self.__rebuilt_gen__ = rebuilt_gen_inner()
            return self.__rebuilt_gen__

    def __iter__(self):
        return self.rebuilt_gen()

    def __next__(self):
        return next(self.rebuilt_gen())

    def __len__(self):
        return self.destructure()[2]

Single use only!
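
A usage sketch, assuming the class above is available; because it wraps a generator, each StreamArray instance can be serialized exactly once:

import json

def read_records():
    # Stand-in for a generator that yields items lazily
    for i in range(3):
        yield {"id": i}

with open("out.json", "w") as f:
    json.dump(StreamArray(read_records()), f)

# out.json now contains [{"id": 0}, {"id": 1}, {"id": 2}]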

Solution 5

I was getting this error in a map-reduce task with mrjob. It was resolved after handling the iterator properly.

If you do not handle the iterator yielded by the mapper, you will get this error.
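
The concrete fix depends on the job, but the idea is simply not to hand a raw generator to the JSON layer: materialize it when the result is small, or wrap it with one of the helpers above when it is not. A hypothetical sketch:

import json

def mapper_output():
    # Stand-in for whatever the mapper yields
    yield ("key", 1)
    yield ("key", 2)

# Materialize the iterator before serializing (fine for small results) ...
print(json.dumps(list(mapper_output())))
# ... or wrap it with one of the streaming helpers above for large results.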


Comments

  • Sebastian Wagner (almost 2 years ago)

    How can I concat a list of JSON files into a huge JSON array? I've 5000 files and 550 000 list items.

    My first try was to use jq, but it looks like jq -s is not optimized for large input.

    jq -s -r '[.[][]]' *.js 
    

    This command works, but it takes far too long to complete, and I would really like to solve this with Python.

    Here is my current code:

    def concatFiles(outName, inFileNames):
        def listGenerator():
            for inName in inFileNames:
                with open(inName, 'r') as f:
                    for item in json.load(f):
                        yield item
    
        with open(outName, 'w') as f:
            json.dump(listGenerator(), f)
    

    I'm getting:

    TypeError: <generator object listGenerator at 0x7f94dc2eb3c0> is not JSON serializable
    

    Any attempt to load all the files into RAM triggers the Linux OOM killer. Do you have any ideas?
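
    The error goes away with any of the approaches above; for example, Solution 3's wrapper drops straight into this code (a sketch, assuming SerializableGenerator is defined as above):

    import json

    def concatFiles(outName, inFileNames):
        def listGenerator():
            for inName in inFileNames:
                with open(inName, 'r') as f:
                    for item in json.load(f):
                        yield item

        with open(outName, 'w') as f:
            # json.dump streams chunks to the file via the pure-Python encoder
            json.dump(SerializableGenerator(listGenerator()), f)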