How to extract multiple JSON objects from one file?

python json pandas dataframe parsing

126,889

Solution 1

Use a JSON array, in the format:

[
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
  "Code":[{"event1":"A","result":"1"},…]},
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
  "Code":[{"event1":"B","result":"1"},…]},
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No",
  "Code":[{"event1":"B","result":"0"},…]},
...
]

Then load it in your Python code:

import json

with open('file.json') as json_file:
    data = json.load(json_file)

Now data is a list of dictionaries, one for each element of the array.

You can access its elements easily, e.g.:

data[0]["ID"]
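To get from the loaded list to the two columns the question asks for, one option is to pick the fields out of each dictionary. The following is a minimal sketch, not part of the original answer: the inline `raw` string stands in for `file.json`, and the `Code` lists are cut down to a single entry because the question elides the rest.

```python
import json

# Stand-in for the contents of file.json (an assumption for illustration).
# Each "Code" list keeps only the one entry that the question shows.
raw = """
[
 {"ID": "12345", "Timestamp": "20140101", "Usefulness": "Yes",
  "Code": [{"event1": "A", "result": "1"}]},
 {"ID": "1A35B", "Timestamp": "20140102", "Usefulness": "No",
  "Code": [{"event1": "B", "result": "1"}]},
 {"ID": "AA356", "Timestamp": "20140103", "Usefulness": "No",
  "Code": [{"event1": "B", "result": "0"}]}
]
"""
data = json.loads(raw)

# Keep only the two fields of interest from each object.
rows = [
    {"Timestamp": d["Timestamp"], "Usefulness": d["Usefulness"]}
    for d in data
]

for row in rows:
    print(row["Timestamp"], row["Usefulness"])
```

If pandas is installed, `pandas.DataFrame(rows)` should turn `rows` into a data frame with `Timestamp` and `Usefulness` columns.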

Solution 2

Update: I wrote a solution that doesn't require reading the entire file in one go. It's too big for a Stack Overflow answer, but it can be found here: jsonstream.

You can use json.JSONDecoder.raw_decode to decode arbitrarily big strings of "stacked" JSON (so long as they fit in memory). raw_decode stops once it has a valid object and returns that object together with the first position that wasn't part of it. It's not documented, but you can pass this position back to raw_decode and it will start parsing again from there. Unfortunately, raw_decode doesn't accept strings that begin with whitespace, so we need to search for the first non-whitespace character of each document before decoding it.

from json import JSONDecoder, JSONDecodeError
import re

NOT_WHITESPACE = re.compile(r'[^\s]')

def decode_stacked(document, pos=0, decoder=JSONDecoder()):
    while True:
        match = NOT_WHITESPACE.search(document, pos)
        if not match:
            return
        pos = match.start()
        
        try:
            obj, pos = decoder.raw_decode(document, pos)
        except JSONDecodeError:
            # do something sensible if there's some error
            raise
        yield obj

s = """

{"a": 1}  


   [
1
,   
2
]


"""

for obj in decode_stacked(s):
    print(obj)

prints:

{'a': 1}
[1, 2]
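To see what `raw_decode` hands back, it may help to call it directly. This is a small sketch with a made-up `text` string: the second element of the returned tuple is the index just past the parsed value, and calling `raw_decode` at an index that points at whitespace raises `JSONDecodeError`, which is why `decode_stacked` skips ahead first.

```python
from json import JSONDecoder, JSONDecodeError

decoder = JSONDecoder()
text = '{"a": 1} [1, 2]'  # two stacked documents separated by one space

# Parse the first document; `end` is the index just past its closing brace.
obj, end = decoder.raw_decode(text, 0)
print(obj, end)  # {'a': 1} 8

# text[8] is a space, and raw_decode does not skip leading whitespace.
try:
    decoder.raw_decode(text, end)
    whitespace_ok = True
except JSONDecodeError:
    whitespace_ok = False
print(whitespace_ok)  # False

# Starting at the next non-whitespace character works.
obj2, end2 = decoder.raw_decode(text, end + 1)
print(obj2, end2)  # [1, 2] 15
```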

Solution 3

As was mentioned in a couple of comments, containing the data in an array is simpler, but that solution does not scale well as the data set grows. You really only need a list when you want random access to items in the array; otherwise, generators are the way to go. Below I have prototyped a reader function that reads each JSON object individually and yields them one at a time.

The basic idea is to split the input on the newline character "\n" (or "\r\n" on Windows). Python does this for you when you iterate over a file object, which yields one line at a time.

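import json


def json_reader(filename):
    # Iterating over the file object yields one line at a time,
    # so only a single object is decoded and held at once.
    with open(filename) as f:
        for line in f:
            yield json.loads(line)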

However, this method only works when the file is written with each complete object on its own line, separated from the next by a newline character. Below is an example of a writer that takes a list of JSON objects and saves each one on a new line.

def json_writer(file, json_objects):
    with open(file, "w") as f:
        for jsonobj in json_objects:
            jsonstr = json.dumps(jsonobj)
            f.write(jsonstr + "\n")
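As a quick check that the writer and the reader agree on the format, here is a round-trip sketch. It repeats both functions so that it runs on its own; the sample `records` and the `records.jsonl` file name are made up for illustration.

```python
import json
import os
import tempfile


def json_writer(file, json_objects):
    # One json.dumps() result per line; dumps() escapes newlines inside strings.
    with open(file, "w") as f:
        for jsonobj in json_objects:
            f.write(json.dumps(jsonobj) + "\n")


def json_reader(filename):
    with open(filename) as f:
        for line in f:
            yield json.loads(line)


# Hypothetical sample data modelled on the question.
records = [
    {"ID": "12345", "Timestamp": "20140101", "Usefulness": "Yes"},
    {"ID": "1A35B", "Timestamp": "20140102", "Usefulness": "No"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.jsonl")  # hypothetical file name
    json_writer(path, records)
    with open(path) as f:
        line_count = len(f.readlines())
    # Consume the generator while the temporary file still exists.
    round_tripped = list(json_reader(path))

print(line_count)                # 2
print(round_tripped == records)  # True
```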

You could also do the same operation with file.writelines() and a list comprehension:

...
    json_strs = [json.dumps(j) + "\n" for j in json_objects]
    f.writelines(json_strs)
...

And if you want to append the data instead of writing a new file, just change open(file, "w") to open(file, "a").

In the end, I find this helps a great deal, both with readability when I open JSON files in a text editor and with using memory more efficiently.

On that note, if you change your mind at some point and want a list out of the reader, you can pass the generator to list() and Python will populate the list automatically. In other words, just write

lst = list(json_reader(file))
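The flip side is that a generator lets you stop early without decoding the rest of the input. The sketch below illustrates this with `itertools.islice`. The `iter_json_lines` helper is hypothetical (a variant of the reader that accepts any iterable of lines and skips blank ones), and `io.StringIO` stands in for an open file.

```python
import io
import json
from itertools import islice


def iter_json_lines(lines):
    """Hypothetical variant of json_reader that takes any iterable of lines."""
    for line in lines:
        if line.strip():  # skip blank lines, which json.loads would reject
            yield json.loads(line)


# io.StringIO stands in for an open file here (an assumption for illustration).
fake_file = io.StringIO(
    '{"Timestamp": "20140101", "Usefulness": "Yes"}\n'
    '\n'
    '{"Timestamp": "20140102", "Usefulness": "No"}\n'
    '{"Timestamp": "20140103", "Usefulness": "No"}\n'
)

# Only the first two objects are decoded; the last line is never parsed.
first_two = list(islice(iter_json_lines(fake_file), 2))
print(first_two)
```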

Solution 4

Added streaming support based on the answer of @dunes:

import re
from json import JSONDecoder, JSONDecodeError

NOT_WHITESPACE = re.compile(r"[^\s]")


def stream_json(file_obj, buf_size=1024, decoder=JSONDecoder()):
    buf = ""
    ex = None
    while True:
        block = file_obj.read(buf_size)
        if not block:
            break
        buf += block
        pos = 0
        while True:
            match = NOT_WHITESPACE.search(buf, pos)
            if not match:
                break
            pos = match.start()
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except JSONDecodeError as e:
                ex = e
                break
            else:
                ex = None
                yield obj
        buf = buf[pos:]
    if ex is not None:
        raise ex
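A usage sketch may make the buffering easier to follow. It repeats the function so that it runs on its own, feeds it an `io.StringIO` (standing in for an open file) with a deliberately tiny `buf_size` so that objects straddle block boundaries, and then shows that truncated input raises once the stream ends. The sample inputs are made up for illustration.

```python
import io
import re
from json import JSONDecoder, JSONDecodeError

NOT_WHITESPACE = re.compile(r"[^\s]")


def stream_json(file_obj, buf_size=1024, decoder=JSONDecoder()):
    # Same logic as the function above, repeated so this example is runnable.
    buf = ""
    ex = None
    while True:
        block = file_obj.read(buf_size)
        if not block:
            break
        buf += block
        pos = 0
        while True:
            match = NOT_WHITESPACE.search(buf, pos)
            if not match:
                break
            pos = match.start()
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except JSONDecodeError as e:
                ex = e  # probably an incomplete object; wait for more data
                break
            else:
                ex = None
                yield obj
        buf = buf[pos:]  # keep only the undecoded tail
    if ex is not None:
        raise ex


# Each object is longer than buf_size, so both span several blocks.
source = io.StringIO('{"a": 1}\n{"b": [1, 2]}\n')
objects = list(stream_json(source, buf_size=5))
print(objects)  # [{'a': 1}, {'b': [1, 2]}]

# Input that ends mid-object raises once the stream is exhausted.
try:
    list(stream_json(io.StringIO('{"a": 1} {"b": '), buf_size=5))
    truncated_raised = False
except JSONDecodeError:
    truncated_raised = True
print(truncated_raised)  # True
```

One caveat, as far as I can tell: a bare top-level number that is split across two blocks parses successfully as soon as its first digits arrive, so it would be yielded as two separate values. Objects and arrays are not affected, because they are only valid once their closing bracket has been read.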

user6396

Updated on July 09, 2022

Comments

user6396 almost 2 years
I am very new to JSON files. If I have a JSON file with multiple JSON objects such as the following:
```
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
 "Code":[{"event1":"A","result":"1"},…]}
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
 "Code":[{"event1":"B","result":"1"},…]}
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No",
 "Code":[{"event1":"B","result":"0"},…]}
…
```
I want to extract all "Timestamp" and "Usefulness" values into a data frame:
```
    Timestamp    Usefulness
 0   20140101      Yes
 1   20140102      No
 2   20140103      No
 …
```
Does anyone know a general way to deal with such problems?
- njzk2 over 9 years
  
  having a single JSON array containing all your JSON objects would be much easier
- Diego Marino over 5 years
  
  https://stackoverflow.com/questions/53788395/tweets-streamed-using-tweepy-reading-json-file-in-python/53789187#53789187
exa over 8 years

This is cool, but it prevents you from using the file as an endless stream (e.g. log-like append-only file data) and consumes a lot more memory.
martineau about 5 years

I, too, like this answer quite a bit except for a couple of things: it requires reading the entire file into memory, and it uses undocumented features of JSONDecoder.
David Culbreth almost 5 years

@exa, this is true, but if you need append-only logging for this data stream, perhaps you should be looking at a format other than JSON to transfer your information, as JSON requires the closing bracket for all data structures, implying a non-infinite non-stream format.
Giacomo Alzetta over 4 years

Just FYI: there is a simple escape for non-whitespace characters: \S. The upper case variants are the negation of the lower case ones (so \W = [^\w], \D = [^\d], etc.)
Abilash Amarasekaran about 3 years

This works for AWS Lambda if the file holds multiple JSON objects on a single line. Can you explain in more detail how this works? I'm not able to understand raw_decode or how it can tell where a valid JSON document starts or ends.
Clément over 2 years

What does "You really should only use an iterator when you want to access a random object in the array" mean? Did you mean "list" instead of "iterator"?
Dan Temkin over 2 years

@Clément I meant Iterable. That's my bad.
Clément over 2 years

Iterable doesn't provide random access, AFAIK
Eli Burke about 2 years

This is great, thanks! If you are processing large data files, crank up the block size (about 4MB benchmarked the fastest for me on files from 10MB-2GB) otherwise you get a lot of spurious exceptions from raw_decode which slows it way down.
Joe almost 2 years

You can use it like:

log = []
with open(log_file, 'r') as f:
    for record in stream_json(f):
        log.append(record)