Parse large JSON file in Nodejs

131,744

Solution 1

To process a file line-by-line, you simply need to decouple the reading of the file and the code that acts upon that input. You can accomplish this by buffering your input until you hit a newline. Assuming we have one JSON object per line (basically, format B):

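var fs = require('fs');

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var buf = '';

stream.on('data', function(d) {
    buf += d.toString(); // when data is read, stash it in a string buffer
    pump(); // then process the buffer
});

stream.on('end', function() {
    if (buf.length > 0) processLine(buf); // handle a final line that has no trailing newline
    buf = '';
});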

function pump() {
    var pos;

    while ((pos = buf.indexOf('\n')) >= 0) { // keep going while there's a newline somewhere in the buffer
        if (pos == 0) { // if there's more than one newline in a row, the buffer will now start with a newline
            buf = buf.slice(1); // discard it
            continue; // so that the next iteration will start with data
        }
        processLine(buf.slice(0,pos)); // hand off the line
        buf = buf.slice(pos+1); // and slice the processed data off the buffer
    }
}

function processLine(line) { // here's where we do something with a line

    if (line[line.length-1] == '\r') line = line.slice(0, -1); // discard CR (0x0D)

    if (line.length > 0) { // ignore empty lines
        var obj = JSON.parse(line); // parse the JSON
        console.log(obj); // do something with the data here!
    }
}

Each time the file stream receives data from the file system, it's stashed in a buffer, and then pump is called.

If there's no newline in the buffer, pump simply returns without doing anything. More data (and potentially a newline) will be added to the buffer the next time the stream gets data, and then we'll have a complete object.

If there is a newline, pump slices off the buffer from the beginning to the newline and hands it off to processLine. It then checks again if there's another newline in the buffer (the while loop). In this way, we can process all of the lines that were read in the current chunk.

Finally, processLine is called once per input line. If present, it strips off the carriage return character (to avoid issues with line endings – LF vs CRLF), and then calls JSON.parse on the line. At this point, you can do whatever you need to with your object.
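
To make the chunk-boundary behaviour concrete, here is a minimal, self-contained sketch of the same buffering idea that can be fed chunks by hand instead of from a file. The names createLineSplitter, push and flush are made up for this illustration; they are not part of Node.

```javascript
// Hypothetical helper: buffers incoming text and calls onLine once per complete line.
function createLineSplitter(onLine) {
    let buf = '';

    function emit(line) {
        if (line.endsWith('\r')) line = line.slice(0, -1); // tolerate CRLF endings
        if (line.length > 0) onLine(line);                 // skip empty lines
    }

    return {
        // Add a chunk and emit every line that is now complete.
        push(chunk) {
            buf += chunk;
            let pos;
            while ((pos = buf.indexOf('\n')) >= 0) {
                emit(buf.slice(0, pos));
                buf = buf.slice(pos + 1);
            }
        },
        // Emit whatever is left once the input has ended.
        flush() {
            emit(buf);
            buf = '';
        }
    };
}

// Simulate a stream that splits the second object across two chunks.
const parsed = [];
const splitter = createLineSplitter(line => parsed.push(JSON.parse(line)));
splitter.push('{"name":"thing1"}\n{"na');
splitter.push('me":"thing2"}\r\n\n{"name":"thing3"}');
splitter.flush(); // the last line has no trailing newline
console.log(parsed.map(o => o.name)); // [ 'thing1', 'thing2', 'thing3' ]
```

The half-received object stays in the buffer after the first push and is only parsed once the second chunk completes it.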

Note that JSON.parse is strict about what it accepts as input; you must quote your identifiers and string values with double quotes. In other words, {name:'thing1'} will throw an error; you must use {"name":"thing1"}.
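
Since one malformed line would otherwise throw and abort the whole import, it may be worth wrapping the parse in a try/catch. This is only a sketch: parseLineOrNull is a hypothetical helper name, and returning null for bad input is an assumption that only works if a bare null is never a legitimate line in your data.

```javascript
// Hypothetical helper: returns the parsed value, or null (after logging) if the line is invalid.
function parseLineOrNull(line, lineNumber) {
    try {
        return JSON.parse(line);
    } catch (err) {
        console.error('Skipping line ' + lineNumber + ': ' + err.message);
        return null;
    }
}

const good = parseLineOrNull('{"name":"thing1"}', 1); // parsed object
const bad = parseLineOrNull("{name:'thing1'}", 2);    // null, because JSON.parse threw
```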

Because the buffer never holds more than the current chunk plus any incomplete line left over from the previous one, this is very memory efficient as long as individual lines are reasonably short. It is also fast: in a quick test, I processed 10,000 rows in under 15ms.

Solution 2

As of October 2014, you can just do something like the following (using JSONStream) - https://www.npmjs.org/package/JSONStream

var fs = require('fs'),
    JSONStream = require('JSONStream');

var getStream = function () {
    var jsonData = 'myData.json',
        stream = fs.createReadStream(jsonData, { encoding: 'utf8' }),
        parser = JSONStream.parse('*');
    return stream.pipe(parser);
};

getStream().pipe(MyTransformToDoWhateverProcessingAsNeeded).on('error', function (err) {
    // handle any errors
});

To demonstrate with a working example:

npm install JSONStream event-stream

data.json:

{
  "greeting": "hello world"
}

hello.js:

var fs = require('fs'),
    JSONStream = require('JSONStream'),
    es = require('event-stream');

var getStream = function () {
    var jsonData = 'data.json',
        stream = fs.createReadStream(jsonData, { encoding: 'utf8' }),
        parser = JSONStream.parse('*');
    return stream.pipe(parser);
};

getStream()
    .pipe(es.mapSync(function (data) {
        console.log(data);
    }));

Running it:

$ node hello.js
// hello world

Solution 3

I realize that you want to avoid reading the whole JSON file into memory if possible. However, if you have the memory available, it may not be a bad idea performance-wise. Using Node.js's require() on a JSON file loads the data into memory really fast.

I ran two tests to see what the performance looked like when printing out an attribute from each feature in an 81MB GeoJSON file.

In the first test, I read the entire GeoJSON file into memory using var data = require('./geo.json'). That took 3330 milliseconds, and then printing out an attribute from each feature took 804 milliseconds, for a grand total of 4134 milliseconds. However, it appeared that Node.js was using 411MB of memory.

In the second test, I used @arcseldon's answer with JSONStream + event-stream. I modified the JSONPath query to select only what I needed. This time the memory never went higher than 82MB; however, the whole thing now took 70 seconds to complete!
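
If you want to run this kind of comparison on your own data, the following is a rough sketch of how time and heap growth could be measured from inside Node. The measure helper and the sample workload are invented for this illustration and are not the method used for the numbers above. It only handles synchronous functions, and heap readings are approximate because garbage collection can run at any time.

```javascript
// Hypothetical helper: runs fn once and reports wall-clock time and heap growth.
function measure(label, fn) {
    const heapBefore = process.memoryUsage().heapUsed;
    const start = process.hrtime.bigint();

    const result = fn();

    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    const heapDeltaMb = (process.memoryUsage().heapUsed - heapBefore) / (1024 * 1024);
    console.log(label + ': ' + elapsedMs.toFixed(1) + ' ms, heap change ' + heapDeltaMb.toFixed(1) + ' MB');
    return { result, elapsedMs, heapDeltaMb };
}

// Stand-in workload: build and parse a small JSON array in memory.
const sample = JSON.stringify(Array.from({ length: 1000 }, (_, i) => ({ id: i })));
const report = measure('JSON.parse', () => JSON.parse(sample));
```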

Solution 4

I had a similar requirement: I needed to read a large JSON file in Node.js, process the data in chunks, call an API, and save the results in MongoDB. inputFile.json looks like this:

{
 "customers":[
       { /*customer data*/},
       { /*customer data*/},
       { /*customer data*/}....
      ]
}

I used JSONStream and event-stream to achieve this synchronously, that is, one customer at a time:

var fs = require("fs");
var JSONStream = require("JSONStream");
var es = require("event-stream");

var fileStream = fs.createReadStream(filePath, { encoding: "utf8" });
fileStream.pipe(JSONStream.parse("customers.*")).pipe(
  es.through(
    function write(data) {
      console.log("printing one customer object read from file ::");
      console.log(data);
      this.pause(); // stop reading until this customer has been saved
      processOneCustomer(data, this);
      return data;
    },
    function end() {
      console.log("stream reading ended");
      this.emit("end");
    }
  )
);

function processOneCustomer(data, es) {
  DataModel.save(function(err, dataModel) {
    es.resume();
  });
}
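
Node's built-in stream.Writable offers another way to get the same one-at-a-time behaviour without calling pause() and resume() by hand. The following is a sketch under stated assumptions, not a drop-in replacement: saveCustomer and the sample objects are invented for the example, and Readable.from stands in for the parsed customer stream.

```javascript
const { Readable, Writable, pipeline } = require('stream');

// Stand-in for an asynchronous save such as a database insert (hypothetical).
const saved = [];
function saveCustomer(customer, callback) {
    setTimeout(function () {
        saved.push(customer.id);
        callback(null);
    }, 5);
}

// An object-mode Writable receives the next object only after the
// previous write() has called its callback, so saves never overlap.
const sink = new Writable({
    objectMode: true,
    write(customer, encoding, callback) {
        saveCustomer(customer, callback);
    }
});

// Readable.from stands in for the stream of parsed customer objects.
const source = Readable.from([{ id: 1 }, { id: 2 }, { id: 3 }]);

// pipeline forwards errors from either stream to a single callback.
const done = new Promise(function (resolve, reject) {
    pipeline(source, sink, function (err) {
        if (err) reject(err);
        else resolve(saved);
    });
});

done.then(function (ids) {
    console.log('saved customers in order:', ids);
});
```

Because the callback is what signals readiness for the next object, passing an error to it is also how a failed save would stop the pipeline.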

Solution 5

I wrote a module that can do this, called BFJ. Specifically, the method bfj.match can be used to break up a large stream into discrete chunks of JSON:

const bfj = require('bfj');
const fs = require('fs');

const stream = fs.createReadStream(filePath);

bfj.match(stream, (key, value, depth) => depth === 0, { ndjson: true })
  .on('data', object => {
    // do whatever you need to do with object
  })
  .on('dataError', error => {
    // a syntax error was found in the JSON
  })
  .on('error', error => {
    // some kind of operational error occurred
  })
  .on('end', () => {
    // finished processing the stream
  });

Here, bfj.match returns a readable, object-mode stream that will receive the parsed data items, and is passed 3 arguments:

  1. A readable stream containing the input JSON.

  2. A predicate that indicates which items from the parsed JSON will be pushed to the result stream.

  3. An options object indicating that the input is newline-delimited JSON (this is needed to process format B from the question; it is not required for format A).

Upon being called, bfj.match will parse JSON from the input stream depth-first, calling the predicate with each value to determine whether or not to push that item to the result stream. The predicate is passed three arguments:

  1. The property key or array index (this will be undefined for top-level items).

  2. The value itself.

  3. The depth of the item in the JSON structure (zero for top-level items).

Of course a more complex predicate can also be used as necessary according to requirements. You can also pass a string or a regular expression instead of a predicate function, if you want to perform simple matches against property keys.
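
To show what the predicate sees, here is a small in-memory sketch that walks an already-parsed value and calls a predicate with the same three arguments. matchInMemory is a hypothetical function written for this illustration; it is not how BFJ is implemented (BFJ works incrementally on a stream), and the order in which it visits items may differ from BFJ's.

```javascript
// Hypothetical in-memory illustration of the predicate arguments; not bfj's implementation.
function matchInMemory(root, predicate) {
    const matches = [];

    function visit(key, value, depth) {
        // Descend into arrays and objects first (depth-first).
        if (value !== null && typeof value === 'object') {
            if (Array.isArray(value)) {
                value.forEach((item, index) => visit(index, item, depth + 1));
            } else {
                Object.keys(value).forEach(k => visit(k, value[k], depth + 1));
            }
        }
        if (predicate(key, value, depth)) matches.push(value);
    }

    visit(undefined, root, 0); // the top-level item has no key and depth zero
    return matches;
}

const doc = { customers: [{ name: 'a' }, { name: 'b' }], total: 2 };

// Items two levels below the root: the two customer objects.
const customers = matchInMemory(doc, (key, value, depth) => depth === 2);

// Items stored under the property key "name".
const names = matchInMemory(doc, (key) => key === 'name');
```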

Author by

dgh

Updated on March 26, 2020

Comments

  • dgh
    dgh over 4 years

    I have a file which stores many JavaScript objects in JSON form and I need to read the file, create each of the objects, and do something with them (insert them into a db in my case). The JavaScript objects can be represented in one of two formats:

    Format A:

    [{name: 'thing1'},
    ....
    {name: 'thing999999999'}]
    

    or Format B:

    {name: 'thing1'}         // <== My choice.
    ...
    {name: 'thing999999999'}
    

    Note that the ... indicates a lot of JSON objects. I am aware I could read the entire file into memory and then use JSON.parse() like this:

    fs.readFile(filePath, 'utf-8', function (err, fileContents) {
      if (err) throw err;
      console.log(JSON.parse(fileContents));
    });
    

    However, the file could be really large, so I would prefer to use a stream to accomplish this. The problem I see with a stream is that the file contents could be broken into data chunks at any point, so how can I use JSON.parse() on such objects?

    Ideally, each object would be read as a separate data chunk, but I am not sure on how to do that.

    var importStream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
    importStream.on('data', function(chunk) {
    
        var pleaseBeAJSObject = JSON.parse(chunk);           
        // insert pleaseBeAJSObject in a database
    });
    importStream.on('end', function(item) {
       console.log("Woot, imported objects into the database!");
    });
    

    Note, I wish to prevent reading the entire file into memory. Time efficiency does not matter to me. Yes, I could try to read a number of objects at once and insert them all at once, but that's a performance tweak. I need a way that is guaranteed not to cause a memory overload, no matter how many objects are contained in the file.

    I can choose to use FormatA or FormatB or maybe something else, just please specify in your answer. Thanks!

  • josh3736
    josh3736 almost 12 years
    This doesn't answer the question. Note that the second line of the question says he wants to do this to get data into a database.
  • arcseldon
    arcseldon almost 10 years
    This answer is now redundant. Use JSONStream, and you have out of the box support.
  • John Zwinck
    John Zwinck over 9 years
    This is mostly true and useful, but I think you need to do parse('*') or you won't get any data.
  • arcseldon
    arcseldon over 9 years
    @JohnZwinck Thank you, have updated the answer, and added a working example to demonstrate it fully.
  • Zhigong Li
    Zhigong Li about 9 years
    The function name 'process' is bad. 'process' should be a system variable. This bug confused me for hours.
  • Ahmed Fasih
    Ahmed Fasih about 9 years
    Please consider editing and adding a note that dedicated libraries now exist to do this, and may be preferable to this hand-rolled solution. See @arcseldon's answer at stackoverflow.com/a/24710073/500207
  • givemesnacks
    givemesnacks almost 9 years
    in the first code block, the first set of parentheses var getStream() = function () { should be removed.
  • Kevin B
    Kevin B almost 9 years
    @arcseldon I don't think the fact that there's a library that does this makes this answer redundant. It's certainly still useful to know how this can be done without the module.
  • SLearner
    SLearner almost 9 years
    I am not sure if this would work for a minified json file. What if the whole file was wrapped up in a single line, and using any such delimiters wasn't possible? How do we solve this problem then?
  • zanona
    zanona about 8 years
    Third party libraries are not made of magic you know. They are just like this answer, elaborated versions of hand-rolled solutions, but just packed and labeled as a program. Understanding how things work is much more important and relevant than blindly throwing data into a library expecting results. Just saying :)
  • Keith John Hutchison
    Keith John Hutchison almost 8 years
    This failed with an out of memory error with a 500mb json file.
  • Haziq Ahmed
    Haziq Ahmed almost 6 years
    mongoimport only imports files up to 16MB in size.
  • nonNumericalFloat
    nonNumericalFloat over 4 years
    Thank you so much for adding your answer, my case also needed some synchronous handling. However, after testing, it was not possible for me to call "end()" as a callback after the pipe is finished. I believe the only thing which could be done is adding an event handler for what should happen after the stream is 'finished' / 'close', with fileStream.on('close', ... ).
  • Dan
    Dan over 3 years
    Doesn't buf += data mean that everything coming back from the large file's stream will be stored in memory anyway? Doesn't this defeat the purpose of using a read stream? It seems like fs.readFile would be just as memory-inefficient.
  • remed.io
    remed.io over 3 years
    Hey - this was a great solution BUT there's a typo in your code. You have a parenthesis closing BEFORE function end() - but you need to move it afterward - otherwise end() is not included in the es.through().
  • Griffin
    Griffin over 3 years
    @Dan Yes, the data is continually stored in the buffer as its read to be processed, but you'll notice at the end of while loop in pump(), we slice off the processed data after it's sent to processLine() for parsing.