JSON.parse() on a large array of objects is using way more memory than it should


Solution 1

A few points to note:

  1. You've found that, for whatever reason, it's much more efficient to do individual JSON.parse() calls on each element of your array instead of one big JSON.parse().
  2. The data format you're generating is under your control. Unless I misunderstood, the data file as a whole does not have to be valid JSON, as long as you can parse it.
  3. It sounds like the only issue with your second, more efficient method is the fragility of splitting the original generated JSON.

This suggests a simple solution: Instead of generating one giant JSON array, generate an individual JSON string for each element of your array - with no newlines in the JSON string, i.e. just use JSON.stringify(item) with no space argument. Then join those JSON strings with newline (or any character that you know will never appear in your data) and write that data file.
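For example, the write side might look like this (a sketch only; arr here stands in for whatever array your generator produces, and the filename is the one from your question):

var fs = require('fs');

// Each element becomes one line: JSON.stringify with no space argument
// never emits literal newlines (newlines inside string values are
// escaped as \n in the output).
var lines = arr.map( function( item ) {
    return JSON.stringify( item );
});

// Join on newline and write the whole file at once.
fs.writeFileSync( 'JMdict-all.json', lines.join( '\n' ) + '\n', { encoding: 'utf8' } );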

When you read this data, split the incoming data on the newline, then do the JSON.parse() on each of those lines individually. In other words, this step is just like your second solution, but with a straightforward string split instead of having to fiddle with the character counts and curly braces.

Your code might look something like this (really just a simplified version of what you posted):

var fs = require('fs');
var arr2 = fs.readFileSync(
    'JMdict-all.json',
    { encoding: 'utf8' }
).trim().split('\n').map( function( line ) {
    return JSON.parse( line );
});

As you noted in an edit, you could simplify this code to:

var fs = require('fs');
var arr2 = fs.readFileSync(
    'JMdict-all.json',
    { encoding: 'utf8' }
).trim().split('\n').map( JSON.parse );

But I would be careful about this. It does work in this particular case, but there is a potential danger in the more general case.

The JSON.parse function takes two arguments: the JSON text and an optional "reviver" function.

The [].map() function passes three arguments to the function it calls: the item value, array index, and the entire array.

So if you pass JSON.parse directly to [].map(), it is called with the JSON text as its first argument (as expected), but it is also passed the array index as its second argument, right where the "reviver" function would go. JSON.parse() ignores that second argument because it is not a function reference, so you're OK here. But you can probably imagine other cases where you could get into trouble - so it's always a good idea to triple-check this when you pass an arbitrary function that you didn't write into [].map().
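The classic example of this trap is passing parseInt to [].map(): parseInt takes an optional radix as its second argument, so the array index silently changes the result:

[ '10', '10', '10' ].map( parseInt );
// => [ 10, NaN, 2 ]
// parseInt('10', 0) -> 10  (radix 0 means "use the default")
// parseInt('10', 1) -> NaN (radix 1 is invalid)
// parseInt('10', 2) -> 2   (binary)

// Wrapping the call so only the intended argument gets through avoids this:
[ '10', '10', '10' ].map( function( s ) { return parseInt( s, 10 ); } );
// => [ 10, 10, 10 ]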

Solution 2

I think a comment hinted at the answer to this question, but I'll expand on it a little. The 1 GB of memory in use presumably includes a large amount of allocated data that is actually 'dead' (in that it has become unreachable and is therefore no longer really being used by the program) but has not yet been collected by the garbage collector.

Almost any algorithm that processes a large data set is likely to produce a very large amount of detritus in this manner when it runs on a typical modern language/runtime (e.g. Java/JVM, C#/.NET, JavaScript). Eventually the GC removes it.
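If you want to see how much of that 1 GB is this kind of uncollected garbage, one approach (a sketch; measure.js is a placeholder filename, and the data file is the one from the question) is to run node with the --expose-gc flag, force a full collection with global.gc(), and compare heap usage before and after:

// Run with: node --expose-gc measure.js
var fs = require('fs');

var arr = JSON.parse( fs.readFileSync( 'JMdict-all.json', { encoding: 'utf8' } ) );

console.log( 'heap before gc:', Math.round( process.memoryUsage().heapUsed / 1e6 ), 'MB' );
global.gc(); // forces a full collection of unreachable allocations
console.log( 'heap after gc: ', Math.round( process.memoryUsage().heapUsed / 1e6 ), 'MB' );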

It is interesting to note that there are techniques that can dramatically reduce the amount of ephemeral memory allocation certain algorithms incur - for example, keeping pointers into the middle of one large string rather than copying substrings out of it - but I think these techniques are hard or impossible to employ in JavaScript.


Comments

  • Ahmed Fasih almost 2 years

    I generate a ~200'000-element array of objects (using object literal notation inside map rather than new Constructor()), and I'm saving a JSON.stringify'd version of it to disk, where it takes up 31 MB, including newlines and one-space-per-indentation level (JSON.stringify(arr, null, 1)).

    Then, in a new node process, I read the entire file into a UTF-8 string and pass it to JSON.parse:

    var fs = require('fs');
    var arr1 = JSON.parse(fs.readFileSync('JMdict-all.json', {encoding : 'utf8'}));
    

    Node memory usage is about 1.05 GB according to Mavericks' Activity Monitor! Even typing into a Terminal feels laggier on my ancient 4 GB RAM machine.

    But if, in a new node process, I load the file's contents into a string, chop it up at element boundaries, and JSON.parse each element individually, ostensibly getting the same object array:

    var fs = require('fs');
    var arr2 = fs.readFileSync('JMdict-all.json', {encoding : 'utf8'}).trim().slice(1,-3).split('\n },').map(function(s) {return JSON.parse(s+'}');});
    

node is using just ~200 MB of memory, and there's no noticeable system lag. This pattern persists across many restarts of node: JSON.parse-ing the whole array takes a gig of memory, while parsing it element-wise is much more memory-efficient.

    Why is there such a huge disparity in memory usage? Is this a problem with JSON.parse preventing efficient hidden class generation in V8? How can I get good memory performance without slicing-and-dicing strings? Must I use a streaming JSON parse 😭?

    For ease of experimentation, I've put the JSON file in question in a Gist, please feel free to clone it.

  • Ahmed Fasih almost 9 years
Is there a name for 'field-separated JSON'? I've actually created files like this before, using tabs, but it always felt shady, in part because of hybridizing JSON and TSV, but more seriously because I never knew what to call this or what file extension to use. I wouldn't want to call it JSON, that'll just cause endless confusion. Edit: en.wikipedia.org/wiki/Line_Delimited_JSON looks like it's a thing.
  • Michael Geary almost 9 years
    That's a good point, you wouldn't call the file as a whole JSON, even if each line of it is a JSON text. I would pick any extension you like, or let me suggest: .data :-)
  • Ahmed Fasih almost 9 years
    Line-delimited JSON is a thing, who knew! .ldjson or .ldj is apparently the file extension, or .jsonl.
  • Michael Geary almost 9 years
    Aha! Completing the circle, I added this page as a citation on that Wikipedia article...
  • Ahmed Fasih almost 9 years
    Eeeeek, I didn't know Array.map would pass multiple arguments to a 'first-class function' given to it 😱 it'd be best to always curry such arguments to map.