How to parse a huge JSON file without loading it in memory


Solution 1

You should definitely check different approaches and libraries. If you really care about performance, compare the Gson, Jackson and JsonPath libraries and choose the fastest one. You will definitely have to download the whole JSON file to local disk first, probably to a TMP folder, and parse it after that.

A simple JsonPath solution could look like this:

import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;

import java.io.File;

public class JsonPathApp {
    public static void main(String[] args) throws Exception {
        File jsonFile = new File("./resource/test.json").getAbsoluteFile();

        DocumentContext documentContext = JsonPath.parse(jsonFile);
        System.out.println("" + documentContext.read("$.a"));
        System.out.println("" + documentContext.read("$.b"));
        System.out.println("" + documentContext.read("$.d"));
    }
}

Notice that I do not create any POJO; I just read the given values using JSONPath, similarly to XPath. You can do the same with Jackson:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;

public class JsonPathApp {
    public static void main(String[] args) throws Exception {
        File jsonFile = new File("./resource/test.json").getAbsoluteFile();

        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(jsonFile);
        System.out.println(root.get("a"));
        System.out.println(root.get("b"));
        System.out.println(root.get("d"));
    }
}

We do not need JSONPath because the values we need are directly in the root node. As you can see, the API looks almost the same. We can also create a POJO structure:

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.math.BigDecimal;

public class JsonPathApp {
    public static void main(String[] args) throws Exception {
        File jsonFile = new File("./resource/test.json").getAbsoluteFile();

        ObjectMapper mapper = new ObjectMapper();
        Pojo pojo = mapper.readValue(jsonFile, Pojo.class);
        System.out.println(pojo);
    }
}

@JsonIgnoreProperties(ignoreUnknown = true)
class Pojo {
    private Integer a;
    private BigDecimal b;
    private Integer d;

    // getters, setters
}
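If neither a tree nor a POJO is wanted, Jackson's lower-level streaming API is another option. The following is only a minimal sketch, assuming the top-level values of a, b and d are always numbers; the class name JacksonStreamingSketch and the method readWantedFields are made up for illustration. It walks the top-level fields one token at a time and uses skipChildren() to step over everything else, such as the large c value.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class name, used only for this sketch.
final class JacksonStreamingSketch {

    /** Collects the top-level numeric fields a, b and d; all other values are skipped. */
    static Map<String, BigDecimal> readWantedFields(Reader source) throws IOException {
        Map<String, BigDecimal> found = new LinkedHashMap<>();
        try (JsonParser parser = new JsonFactory().createParser(source)) {
            if (parser.nextToken() != JsonToken.START_OBJECT) {
                throw new IOException("Expected a JSON object at the root");
            }
            // Each iteration handles one top-level field of the root object.
            while (parser.nextToken() == JsonToken.FIELD_NAME) {
                String name = parser.currentName();
                parser.nextToken(); // advance to the field's value
                if (name.equals("a") || name.equals("b") || name.equals("d")) {
                    // Assumption: the wanted values are JSON numbers.
                    found.put(name, parser.getDecimalValue());
                } else {
                    // On an object or array start this jumps to the matching end token;
                    // on a scalar value it does nothing.
                    parser.skipChildren();
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        String sample = "{\"a\": 123, \"b\": 0.26, \"c\": {\"a\": 999, \"x\": [1, 2]}, \"d\": 32}";
        System.out.println(readWantedFields(new StringReader(sample)));
    }
}
```

For a real file, a Reader such as the one returned by Files.newBufferedReader could be passed instead of the StringReader.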

Even though both libraries can read a JSON payload directly from a URL, I suggest downloading it in a separate step using the best approach you can find. For more info, read this article: Download a File From an URL in Java.
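As a rough illustration of the separate download step, here is a sketch that uses only the JDK. The class name DownloadSketch, the method downloadToTempFile and the temp-file prefix are assumptions made for this example, not part of any library mentioned above. Files.copy streams the bytes to disk, so the payload is not held in memory as a whole.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical class name, used only for this sketch.
final class DownloadSketch {

    /** Streams the resource at sourceUrl into a new temp file and returns its path. */
    static Path downloadToTempFile(URL sourceUrl) throws IOException {
        Path target = Files.createTempFile("download-", ".json");
        // try-with-resources closes the network stream even if the copy fails.
        try (InputStream in = sourceUrl.openStream()) {
            // createTempFile already created the file, so it has to be replaced.
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Usage: pass the URL of the JSON document as the first argument.
        Path saved = downloadToTempFile(URI.create(args[0]).toURL());
        System.out.println("Saved to " + saved);
    }
}
```

The returned path can then be handed to whichever parser is chosen, for example via toFile().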

Solution 2

There are some excellent libraries for parsing large JSON files with minimal resources. One is the popular GSON library. It can parse the file in a mixed mode, treating it as both a stream and as objects: it handles each record as it passes, then discards it, keeping memory usage low.

If you’re interested in using the GSON approach, there’s a great tutorial for that here: Detailed Tutorial
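To make the streaming idea concrete, here is a minimal sketch using Gson's JsonReader. It is not taken from the linked tutorial. The class name GsonStreamingSketch and the method readWantedFields are invented for illustration, and the sketch assumes that a, b and d are numeric values at the top level of the document. Every other value is passed over with skipValue().

```java
import com.google.gson.stream.JsonReader;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class name, used only for this sketch.
final class GsonStreamingSketch {

    /** Reads the top-level numeric fields a, b and d; every other value is skipped. */
    static Map<String, Double> readWantedFields(Reader source) throws IOException {
        Map<String, Double> found = new LinkedHashMap<>();
        try (JsonReader reader = new JsonReader(source)) {
            reader.beginObject();
            while (reader.hasNext()) {
                String name = reader.nextName();
                if (name.equals("a") || name.equals("b") || name.equals("d")) {
                    // Assumption: the wanted values are JSON numbers.
                    found.put(name, reader.nextDouble());
                } else {
                    // Consumes the whole value, including nested arrays and objects,
                    // without building Java objects for it.
                    reader.skipValue();
                }
            }
            reader.endObject();
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        String sample = "{\"a\": 123, \"b\": 0.26, \"c\": [{\"a\": 1}, [2, 3]], \"d\": 32}";
        System.out.println(readWantedFields(new StringReader(sample)));
    }
}
```

For a file on disk, the StringReader could be replaced by a buffered file reader, so only a small read buffer is in memory at any time.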

Solution 3

I only want the integer values stored for keys a, b and d and ignore the rest of the JSON (i.e. ignore whatever is there in the c value). ... How do I do this without loading the entire file in memory?

One way would be to use jq's so-called streaming parser, invoked with the --stream option. This does exactly what you want, but there is a trade-off between space and time, and using the streaming parser is usually more difficult.

In the present case, for example, using the non-streaming (i.e., default) parser, one could simply write:

jq '.a, .b, .d' big.json

Using the streaming parser, you would have to write something like:

jq --stream 'select(length==2 and .[0][-1] == ("a","b","d"))[1]' big.json

or if you prefer:

jq -c --stream '["a","b","d"] as $keys | select(length==2 and (.[0][-1] | IN($keys[])))[1]' big.json

Note on Java and jq

Although there are Java bindings for jq (see e.g. "𝑸: What language bindings are available for Java?" in the jq FAQ), I do not know of any that work with the --stream option.

However, since 2.5MB is tiny for jq, you could use one of the available Java-jq bindings without bothering with the streaming parser.

Author by

Sumit

Java, Javascript, Node.js, GoLang, React.js, Angular.js.

Updated on June 29, 2022

Comments

  • Sumit
    Sumit almost 2 years

    I have a large JSON file (2.5MB) containing about 80000 lines.

    It looks like this:

    {
      "a": 123,
      "b": 0.26,
      "c": [HUGE irrelevant object],
      "d": 32
    }
    

    I only want the integer values stored for keys a, b and d and ignore the rest of the JSON (i.e. ignore whatever is there in the c value).

    I cannot modify the original JSON as it is created by a 3rd party service, which I download from its server.

    How do I do this without loading the entire file in memory?

    I tried using gson library and created the bean like this:

    public class MyJsonBean {
      @SerializedName("a")
      @Expose
      public Integer a;
    
      @SerializedName("b")
      @Expose
      public Double b;
    
      @SerializedName("d")
      @Expose
      public Integer d;
    }
    

    but even then, in order to deserialize it using Gson, I need to download + read the whole file in memory first and then pass it as a string to Gson?

    File myFile = new File(<FILENAME>);
    myFile.createNewFile();

    URL url = new URL(<URL>);
    URLConnection conn = url.openConnection();

    // Download: copy the response to disk in 1 KB chunks
    try (InputStream in = conn.getInputStream();
         OutputStream out = new BufferedOutputStream(new FileOutputStream(myFile))) {
      byte[] buffer = new byte[1024];
      int numRead;
      while ((numRead = in.read(buffer)) != -1) {
        out.write(buffer, 0, numRead);
      }
    } // closing flushes the buffered output before the file is read back

    // Read the whole file into memory as one String (the step I want to avoid)
    byte[] data = Files.readAllBytes(myFile.toPath());
    String str = new String(data, "UTF-8");

    Gson gson = new Gson();
    MyJsonBean response = gson.fromJson(str, MyJsonBean.class);

    System.out.println("a: " + response.a + ", b: " + response.b + ", d: " + response.d);
    

    Is there any way to avoid loading the whole file and just get the relevant values that I need?