Parsing large JSON file in .NET


Solution 1

As you've correctly diagnosed in your update, the issue is that the JSON has a closing ] followed immediately by an opening [ to start the next set. This format makes the JSON invalid when taken as a whole, and that is why Json.NET throws an error.

Fortunately, this problem seems to come up often enough that Json.NET actually has a special setting to deal with it. If you use a JsonTextReader directly to read the JSON, you can set the SupportMultipleContent flag to true, and then use a loop to deserialize each item individually.

This should allow you to process the non-standard JSON successfully and in a memory efficient manner, regardless of how many arrays there are or how many items in each array.

    // Needs System, System.IO, System.Net, and Newtonsoft.Json.
    using (WebClient client = new WebClient())
    using (Stream stream = client.OpenRead(stringUrl))
    using (StreamReader streamReader = new StreamReader(stream))
    using (JsonTextReader reader = new JsonTextReader(streamReader))
    {
        // Allow several top-level JSON values (here, back-to-back arrays) in one stream.
        reader.SupportMultipleContent = true;

        var serializer = new JsonSerializer();
        while (reader.Read())
        {
            // Deserialize one Contact each time an object starts; array tokens are skipped.
            if (reader.TokenType == JsonToken.StartObject)
            {
                Contact c = serializer.Deserialize<Contact>(reader);
                Console.WriteLine(c.FirstName + " " + c.LastName);
            }
        }
    }

Full demo here: https://dotnetfiddle.net/2TQa8p
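
To try the technique without a network call, here is a minimal self-contained sketch. The Contact class, the MultiContentDemo class, and the ReadContacts helper are assumptions made up for illustration (they are not from the question); only JsonTextReader, SupportMultipleContent, and JsonSerializer come from Json.NET. It collects the results in a list purely to keep the demo short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

// Hypothetical stand-in for the asker's Contact class.
public class Contact
{
    [JsonProperty("firstname")]
    public string FirstName { get; set; }

    [JsonProperty("lastname")]
    public string LastName { get; set; }
}

public static class MultiContentDemo
{
    // Reads every object from text that may contain several back-to-back arrays.
    public static List<Contact> ReadContacts(TextReader text)
    {
        var contacts = new List<Contact>();
        using (var reader = new JsonTextReader(text))
        {
            // Without this flag the reader rejects the second top-level array.
            reader.SupportMultipleContent = true;

            var serializer = new JsonSerializer();
            while (reader.Read())
            {
                if (reader.TokenType == JsonToken.StartObject)
                {
                    contacts.Add(serializer.Deserialize<Contact>(reader));
                }
            }
        }
        return contacts;
    }

    public static void Main()
    {
        // Two arrays in a row, mimicking the shape described in the question.
        string json = "[{\"firstname\":\"a\",\"lastname\":\"b\"}]\n"
                    + "[{\"firstname\":\"c\",\"lastname\":\"d\"}]";

        foreach (Contact c in ReadContacts(new StringReader(json)))
        {
            Console.WriteLine(c.FirstName + " " + c.LastName);
        }
    }
}
```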

Solution 2

Json.NET supports deserializing directly from a stream. Here is a way to deserialize your JSON with a StreamReader, which reads the JSON one piece at a time instead of loading the entire JSON string into memory.

    using (WebClient client = new WebClient())
    {
        using (StreamReader sr = new StreamReader(client.OpenRead(stringUrl)))
        {
            using (JsonReader reader = new JsonTextReader(sr))
            {
                JsonSerializer serializer = new JsonSerializer();

                // Read the JSON from the stream.
                // The raw JSON text is never held in memory as one string, because only a small piece
                // is read at a time from the HTTP response. The resulting list still holds every Contact.
                IList<Contact> result = serializer.Deserialize<List<Contact>>(reader);
            }
        }
    }

Reference: JSON.NET Performance Tips
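
If the contacts can be handled one at a time, a lazy iterator is one way to avoid building the whole list. The following is only a sketch under that assumption: Contact, LazyContactReader, and StreamContacts are hypothetical names, and a MemoryStream stands in for the stream returned by client.OpenRead(stringUrl).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Newtonsoft.Json;

// Hypothetical stand-in for the asker's Contact class.
public class Contact
{
    [JsonProperty("firstname")]
    public string FirstName { get; set; }

    [JsonProperty("lastname")]
    public string LastName { get; set; }
}

public static class LazyContactReader
{
    // Yields contacts one by one; nothing is read until the caller enumerates.
    public static IEnumerable<Contact> StreamContacts(Stream stream)
    {
        using (var streamReader = new StreamReader(stream))
        using (var reader = new JsonTextReader(streamReader))
        {
            // Also tolerate several arrays in a row, as in the question's data.
            reader.SupportMultipleContent = true;

            var serializer = new JsonSerializer();
            while (reader.Read())
            {
                if (reader.TokenType == JsonToken.StartObject)
                {
                    yield return serializer.Deserialize<Contact>(reader);
                }
            }
        }
    }

    public static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes(
            "[{\"firstname\":\"a\",\"lastname\":\"b\"},{\"firstname\":\"c\",\"lastname\":\"d\"}]");

        // Each Contact can be collected after its loop iteration unless the caller keeps it.
        foreach (Contact c in StreamContacts(new MemoryStream(bytes)))
        {
            Console.WriteLine(c.FirstName + " " + c.LastName);
        }
    }
}
```

The stream and readers stay open until enumeration finishes, so the sequence should be consumed inside a foreach (or disposed) rather than stored for later.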

Solution 3

I have done a similar thing in Python for a 5 GB file. I downloaded the file to a temporary location and read it line by line to build a JSON object, similar to how SAX works.

For C# using Json.NET, you can download the file, open it with a StreamReader, pass that reader to a JsonTextReader, and parse it into a JToken (a JArray for your data) using JToken.ReadFrom(yourJsonTextReader).
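
A rough sketch of that idea follows. The JTokenFileDemo class, the ReadTopLevelTokens helper, and the temporary file are assumptions for illustration only; JToken.ReadFrom and SupportMultipleContent are the Json.NET pieces. Each top-level array is loaded as a whole, so memory use is bounded by the largest array rather than by the file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JTokenFileDemo
{
    // Yields each top-level JSON value in the file (here, each array) as a JToken.
    public static IEnumerable<JToken> ReadTopLevelTokens(string path)
    {
        using (StreamReader file = File.OpenText(path))
        using (var reader = new JsonTextReader(file))
        {
            reader.SupportMultipleContent = true;

            // Read() moves to the start of the next top-level value;
            // ReadFrom() then consumes that value up to its end token.
            while (reader.Read())
            {
                yield return JToken.ReadFrom(reader);
            }
        }
    }

    public static void Main()
    {
        // A temporary file stands in for the downloaded JSON.
        string path = Path.GetTempFileName();
        File.WriteAllText(path,
            "[{\"firstname\":\"a\"}]\n[{\"firstname\":\"b\"},{\"firstname\":\"c\"}]");

        foreach (JToken token in ReadTopLevelTokens(path))
        {
            var array = (JArray)token;
            Console.WriteLine(array.Count + " item(s), first: " + (string)array[0]["firstname"]);
        }

        File.Delete(path);
    }
}
```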

Author by

Yavar Hasanov

Updated on July 13, 2020

Comments

  • Yavar Hasanov
    Yavar Hasanov almost 4 years

    So far I have used the JsonConvert.DeserializeObject(json) method of Json.NET, which worked quite well, and to be honest I didn't need anything more than this.

    I am working on a background (console) application which constantly downloads the JSON content from different URLs, then deserializes the result into a list of .NET objects.

    using (WebClient client = new WebClient())
    {
        string json = client.DownloadString(stringUrl);

        var result = JsonConvert.DeserializeObject<List<Contact>>(json);
    }

    The simple code snippet above probably doesn't seem perfect, but it does the job. When the file is large (15,000 contacts, a 48 MB file), JsonConvert.DeserializeObject isn't the solution: the line throws a JsonReaderException.

    The downloaded JSON content is an array, and this is what a sample looks like. Contact is a container class for the deserialized JSON object.

    [
      {
        "firstname": "sometext",
        "lastname": "sometext"
      },
      {
        "firstname": "sometext",
        "lastname": "sometext"
      },
      {
        "firstname": "sometext",
        "lastname": "sometext"
      },
      {
        "firstname": "sometext",
        "lastname": "sometext"
      }
    ]
    

    My initial guess is that it runs out of memory. Just out of curiosity, I tried to parse it as a JArray, which caused the same exception.

    I have started to dive into Json.NET documentation and read similar threads. As I haven't managed to produce a working solution yet, I decided to post a question here.

    UPDATE: While deserializing line by line, I got the same error: " [. Path '', line 600003, position 1." So I downloaded two of the files and checked them in Notepad++. I noticed that if the array has more than 12,000 elements, the "[" is closed after the 12,000th element and another array starts. In other words, the JSON looks exactly like this:

    [
      {
        "firstname": "sometext",
        "lastname": "sometext"
      },
      {
        "firstname": "sometext",
        "lastname": "sometext"
      },
      {
        "firstname": "sometext",
        "lastname": "sometext"
      },
      {
        "firstname": "sometext",
        "lastname": "sometext"
      }
    ]
    [
      {
        "firstname": "sometext",
        "lastname": "sometext"
      },
      {
        "firstname": "sometext",
        "lastname": "sometext"
      },
      {
        "firstname": "sometext",
        "lastname": "sometext"
      },
      {
        "firstname": "sometext",
        "lastname": "sometext"
      }
    ]
    
  • Yavar Hasanov
    Yavar Hasanov almost 9 years
    It makes sense. I will try this and post the updates here. Thanks a mil.
  • nixdaemon
    nixdaemon over 8 years
    Look for Kristian's answer below. He has a code implementation. It's a pretty similar concept to what I explained above, but I like Kristian's approach better :)
  • Ibraheem Al-Saady
    Ibraheem Al-Saady over 7 years
    I was this close to building my own parser. This is awesome, thanks Brian.
  • John Bledsoe
    John Bledsoe over 6 years
    This code may not load the entire stream into memory, but will certainly load the entire list of contacts into memory. Unless the Contact object throws away large amounts of data from the stream, you've just pushed your memory problem downstream.