JSON object to Parquet format using Java without converting to Avro (without using Spark, Hive, Pig, Impala)

Solution 1

To convert JSON data files to Parquet, you need some in-memory representation. Parquet doesn't have its own set of Java objects; instead, it reuses the objects from other formats, like Avro and Thrift. The idea is that Parquet works natively with the objects your applications probably already use.

To convert your JSON, you need to convert the records to Avro in-memory objects and pass those to Parquet, but you don't need to convert a file to Avro and then to Parquet.

The conversion to Avro objects is already implemented for you in Kite's JsonUtil, and it is available as a ready-to-use file reader (JSONFileReader). The conversion method needs an Avro schema, but you can use the same library to infer an Avro schema from the JSON data.
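
To give a rough idea of what an inferred schema looks like, here is a minimal sketch in plain Java that builds an Avro record schema (as JSON text) from one flat sample record. It is not how Kite's JsonUtil works internally; the class AvroSchemaSketch and its methods are hypothetical names made up for this illustration. It assumes flat records, field names that are already valid Avro names, and a simple type mapping (integers to long, floating point to double, everything else to string). Every field is written as a union with "null" and a null default, which is how Avro marks a field as optional.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Kite, Avro, or Parquet.
final class AvroSchemaSketch {

    // Maps a sample Java value to an Avro primitive type name (assumed mapping).
    static String avroType(Object value) {
        if (value instanceof Boolean) return "boolean";
        if (value instanceof Integer || value instanceof Long) return "long";
        if (value instanceof Float || value instanceof Double) return "double";
        return "string"; // fallback: treat anything else as text
    }

    // Builds an Avro record schema as JSON text from one flat sample record.
    // Each field is a ["null", type] union with default null, i.e. optional.
    // Assumes recordName and the map keys are valid Avro names (no escaping is done).
    static String inferSchemaJson(String recordName, Map<String, Object> sample) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"type\":\"record\",\"name\":\"").append(recordName)
          .append("\",\"fields\":[");
        boolean first = true;
        for (Map.Entry<String, Object> e : sample.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append("{\"name\":\"").append(e.getKey())
              .append("\",\"type\":[\"null\",\"").append(avroType(e.getValue()))
              .append("\"],\"default\":null}");
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the field order of the sample record.
        Map<String, Object> sample = new LinkedHashMap<>();
        sample.put("id", 42L);
        sample.put("name", "alice");
        sample.put("active", true);
        System.out.println(inferSchemaJson("Message", sample));
    }
}
```

If Avro is on the classpath, the resulting text can be turned into a Schema object with new Schema.Parser().parse(json). A real inference step would also have to handle nested objects, arrays, and fields whose type differs between records.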

To write those records, you just need to use AvroParquetWriter. The whole setup looks like this:

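import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData.Record;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.kitesdk.data.spi.JsonUtil;
import org.kitesdk.data.spi.filesystem.JSONFileReader;

// fs is a Hadoop FileSystem; source and outputPath are Hadoop Paths.
// Infer an Avro schema by sampling up to 20 records from the JSON input.
Schema jsonSchema = JsonUtil.inferSchema(fs.open(source), "RecordName", 20);

// Read the JSON file as Avro generic records that match the inferred schema.
try (JSONFileReader<Record> reader = new JSONFileReader<>(
                    fs.open(source), jsonSchema, Record.class)) {

  reader.initialize();

  // Write each record to a Snappy-compressed Parquet file.
  try (ParquetWriter<Record> writer = AvroParquetWriter
      .<Record>builder(outputPath)
      .withConf(new Configuration())
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .withSchema(jsonSchema)
      .build()) {
    for (Record record : reader) {
      writer.write(record);
    }
  }
}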

Solution 2

I had the same problem, and as far as I could tell there are not many samples available for writing Parquet without Avro or another framework. In the end I went with Avro. :)

Have a look at this; it may help you.

Author by

vijju

Updated on May 05, 2021

Comments

  • vijju
    vijju about 3 years

    I need to convert messages that arrive as JSON objects to Apache Parquet format using Java. Any sample code or examples would be helpful. Everything I have found so far uses Hive, Pig, or Spark for the conversion. I need to convert to Parquet in plain Java, without involving any of these.

    • vijju
      vijju over 7 years
      Thanks in advance....
    • Shivkumar Mallesappa
      Shivkumar Mallesappa about 6 years
      Hello Vijju, I want to implement the same solution. Could you please provide sample code if you have implemented it successfully? I am a bit confused. Thanks.
  • Cherry
    Cherry almost 4 years
    How can the fields of the generated schema be made optional instead of required?