How to convert a JSON file to parquet using Apache Spark?


Spark 1.4 and later

You can use Spark SQL to first read the JSON file into a DataFrame, then write the DataFrame out as a Parquet file.

val df = sqlContext.read.json("path/to/json/file")
df.write.parquet("path/to/parquet/file")

or

df.save("path/to/parquet/file", "parquet")

Check here and here for examples and more details.
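If the output directory may already exist, you can set a save mode explicitly so the write does not fail with "path already exists". A minimal spark-shell sketch for Spark 1.4+, with placeholder paths:

```scala
// Spark 1.4+ sketch (inside spark-shell, where sqlContext is predefined).
// The paths are placeholders; adjust them to your files.
val df = sqlContext.read.json("path/to/json/file")

// Overwrite any existing output instead of throwing an error.
df.write.mode("overwrite").parquet("path/to/parquet/file")

// Read the Parquet back to verify the round trip.
val parquetDF = sqlContext.read.parquet("path/to/parquet/file")
parquetDF.printSchema()
```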

Spark 1.3.1

val df = sqlContext.jsonFile("path/to/json/file")
df.saveAsParquetFile("path/to/parquet/file")
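In spark-shell on 1.3.1, `sqlContext` is predefined; in a standalone application you build it from the SparkContext yourself. A sketch under that assumption (app name, master, and paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Standalone-app setup for Spark 1.3.1; in spark-shell, sc and sqlContext already exist.
val conf = new SparkConf().setAppName("JsonToParquet").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// 1.3.x API: jsonFile/saveAsParquetFile (replaced by read.json/write.parquet in 1.4+).
val df = sqlContext.jsonFile("path/to/json/file")
df.saveAsParquetFile("path/to/parquet/file")
```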

Issue related to Windows and Spark 1.3.1

Saving a DataFrame as a parquet file on Windows will throw a java.lang.NullPointerException, as described here.

In that case, consider upgrading to a more recent Spark version.
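If upgrading is not an option: this NullPointerException on Windows is commonly caused by Hadoop's missing native binaries (winutils.exe), and pointing HADOOP_HOME at a directory that contains them often works around it. A sketch with placeholder paths, not a guaranteed fix:

```shell
REM Windows workaround sketch: place winutils.exe under %HADOOP_HOME%\bin first.
REM C:\hadoop is a placeholder; adjust to your installation.
set HADOOP_HOME=C:\hadoop
set PATH=%HADOOP_HOME%\bin;%PATH%
spark-shell
```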

odbhut.shei.chhele

Updated on January 06, 2020

Comments

  • odbhut.shei.chhele over 4 years: I am new to Apache Spark 1.3.1. How can I convert a JSON file to Parquet?
  • Rami over 8 years: @eddard.stark I have updated my answer to include Spark 1.3.1
  • odbhut.shei.chhele over 8 years: I am getting a NullPointerException when I try saveAsParquetFile
  • Rami over 8 years: Are you trying this in the Spark shell or in some IDE?
  • odbhut.shei.chhele over 8 years: I am using spark-shell
  • odbhut.shei.chhele over 8 years: I am using spark-1.3.1-bin-hadoop2.6
  • Rami over 8 years: I have just tried exactly these two lines of code on spark-1.3.1-bin-hadoop2.6 and it worked. Please check your code: make sure you are not writing to a non-existent directory and that you are correctly reading the file into the DataFrame.
  • odbhut.shei.chhele over 8 years: I am working inside the bin folder. Is that a problem?
  • Rami over 8 years