Pyspark: How to convert a spark dataframe to json and save it as json file?


Solution 1

For PySpark you can write your DataFrame to a JSON file directly; there is no need to convert the DataFrame to JSON first.

df_final.coalesce(1).write.format('json').save('/path/file_name.json')

If you still want to convert your DataFrame to JSON, you can use df_final.toJSON().
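
Note that save() writes a directory of part files rather than a single file called file_name.json. A minimal sketch of both approaches (the output path below is a placeholder):

# Write the DataFrame as JSON (one JSON object per line) into a directory of part files.
# coalesce(1) merges the output into a single part file inside that directory.
df_final.coalesce(1).write.mode('overwrite').json('/path/output_dir')

# Alternatively, convert each row to a JSON string without writing to disk.
json_rdd = df_final.toJSON()   # RDD of JSON strings, one per row
print(json_rdd.take(2))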

Solution 2

If you use Spark to process the result as JSON files, the output written to HDFS already has the right schema (one JSON object per line).

I assume the issue you ran into is that you cannot read that data back from a normal Python script by using:

import json

with open('data.json') as f:
    data = json.load(f)   # fails: the file holds one JSON object per line, not a single document

You should read the data line by line instead:

data = []
with open("data.json", 'r') as datafile:
    for line in datafile:
        # each line is a complete JSON object
        data.append(json.loads(line))

Then you can use pandas to create a DataFrame:

import pandas as pd
df = pd.DataFrame(data)
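
Alternatively, pandas can parse the line-delimited JSON that Spark writes directly; a sketch assuming the part file has been downloaded or renamed to data.json:

import pandas as pd

# read_json with lines=True treats each line of the file as a separate JSON record
df = pd.read_json('data.json', lines=True)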

Solution 3

Another option is to collect the rows to the driver and write them with json.dump (each Row is converted to a dict so it can be serialized):

import json

# collect() brings every row to the driver, so this is only suitable for small results
collected_df = [row.asDict() for row in df_final.collect()]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(collected_df, outfile)
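
If you specifically need the comma-separated array enclosed in square brackets that the question asks for, a variant of the same idea is to let Spark serialize each row with toJSON() and assemble the array on the driver (again, only for results small enough to collect):

import json

# Each element of the collected list is already a JSON string for one row.
json_rows = df_final.toJSON().collect()
with open(data_output_file + 'createjson.json', 'w') as outfile:
    outfile.write('[' + ',\n'.join(json_rows) + ']')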

Comments

  • Shankar Panda almost 2 years

    I am trying to convert my PySpark SQL DataFrame to JSON and then save it as a file.

    df_final = df_final.union(join_df)
    

    df_final contains values like this (a screenshot in the original post shows columns Variable, Min, and Max):

    I tried something like this, but it created invalid JSON:

    df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)
    
    {"Variable":"Col1","Min":"20","Max":"30"}
    {"Variable":"Col2","Min":"25,"Max":"40"}
    

    My expected file should have data as below:

    [
    {"Variable":"Col1",
    "Min":"20",
    "Max":"30"},
    {"Variable":"Col2",
    "Min":"25,
    "Max":"40"}]
    
  • Shankar Panda over 5 years
    Actually this is correct, but it does not create the file directly in HDFS; it creates it on the container where the script runs.
  • Shankar Panda over 5 years
    Yeah, but it stores the data line by line, e.g. {"Variable":"Col1","Min":"20","Max":"30"} {"Variable":"Col2","Min":"25","Max":"40"}, whereas it should be comma-separated and enclosed in square brackets.
  • paulochf about 5 years
    It uses driver memory, so it's not recommended.
  • Fahad Ashraf about 3 years
    I was trying to understand why there was an answer about reading the JSON file rather than writing it out. I understand now: the JSON format that Spark writes out is not comma delimited, so it must be read back in a little differently. Thank you so much for this.
  • chilun about 3 years
    @FahadAshraf Glad that helped. And yes, the JSON format that Spark writes out is not comma delimited. It is very confusing the first time you read a JSON file created by Spark (or other HDFS tooling).