saving a dataframe to JSON file on local drive in pyspark
Solution 1
Could you not just use
df.toJSON()
as shown here? If not, then first transform into a pandas DataFrame and then write to JSON. Note the raw string for the path: backslash sequences like \U in a Windows path would otherwise be treated as escape sequences.
pandas_df = df.toPandas()
pandas_df.to_json(r"C:\Users\username\test.JSON")
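A minimal, runnable sketch of the pandas route (the frame, column names, and output path below are stand-ins for your own; a temp path is used so the sketch runs anywhere, but the same raw-string caveat applies to a real Windows path):

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical small frame standing in for the result of df.toPandas().
pandas_df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# A raw string (r"C:\Users\...") or doubled backslashes avoids "\U" being
# parsed as an escape sequence on Windows.
out_path = os.path.join(tempfile.gettempdir(), "test.json")
pandas_df.to_json(out_path)

# pandas' default orient="columns" produces {"id": {"0": 1, ...}, ...}
with open(out_path) as f:
    data = json.load(f)
```

Keep in mind this pulls the whole DataFrame onto the driver, so it is only suitable for data that fits in memory.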
Solution 2
When working with large data, converting a PySpark DataFrame to pandas is not advisable. You can use the command below to save a JSON file in the output directory; here df is a pyspark.sql.dataframe.DataFrame. The cluster will generate a part file inside the output directory.
df.coalesce(1).write.format('json').save('/your_path/output_directory')
Solution 3
I would avoid using write.json, since it's causing problems on Windows. Using Python's own file writing should skip creating the temp directories that are giving you issues.
with open("C:\\Users\\username\\test.json", "w+") as output_file:
    # df.toJSON() returns an RDD of JSON strings, so collect and join them first
    output_file.write("\n".join(df.toJSON().collect()))
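One pitfall here: df.toJSON() returns an RDD of JSON strings, not a single string, so the strings must be collected and joined before writing. The resulting file is newline-delimited JSON (one object per line), which can be sketched with the stdlib alone; the rows below are hypothetical stand-ins for what df.toJSON().collect() would return:

```python
import json
import os
import tempfile

# Hypothetical rows standing in for the JSON documents Spark would emit,
# one per DataFrame row.
rows = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
lines = [json.dumps(row) for row in rows]

# One object per line (newline-delimited JSON) matches Spark's own layout.
out_path = os.path.join(tempfile.gettempdir(), "test.json")
with open(out_path, "w") as output_file:
    output_file.write("\n".join(lines))

# Reading it back line by line recovers the original rows.
with open(out_path) as f:
    parsed = [json.loads(line) for line in f]
```

Note that collect() brings every row to the driver, so this only works for data that fits in driver memory.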
Jared
Updated on January 12, 2020
Comments
-
Jared almost 4 years
I have a dataframe that I am trying to save as a JSON file using pyspark 1.4, but it doesn't seem to be working. When I give it the path to the directory, it returns an error stating it already exists. My assumption based on the documentation was that it would save a JSON file in the path that you give it.
df.write.json("C:\Users\username")
Specifying a directory with a name doesn't produce any file and gives an error of "java.io.IOException: Mkdirs failed to create file:/C:Users/username/test/_temporary/....etc". It does, however, create a directory named test which contains several sub-directories with blank crc files.
df.write.json("C:\Users\username\test")
And adding a file extension of JSON produces the same error:
df.write.json("C:\Users\username\test.JSON")
-
Brobin over 8 years
I think you need to give it a complete file name, not just the directory.
-
Jared over 8 years
Tried that as well and updated the post. It seems like there needs to be some sort of temp directory defined, but the documentation doesn't call that out clearly.
-
Kavindu Dodanduwa over 8 years
Do you have permission to write and make directories for the specific "username"?
-
Jared over 8 years
Yes, I verified the permissions on that directory and used getpass.getuser() from Python to verify that I was logged in as that user via the console.
-
urug over 8 years
Try an alternate approach such as df.toJSON().saveAsTextFile(path)
-
Jared over 8 years
Produces the same error as the other attempts.
-
Kavindu Dodanduwa over 8 years
Did you try this on a Linux environment? Also, have you used Spark before?
-
Kavindu Dodanduwa over 8 years
I too faced such a problem when using Windows, so I changed to Linux, where the same code worked perfectly.
-
Jared over 8 years
Thanks for giving it a try. I figured it had something to do with Windows, ughhh...
-
Jared over 8 years
If I use output_file.write(df.toJSON()) it produces TypeError: expected character buffer object. I'm assuming it is passing it an array, which then causes the failure, because if I use output_file.write(df.toJSON().first()) it will successfully create the JSON file with only one line in it.
-
Brobin over 8 years
Great! I added the escape slashes to my answer.
-
Jared over 8 years
df.toJSON() doesn't seem to accept an array, but if I pass it a single line it works. I'm trying to debug this more.
-
Jared over 8 years
Converting to a pandas dataframe works perfectly. I would probably just use a pandas dataframe the entire time, unless there are memory or processing issues that would arise from a much larger data set.
-
Wesley Bowman over 8 years
Yeah, I use DataFrames as often as I can. If memory becomes a problem, take a look at Dask.