Saving a DataFrame to a JSON file on a local drive in PySpark


Solution 1

Could you not just use

df.toJSON()

as shown here? If not, then first convert it to a pandas DataFrame and write that to JSON.

pandas_df = df.toPandas()
# Raw string so the backslashes in the Windows path aren't treated as escapes
pandas_df.to_json(r"C:\Users\username\test.JSON")
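
Note that df.toJSON() produces an RDD of JSON strings (one per row) rather than a file, so if you take that route you still have to save the RDD yourself. A minimal sketch, assuming a placeholder output path; like any Spark save, this writes a directory of part files rather than a single file:

# df.toJSON() yields one JSON string per row; saveAsTextFile
# writes them out as a directory of part files.
df.toJSON().saveAsTextFile('/your_path/json_output_directory')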

Solution 2

When working with large data, converting a PySpark DataFrame to pandas is not advisable. You can use the command below to save a JSON file in an output directory. Here df is a pyspark.sql.dataframe.DataFrame. The cluster will generate a part file inside the output directory.

df.coalesce(1).write.format('json').save('/your_path/output_directory')
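
Because coalesce(1) produces exactly one part file, you can then move that part file to a single, predictably named JSON file with plain Python. A minimal sketch, assuming the output directory above; both paths are placeholders:

import glob
import shutil

# coalesce(1) leaves exactly one part file in the directory;
# find it and give it a stable name. Paths are placeholders.
part_file = glob.glob('/your_path/output_directory/part-*')[0]
shutil.move(part_file, '/your_path/output_directory/output.json')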

Solution 3

I would avoid using write.json, since it's causing problems on Windows. Writing the file with plain Python should skip creating the temp directories that are giving you issues.

with open("C:\\Users\\username\\test.json", "w+") as output_file:
    output_file.write(df.toJSON())
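
Note that collect() pulls every row to the driver at once, which can exhaust memory on large data. A hedged alternative sketch, assuming your Spark version provides RDD.toLocalIterator, which streams rows to the driver one partition at a time:

# Stream JSON rows to the driver instead of collecting them all
# at once; output still goes through a single local file handle.
with open("C:\\Users\\username\\test.json", "w") as output_file:
    for json_row in df.toJSON().toLocalIterator():
        output_file.write(json_row + "\n")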


Author: Jared (updated on January 12, 2020)

Comments

  • Jared
    Jared almost 4 years

    I have a dataframe that I am trying to save as a JSON file using pyspark 1.4, but it doesn't seem to be working. When I give it the path to the directory, it returns an error stating that it already exists. My assumption, based on the documentation, was that it would save a JSON file to the path you give it.

    df.write.json("C:\Users\username")
    

    Specifying a directory with a name doesn't produce any file and gives an error of "java.io.IOException: Mkdirs failed to create file:/C:Users/username/test/_temporary/....etc". It does, however, create a directory named test containing several sub-directories with blank crc files.

    df.write.json("C:\Users\username\test")
    

    And adding a file extension of .JSON produces the same error:

    df.write.json("C:\Users\username\test.JSON")
    
    • Brobin
      Brobin over 8 years
      I think you need to give it a complete file name, not just the directory.
    • Jared
      Jared over 8 years
      Tried that as well and updated the post. It seems like there needs to be some sort of temp directory defined, but the documentation doesn't call that out clearly.
    • Kavindu Dodanduwa
      Kavindu Dodanduwa over 8 years
      Do you have permission to write and create directories for that specific "username"?
    • Jared
      Jared over 8 years
      Yes, I verified the permissions on that directory and used getpass.getuser() from Python to confirm that I was logged in as that user via the console.
    • urug
      urug over 8 years
      Try an alternate approach such as df.toJSON().saveAsTextFile(path).
    • Jared
      Jared over 8 years
      Produces the same error as the other attempts.
    • Kavindu Dodanduwa
      Kavindu Dodanduwa over 8 years
      Did you try this in a Linux environment? Also, have you used Spark before?
    • Kavindu Dodanduwa
      Kavindu Dodanduwa over 8 years
      I too faced such a problem when using Windows, so I changed to Linux, where the same code worked perfectly.
    • Jared
      Jared over 8 years
      Thanks for giving it a try. I figured it had something to do with Windows, ughhh....
  • Jared
    Jared over 8 years
    If I use output_file.write(df.toJSON()) it produces TypeError: expected character buffer object. I'm assuming it is being passed an array, which then causes the failure, because if I use output_file.write(df.toJSON().first()) it successfully creates the JSON file with only one line in it.
  • Brobin
    Brobin over 8 years
    Great! I added the escape slashes to my answer.
  • Jared
    Jared over 8 years
    Writing df.toJSON() as an array doesn't seem to work, but if I pass it a single line it does. I'm trying to debug this more.
  • Jared
    Jared over 8 years
    Converting to a pandas dataframe works perfectly. I would probably just use a pandas dataframe the entire time, unless memory or processing issues would arise from a much larger data set.
  • Wesley Bowman
    Wesley Bowman over 8 years
    Yeah, I use DataFrames as often as I can. If memory becomes a problem, take a look at Dask.