PySpark: spit out single file when writing instead of multiple part files


Well, the answer to your exact question is the coalesce function. But, as already mentioned, it is not efficient at all, since it forces one worker to fetch all the data and write it sequentially.

df.coalesce(1).write.format('json').save('myfile.json')

P.S. By the way, the resulting file is not a valid JSON document. It is a file with one JSON object per line (the "JSON Lines" format).
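To illustrate the caveat above, here is a minimal stdlib-only sketch (no Spark required, and the file contents are made-up sample data): `json.load()` on the whole file would fail, so a line-delimited output has to be parsed line by line.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    # Simulate Spark's output: one JSON object per line, not a JSON array.
    part = Path(tmp) / "part-00000"
    part.write_text('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')

    # json.load(part.open()) would raise an error here; instead,
    # parse each line as its own JSON document.
    records = [json.loads(line) for line in part.read_text().splitlines()]

print(records)
```

The same line-by-line approach is what `spark.read.json` does when reading such a directory back.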

Author: mar tin

Updated on July 25, 2022

Comments

  • mar tin
    mar tin almost 2 years

    Is there a way to prevent PySpark from creating several small files when writing a DataFrame to a JSON file?

    If I run:

     df.write.format('json').save('myfile.json')
    

    or

    df1.write.json('myfile.json')
    

    it creates a folder named myfile, and within it I find several small files named part-***, the HDFS way. Is there any way to have it spit out a single file instead?

  • the.malkolm
    the.malkolm about 8 years
    df.coalesce(1).write.json('myfile.json') works fine
  • mar tin
    mar tin about 8 years
    Regarding the JSON not being valid: this happens in any case, even when spitting out several files.
  • the.malkolm
    the.malkolm about 8 years
    @martina, yep. It is confusing sometimes to see .json extension and no valid json file inside :D
  • Zahiduzzaman
    Zahiduzzaman about 7 years
    Remember, you do need to concatenate the part files after they are produced. spark.apache.org/docs/latest/api/python/…
  • Matěj Račinský
    Matěj Račinský over 4 years
    For me, this line created a directory named myfile.json with one part file inside (using Spark 2.4).
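The concatenation step mentioned in the comments can be done with plain Python when the output directory sits on a local filesystem (on HDFS you would use `hdfs dfs -getmerge` instead). A minimal sketch, with the helper name and the fake part files being my own illustration rather than anything from Spark's API:

```python
import glob
import os
import shutil
from tempfile import TemporaryDirectory

def merge_part_files(output_dir: str, merged_path: str) -> None:
    """Concatenate Spark-style part-* files into one file, in sorted order."""
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

# Demo with fake part files standing in for Spark's output directory.
with TemporaryDirectory() as tmp:
    for i, line in enumerate(['{"a": 1}\n', '{"a": 2}\n']):
        with open(os.path.join(tmp, f"part-{i:05d}"), "w") as f:
            f.write(line)
    merged = os.path.join(tmp, "merged.json")
    merge_part_files(tmp, merged)
    content = open(merged).read()

print(content)
```

Sorting the part files keeps the row order Spark wrote them in; concatenation is valid here precisely because each part file is already line-delimited JSON.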