PySpark: spit out single file when writing instead of multiple part files
Well, the answer to your exact question is the coalesce function. But, as already mentioned, it is not efficient at all, as it forces a single worker to fetch all the data and write it sequentially.
df.coalesce(1).write.format('json').save('myfile.json')
P.S. By the way, the result is not a valid JSON file; it is a file with one JSON object per line (the JSON Lines format).
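Even with coalesce(1), Spark still writes a *directory* (here myfile.json) containing a single part-* file plus a _SUCCESS marker. A minimal stdlib sketch for pulling that part file out and giving it a stable name, assuming the output landed on a local filesystem rather than HDFS (the helper name promote_single_part is mine, not a Spark API):

```python
import glob
import os
import shutil

def promote_single_part(output_dir, target_path):
    """Move the lone part-* file out of a Spark output directory.

    Assumes the directory was written with coalesce(1), so exactly one
    part file exists. Deletes the directory (and its _SUCCESS marker)
    once the file has been moved.
    """
    parts = glob.glob(os.path.join(output_dir, "part-*"))
    if len(parts) != 1:
        raise RuntimeError("expected exactly one part file, found %d" % len(parts))
    shutil.move(parts[0], target_path)
    shutil.rmtree(output_dir)  # discard _SUCCESS and the now-empty directory
    return target_path
```

On HDFS you would instead use the Hadoop shell, e.g. `hdfs dfs -getmerge`, since the local shutil calls do not apply there.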
Author: mar tin
Updated on July 25, 2022

Comments:
- mar tin almost 2 years: Is there a way to prevent PySpark from creating several small files when writing a DataFrame to a JSON file? If I run:
  df.write.format('json').save('myfile.json')
  or
  df1.write.json('myfile.json')
  it creates a folder named myfile, and within it I find several small files named part-***, the HDFS way. Is it by any means possible to have it spit out a single file instead?
- the.malkolm about 8 years: df.coalesce(1).write.json('myfile.json') works fine
- mar tin about 8 years: About the non-validity of the JSON, this happens in any case, even when spitting out several files.
- the.malkolm about 8 years: @martina, yep. It is sometimes confusing to see a .json extension and no valid JSON file inside :D
- Zahiduzzaman about 7 years: Remember you do need to concat the part files after they are produced. spark.apache.org/docs/latest/api/python/…
- Matěj Račinský over 4 years: For me, this line created a directory named myfile.json with one part file inside (using Spark 2.4).
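As Zahiduzzaman's comment notes, if you keep the parallel write (no coalesce) you can still end up with one file by concatenating the part files afterwards. A stdlib sketch, assuming local part files in JSON Lines format (the helper name merge_part_files is mine; on HDFS the equivalent is `hdfs dfs -getmerge <dir> <file>`):

```python
import glob
import os

def merge_part_files(output_dir, target_path):
    """Concatenate all part-* files (sorted for determinism) into one file.

    Works because Spark's JSON output is one object per line, so simple
    concatenation yields a valid JSON Lines file. A trailing newline is
    added to any part that lacks one, so records never run together.
    """
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(target_path, "w") as out:
        for part in parts:
            with open(part) as f:
                for line in f:
                    out.write(line if line.endswith("\n") else line + "\n")
    return target_path
```

Note the merged result is still JSON Lines, not a single JSON document, matching the P.S. in the answer above.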