How to save a spark dataframe to csv on HDFS?

You could try to change ".save" to ".csv":

df.coalesce(1).write.mode('overwrite').option('header','true').csv('hdfs://path/df.csv')
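Note that the built-in `.csv(...)` writer only exists in Spark 2.0 and later. On Spark 1.6.1 (the asker's version), CSV support comes from the external spark-csv package, which must be loaded when the job is launched — forgetting this is the usual cause of "Failed to find data source". A sketch (the package coordinates assume a Scala 2.10 build of Spark 1.6):

```shell
# Spark 1.6.x has no built-in CSV data source; pull in the
# com.databricks spark-csv package at launch time.
spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 your_script.py

# The same flag works for an interactive session:
pyspark --packages com.databricks:spark-csv_2.10:1.5.0
```

With the package loaded, the `format('com.databricks.spark.csv')` attempts from the question should resolve.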
Author: Leah210

Updated on June 18, 2022

Comments

  • Leah210
    Leah210 almost 2 years

    Spark version: 1.6.1, I use pyspark API.

    DataFrame: df, which has two columns.

    I have tried:

    1: df.write.format('csv').save("hdfs://path/bdt_sum_vol.csv")
    2: df.write.save('hdfs://path/bdt_sum_vol.csv', format='csv', mode='append')
    3: df.coalesce(1).write.format('com.databricks.spark.csv').options(header='true').save('hdfs://path/')
    4: df.write.format('com.databricks.spark.csv').save('hdfs://path/df.csv')
    
    (None of the above worked; each failed with "Failed to find data source".)
    

    or:

    def toCSVLine(data):
        return ','.join(str(d) for d in data)
    
    lines = df.rdd.map(toCSVLine)
    lines.saveAsTextFile('hdfs://path/df.csv')  
    
    (This failed with "Permission denied".)
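As an aside, joining fields with plain `str(...)` produces broken CSV whenever a value contains a comma, quote, or newline. A formatter built on Python's csv module (a sketch in plain Python; the function name is hypothetical) quotes such fields correctly and could replace toCSVLine:

```python
import csv
import io

def to_csv_line(row):
    """Format one row as a single CSV line, quoting any field
    that contains a comma, quote, or newline."""
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    # csv.writer appends a line terminator; strip it since
    # saveAsTextFile adds its own newlines.
    return buf.getvalue().rstrip('\r\n')

# A field containing a comma is quoted, unlike ','.join(...):
print(to_csv_line(['a,b', 'c']))  # → "a,b",c
```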
    

    Q:

    1. How do I solve the "Failed to find data source" error?

    2. I used sudo to create the directory "/path" on HDFS. If I convert the DataFrame to an RDD, how do I write the RDD to a CSV file on HDFS?

    Thanks a lot!
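Regarding question 2: creating "/path" with sudo typically leaves it owned by another user, so the Spark job cannot write into it. Fixing the ownership or permissions with the HDFS shell usually resolves the "Permission denied" error (the username below is hypothetical — use the account that runs the Spark job):

```shell
# Hand the directory to the user running the Spark job:
hdfs dfs -chown -R sparkuser:sparkuser /path

# Or, alternatively, loosen the permissions instead:
hdfs dfs -chmod -R 775 /path
```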