How to save a spark dataframe to csv on HDFS?

You could try to change ".save" to ".csv":

df.coalesce(1).write.mode('overwrite').option('header','true').csv('hdfs://path/df.csv')
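Note that the built-in `.csv(...)` writer only exists in Spark 2.0 and later. On Spark 1.6.1 (the asker's version), CSV support comes from the external spark-csv package, which must be loaded when the job is launched — forgetting this is the usual cause of "Failed to find data source". A sketch (the package coordinates assume a Scala 2.10 build of Spark 1.6):

```shell
# Spark 1.6.x has no built-in CSV data source; pull in the
# com.databricks spark-csv package at launch time.
spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 your_script.py

# The same flag works for an interactive session:
pyspark --packages com.databricks:spark-csv_2.10:1.5.0
```

With the package loaded, the `format('com.databricks.spark.csv')` attempts from the question should resolve.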
Author: Leah210

Updated on June 18, 2022

Comments

  • Leah210
    Leah210 almost 2 years

    Spark version: 1.6.1, I use pyspark API.

    DataFrame: df, which has two columns.

    I have tried:

    1: df.write.format('csv').save("hdfs://path/bdt_sum_vol.csv")
    2: df.write.save('hdfs://path/bdt_sum_vol.csv', format='csv', mode='append')
    3: df.coalesce(1).write.format('com.databricks.spark.csv').options(header='true').save('hdfs://path/')
    4: df.write.format('com.databricks.spark.csv').save('hdfs://path/df.csv')
    
    (None of the above worked; each failed with "Failed to find data source".)
    

    or:

    def toCSVLine(data):
        return ','.join(str(d) for d in data)
    
    lines = df.rdd.map(toCSVLine)
    lines.saveAsTextFile('hdfs://path/df.csv')  
    
    (This failed with "Permission denied".)
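As an aside, joining fields with plain `str(...)` produces broken CSV whenever a value contains a comma, quote, or newline. A formatter built on Python's csv module (a sketch in plain Python; the function name is hypothetical) quotes such fields correctly and could replace toCSVLine:

```python
import csv
import io

def to_csv_line(row):
    """Format one row as a single CSV line, quoting any field
    that contains a comma, quote, or newline."""
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    # csv.writer appends a line terminator; strip it since
    # saveAsTextFile adds its own newlines.
    return buf.getvalue().rstrip('\r\n')

# A field containing a comma is quoted, unlike ','.join(...):
print(to_csv_line(['a,b', 'c']))  # → "a,b",c
```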
    

    Q:

    1. How do I solve the "Failed to find data source" error?

    2. I used sudo to create the directory "/path" on HDFS. If I convert the DataFrame to an RDD, how do I write the RDD to a CSV file on HDFS?

    Thanks a lot!
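Regarding question 2: creating "/path" with sudo typically leaves it owned by another user, so the Spark job cannot write into it. Fixing the ownership or permissions with the HDFS shell usually resolves the "Permission denied" error (the username below is hypothetical — use the account that runs the Spark job):

```shell
# Hand the directory to the user running the Spark job:
hdfs dfs -chown -R sparkuser:sparkuser /path

# Or, alternatively, loosen the permissions instead:
hdfs dfs -chmod -R 775 /path
```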