How to write a PySpark DataFrame to HDFS and then read it back into a DataFrame?

  • Writing a DataFrame to HDFS (Spark 1.6):

    df.write.save('/target/path/', format='parquet', mode='append')  # df is an existing DataFrame object
    

Some of the format options are csv, parquet, json, etc.
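For example, a minimal sketch (assuming the same existing DataFrame df and an illustrative target path) that switches the built-in format:

    # Same write call, but producing JSON instead of Parquet (path is illustrative).
    df.write.save('/target/path_json/', format='json', mode='overwrite')

    # Note: in Spark 1.6, csv output is not built in; it typically requires the
    # external spark-csv package, e.g. format='com.databricks.spark.csv'.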

  • Reading a DataFrame from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)  # sc is an existing SparkContext
    df = sqlContext.read.format('parquet').load('/path/to/file')
    

The format method takes an argument such as parquet, csv, json, etc.
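Putting the two together, a minimal round-trip sketch for the use case described in the comments below (the HDFS directory /user/hadoop/preprocessed/ and the DataFrame subset_df are hypothetical; sc is an existing SparkContext):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # Append each preprocessed subset to the same HDFS directory.
    subset_df.write.save('/user/hadoop/preprocessed/', format='parquet', mode='append')

    # Later (even from a new Spark application), read the whole directory back
    # as a single DataFrame; all appended parts are merged automatically.
    merged_df = sqlContext.read.format('parquet').load('/user/hadoop/preprocessed/')
    merged_df.count()  # sanity check: total rows across all appended subsets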


Comments

  • Ajg
    Ajg over 4 years

    I have a very big PySpark DataFrame, so I want to perform preprocessing on subsets of it and then store them to HDFS. Later I want to read all of them back and merge them together. Thanks.

  • Ajg
    Ajg almost 7 years
    Hey, I get AttributeError: 'DataFrameWriter' object has no attribute 'csv'. Also, I need to read that DataFrame later, which I think will be in a new Spark session.
  • rogue-one
    rogue-one almost 7 years
    what is the version of your spark installation?
  • Ajg
    Ajg almost 7 years
    Spark version 1.6.1
  • Ajg
    Ajg almost 7 years
    Thanks a lot. I have one doubt: while reading, what if there are multiple files in that location? How do I specify which file I want to read? Thanks.
  • rogue-one
    rogue-one almost 7 years
    If you want to read only one file among many, you just have to specify the full file path. If you want to read all the files, you can use glob patterns like * in the path (see the read sketch at the end of this thread).
  • Ajg
    Ajg almost 7 years
    Thanks. Will try that.
  • Ajg
    Ajg almost 7 years
    Sorry for one more question: can you please tell me how to delete those DataFrames from HDFS afterwards?
  • rogue-one
    rogue-one almost 7 years
    To delete the data from HDFS you can use HDFS shell commands like hdfs dfs -rm -r -f <path>. You can execute this from Python using subprocess, e.g. subprocess.call(["hdfs", "dfs", "-rm", "-r", "-f", path]) (a runnable sketch follows at the end of this thread).
  • ERJAN
    ERJAN almost 4 years
    What is the target path? Where does HDFS actually live on my PC?
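
A short sketch of the single-file versus glob-pattern reads discussed in the comments above (Spark 1.6 API; the paths and part-file name are hypothetical, and sc is an existing SparkContext):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # Read exactly one file by giving its full path (the part-file name is illustrative).
    one_part = sqlContext.read.format('parquet').load('/user/hadoop/preprocessed/part-r-00000.gz.parquet')

    # Read every matching file with a glob pattern.
    all_parts = sqlContext.read.format('parquet').load('/user/hadoop/preprocessed/part-*')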
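And a sketch of the cleanup step from the last answer comment, run from Python via subprocess (the HDFS path is hypothetical):

    import subprocess

    path = '/user/hadoop/preprocessed'  # hypothetical HDFS directory to remove

    # Equivalent of: hdfs dfs -rm -r -f /user/hadoop/preprocessed
    # subprocess.call returns the command's exit code (0 on success).
    exit_code = subprocess.call(['hdfs', 'dfs', '-rm', '-r', '-f', path])
    if exit_code != 0:
        print('HDFS delete failed with exit code %d' % exit_code)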