Save a Spark RDD to the local file system using Java

saveAsTextFile accepts local file system paths (e.g. file:///tmp/magic/...). However, if you're running on a distributed cluster, each executor writes its partitions to its own local disk, so you most likely want to collect() the data back to the driver and then save it with standard file operations.
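
As a rough illustration of the "standard file operations" step, here is a minimal sketch. It assumes the RDD is a JavaRDD<String> small enough to fit in driver memory; the class name CollectedRddWriter and the helper writeLines are made up for this example and are not part of Spark. The Spark call itself appears only in a comment, so the sketch runs without a cluster.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spark): writes already-collected rows to a local file.
final class CollectedRddWriter {

    // In a Spark job the rows would come from the driver-side call
    //   List<String> rows = rdd.collect();   // assumes rdd is a JavaRDD<String>
    // which only works when the whole RDD fits in driver memory.
    static void writeLines(List<String> rows, Path target) throws IOException {
        // Files.write creates (or truncates) the file and writes one row per line.
        Files.write(target, rows, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data so the sketch runs without a cluster.
        List<String> rows = Arrays.asList("id,name", "1,alice", "2,bob");
        Path target = Files.createTempFile("rdd-output", ".csv");
        writeLines(rows, target);
        System.out.println("Wrote " + rows.size() + " lines to " + target);
    }
}
```

Because the write happens on the driver, the file lands on the machine that runs the driver program, which is where a downstream process would need to look for it.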

Author: Kanav Sharma

Updated on June 19, 2022

Comments

  • Kanav Sharma
    Kanav Sharma over 1 year

    I have an RDD that is generated using Spark. If I want to write this RDD to a CSV file, I am provided with methods like "saveAsTextFile()", which outputs a CSV file to HDFS.

    I want to write the file to my local file system so that my SSIS process can pick up the files from there and load them into the DB.

    I am currently unable to use sqoop.

    Is there some way to do that in Java, other than writing shell scripts?

    If anything needs clarifying, please let me know.

  • Kanav Sharma
    Kanav Sharma over 8 years
    Okay. Passing the path with "file:///" returns successfully and writes a _SUCCESS file, but no output files can be seen. I am running it on a distributed cluster; however, I have so much data that calling collect() hits the JVM's memory limits.
  • abalcerek
    abalcerek over 8 years
    If your file is too big for one machine, it does not really make much sense to save it locally instead of to HDFS or another distributed file system.
  • Kanav Sharma
    Kanav Sharma over 8 years
    It's not the file size but the file count that is large. My process is actually designed to handle around 400 GB of data per hour. @holden I have, for now, managed to do this using FileSystem.copyToLocalFile(). I need to run it for a day to check its reliability, and then I will have more information.
  • Kanav Sharma
    Kanav Sharma over 8 years
    @holden Let me know if my current approach needs modification.
  • Holden
    Holden over 8 years
    If your data is too big for the driver, you will need to either store it in HDFS (or a similar distributed file system) or, if you still really want to store it on the driver, use toLocalIterator (but remember to cache the RDD beforehand), which only needs as much memory as the largest partition.
  • user239558
    user239558 about 8 years
    This answer is missing the code to save the data using standard file operations.
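
The toLocalIterator approach suggested in the comments could look roughly like the sketch below. It assumes a JavaRDD<String>, whose toLocalIterator() returns a java.util.Iterator<String>; the class name StreamingRddWriter and the helper writeLines are hypothetical names for this example. The Spark calls are shown only in a comment, so the sketch runs with a stand-in iterator and no cluster.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper (not part of Spark): streams rows to a local file one at a time.
final class StreamingRddWriter {

    // In a Spark job the iterator would come from
    //   rdd.cache();                                   // avoid recomputing the RDD per partition
    //   Iterator<String> rows = rdd.toLocalIterator(); // assumes rdd is a JavaRDD<String>
    // Returns the number of rows written.
    static long writeLines(Iterator<String> rows, Path target) throws IOException {
        long count = 0;
        // try-with-resources flushes and closes the writer even if a write fails.
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            while (rows.hasNext()) {
                out.write(rows.next());
                out.newLine();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in iterator so the sketch runs without a cluster.
        Iterator<String> rows = Arrays.asList("id,name", "1,alice", "2,bob").iterator();
        Path target = Files.createTempFile("rdd-stream", ".csv");
        long written = writeLines(rows, target);
        System.out.println("Wrote " + written + " lines to " + target);
    }
}
```

Rows are written as they arrive rather than being gathered into a list first, so the driver never has to hold the full data set in memory.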