Write to a file in S3 using Spark on EMR

Try doing this:

rdd.coalesce(1, shuffle = true).saveAsTextFile(...)

My understanding is that shuffle = true keeps the upstream computation parallel while funneling the final write through a single task, so the output directory contains one part file instead of many. Be careful with massive datasets, though, since that last task has to write everything.

One more detail on this issue: even with coalesce(1), saveAsTextFile writes a directory at the path you give it, with the data in a single part file inside, rather than a bare file.
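If you need the output to live at one exact S3 key, a common follow-up is to rename that single part file after the job finishes, using the Hadoop FileSystem API. Below is a minimal sketch of that approach; the object name, bucket, and paths (SingleFileToS3, my-bucket, tmpDir, target) are hypothetical, so substitute your own:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}

    object SingleFileToS3 {
      def main(args: Array[String]): Unit = {
        val spark = new SparkContext(new SparkConf().setAppName("Single File To S3"))

        // Hypothetical locations -- substitute your own bucket and keys.
        val tmpDir = "s3n://my-bucket/emr-output/tmp"
        val target = "s3n://my-bucket/emr-output/test-3.txt"

        val rdd = spark.parallelize(Array("hello", "World", "!"))

        // shuffle = true keeps the upstream work parallel; only the final
        // write runs as a single task, producing one part-00000 file.
        rdd.coalesce(1, shuffle = true).saveAsTextFile(tmpDir)

        // saveAsTextFile always writes a directory, so move the lone part
        // file to the key we actually want, then clean up the directory.
        val fs = FileSystem.get(new URI(tmpDir), spark.hadoopConfiguration)
        fs.rename(new Path(tmpDir + "/part-00000"), new Path(target))
        fs.delete(new Path(tmpDir), true)

        spark.stop()
      }
    }

Keep in mind that a rename on S3 is implemented as a copy plus delete, so this final step is not free for large outputs.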

Author: Daniel Kats (Principal Researcher @ NortonLifeLock Research Group; MSc and BSc at University of Toronto; previously Yelp, IBM)

Updated on June 04, 2022

Comments

  • Daniel Kats almost 2 years

    I use the following Scala code to create a text file in S3 with Apache Spark on AWS EMR.

    def createS3OutputFile(): Unit = {
        // s3Bucket is assumed to be defined elsewhere in the enclosing object
        val conf = new SparkConf().setAppName("Spark Pi")
        val spark = new SparkContext(conf)
        // use s3n !
        val outputFileUri = s"s3n://$s3Bucket/emr-output/test-3.txt"
        val arr = Array("hello", "World", "!")
        val rdd = spark.parallelize(arr)
        rdd.saveAsTextFile(outputFileUri)
        spark.stop()
    }

    def main(args: Array[String]): Unit = {
        createS3OutputFile()
    }
    

    I create a fat JAR and upload it to S3. I then SSH into the cluster master and run the code with:

    spark-submit \
        --deploy-mode cluster \
        --class "$class_name" \
        "s3://$s3_bucket/$app_s3_key"
    

    I am seeing this in the S3 console: instead of files there are folders.

    Each folder (for example test-3.txt) contains a long list of block files.

    How do I output a simple text file to S3 as the output of my Spark job?