Write to a file in S3 using Spark on EMR

scala amazon-web-services apache-spark amazon-s3 amazon-emr

15,775

Try doing this:

rdd.coalesce(1, shuffle = true).saveAsTextFile(...)

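My understanding is that shuffle = true adds a shuffle step, so the upstream work still runs in parallel and only the final stage is collapsed to one partition. The output is then a single part file inside the output path. Do be careful with massive data sets, though, because all of the data passes through a single task.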

Here are some more details on the issue at hand.
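Even with a single partition, saveAsTextFile still writes a directory containing a part file rather than a bare file. If a plain file at an exact path is needed, one option is to write to a temporary directory and then move the lone part file with the Hadoop FileSystem API. The following is only a rough sketch under that assumption: the object name SingleFileOutput, the method saveAsSingleFile, and both path parameters are made up for illustration, and on S3 a rename is carried out as a copy followed by a delete, so it is not atomic.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SingleFileOutput {
  // Hypothetical helper: writes `rdd` as one part file under `tmpDir`,
  // then moves that part file to `target` and removes the temp directory.
  def saveAsSingleFile(sc: SparkContext, rdd: RDD[String], tmpDir: String, target: String): Unit = {
    rdd.coalesce(1, shuffle = true).saveAsTextFile(tmpDir)

    val tmpPath = new Path(tmpDir)
    // Resolve the file system (local, HDFS, S3, ...) from the path's URI scheme.
    val fs = tmpPath.getFileSystem(sc.hadoopConfiguration)

    // With one partition there should be exactly one part-* file in the directory.
    val partFile = fs.globStatus(new Path(tmpPath, "part-*")).head.getPath
    fs.rename(partFile, new Path(target))

    // Remove the temp directory and whatever else is left in it.
    fs.delete(tmpPath, true)
  }
}
```

Usage would look like SingleFileOutput.saveAsSingleFile(spark, rdd, tmpUri, outputFileUri), where tmpUri is any scratch location on the same file system as the target.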


Authored by

Daniel Kats

Principal Researcher @ NortonLifeLock Research Group. MSc and BSc at University of Toronto. Previously Yelp, IBM.

Updated on June 04, 2022

Comments

Daniel Kats almost 2 years

I use the following Scala code to create a text file in S3, with Apache Spark on AWS EMR.

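import org.apache.spark.{SparkConf, SparkContext}

// s3Bucket is assumed to be defined elsewhere in the enclosing object.
def createS3OutputFile(): Unit = {
  val conf = new SparkConf().setAppName("Spark Pi")
  val spark = new SparkContext(conf)
  // use s3n !
  val outputFileUri = s"s3n://$s3Bucket/emr-output/test-3.txt"
  val arr = Array("hello", "World", "!")
  val rdd = spark.parallelize(arr)
  // Spark treats this URI as an output directory, not as a single file.
  rdd.saveAsTextFile(outputFileUri)
  spark.stop()
}

def main(args: Array[String]): Unit = {
  createS3OutputFile()
}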

I create a fat JAR and upload it to S3. I then SSH into the cluster master and run the code with:

spark-submit \
    --deploy-mode cluster \
    --class "$class_name" \
    "s3://$s3_bucket/$app_s3_key"

In the S3 console I see folders where I expected files. Each folder (for example test-3.txt) contains a long list of block files.

How do I output a simple text file to S3 as the output of my Spark job?
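For context on the folders described above: saveAsTextFile writes one part file per RDD partition into a directory named after the given path, so the number of block files tracks the partition count. A small sketch for checking that count before and after coalescing follows. The object name PartitionCount and the local master fallback are assumptions for illustration, not part of the original job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCount {
  // Returns (partitions as created, partitions after coalescing to one).
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val rdd = sc.parallelize(Array("hello", "World", "!"))
    (rdd.getNumPartitions, rdd.coalesce(1, shuffle = true).getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Fall back to a local master only when spark-submit has not set one.
    val conf = new SparkConf().setAppName("partition-count").setIfMissing("spark.master", "local[4]")
    val sc = new SparkContext(conf)
    val (before, after) = partitionCounts(sc)
    println(s"partitions before coalesce: $before, after coalesce: $after")
    sc.stop()
  }
}
```

The first number depends on the cluster's default parallelism. The second is the number of part files that saveAsTextFile would write for the coalesced RDD.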

