Spark read file from S3 using sc.textFile("s3n://...")


Solution 1

Confirmed that this is related to the Spark build against Hadoop 2.6.0. I just installed Spark 1.4.0 "Pre-built for Hadoop 2.4 and later" (instead of the Hadoop 2.6 build), and the code now works OK.

sc.textFile("s3n://bucketname/Filename") now raises another error:

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

The code below uses the s3n URL format with the credentials embedded in the URL to show that Spark can read an S3 file. It was run on a dev machine (no Hadoop libs).

scala> val lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey@zpub01/SafeAndSound_Lyrics.txt")
lyrics: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21

scala> lyrics.count
res1: Long = 9

Even better: the code above, with AWS credentials inline in the s3n URI, will break if the AWS secret key contains a forward slash ("/"). Configuring the AWS credentials on the SparkContext fixes that, and the code works whether the S3 file is public or private.

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "BLABLA")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "....") // can contain "/"
val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
myRDD.count

Solution 2

Although this question already has an accepted answer, I think the exact details of why this happens are still missing, so there may be room for one more answer.

If you add the required hadoop-aws dependency, your code should work.

Starting with Hadoop 2.6.0, the S3 filesystem connector was moved to a separate library called hadoop-aws. There is also a Jira for that: Move s3-related FS connector code to hadoop-aws.

This means that any version of Spark built against Hadoop 2.6.0 or newer needs an additional external dependency to be able to connect to the S3 filesystem.
Here is an sbt dependency that I tried and that works as expected with Apache Spark 1.6.2 built against Hadoop 2.6.0:

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"

In my case I ran into some dependency issues, which I resolved by adding exclusions:

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0" exclude("tomcat", "jasper-compiler") excludeAll ExclusionRule(organization = "javax.servlet")

On a related note: I have yet to try it myself, but starting with Hadoop 2.6.0 it is recommended to use the "s3a" rather than the "s3n" filesystem; a minimal sketch follows the quote below.

The third generation, s3a: filesystem. Designed to be a switch in replacement for s3n:, this filesystem binding supports larger files and promises higher performance.
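
For completeness, here is a minimal sketch of the s3a equivalent of the s3n snippets above. I have not verified it myself; it assumes the hadoop-aws jar (and a matching AWS SDK jar) is on the classpath, and the bucket and path are placeholders. The fs.s3a.* properties take the place of the fs.s3n.* ones:

// Minimal s3a sketch (untested; bucket/path are placeholders)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") // often optional; newer Hadoop versions register s3a by default
sc.hadoopConfiguration.set("fs.s3a.access.key", "MyAccessKeyID")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "MySecretKey")
val linesS3a = sc.textFile("s3a://myBucket/MyFilePattern")
linesS3a.count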

Solution 3

You can pass the --packages parameter with the appropriate packages to your submission:

bin/spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 code.py

Solution 4

I had to copy the jar files from a Hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag for spark-submit didn't work. A quick sanity check from spark-shell follows the list below.

Details:

  • Spark 2.3.0
  • The Hadoop download was 2.7.6
  • The two jar files copied from (hadoop dir)/share/hadoop/tools/lib/ were:
    • aws-java-sdk-1.7.4.jar
    • hadoop-aws-2.7.6.jar
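
Once the jars are in place, a quick sanity check from spark-shell is to confirm the relevant classes load and then read a file. This is only a sketch; the bucket and path are placeholders:

Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")   // provided by hadoop-aws-2.7.6.jar
Class.forName("com.amazonaws.services.s3.AmazonS3Client") // provided by aws-java-sdk-1.7.4.jar
sc.textFile("s3a://myBucket/myFile1.log").count           // should no longer fail with "No FileSystem for scheme"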

Solution 5

This is sample Spark code that can read files present on S3:

// Register the native S3 filesystem for the s3:// scheme, supply credentials,
// then read the input as a text RDD.
val hadoopConf = sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", s3Key)
hadoopConf.set("fs.s3.awsSecretAccessKey", s3Secret)
val jobInput = sparkContext.textFile("s3://" + s3_location)

Comments

  • Polymerase, about 4 years ago

    Trying to read a file located in S3 using spark-shell:

    scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
    myRdd: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12
    
    scala> myRdd.count
    java.io.IOException: No FileSystem for scheme: s3n
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
        ... etc ...
    

    The IOException: No FileSystem for scheme: s3n error occurred with:

    • Spark 1.3.1 or 1.4.0 on a dev machine (no Hadoop libs)
    • Running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.6.0), which integrates Spark 1.2.1 out of the box
    • Using the s3:// or s3n:// scheme

    What is the cause of this error? A missing dependency, missing configuration, or misuse of sc.textFile()?

    Or maybe this is due to a bug that affects the Spark build specific to Hadoop 2.6.0, as this post seems to suggest. I am going to try the Spark build for Hadoop 2.4.0 to see if this solves the issue.

  • Tristan Reid, over 8 years ago
    You add the path of the jar to spark-shell with the --jars parameter, comma-separated. You'll also want to add the aws-java-sdk-*.jar.
  • Priyank Desai, almost 8 years ago
    Spark 1.6.0 with Hadoop 2.4 worked for me. Spark 1.6.0 with Hadoop 2.6 didn't.
  • timss, over 7 years ago
    @PriyankDesai For others with the same problem, see issues.apache.org/jira/browse/SPARK-7442 and the links in the comment section.
  • Sergey Bahchissaraitsev, over 7 years ago
    See my answer below for the reason why it did not work with the Hadoop 2.6 version.
  • jcomeau_ictx, over 7 years ago
    Looked promising, but I get a failed download for file:/home/jcomeau/.m2/repository/asm/asm/3.2/asm-3.2.jar when I do this with: spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.11.83,org.apache.hadoop:hadoop-aws:2.7.3 merge.py. Any ideas?
  • Kieleth, about 7 years ago
    I ran into this problem too with Spark 2.1.0; adding the latest aws requirement (spark.jars.packages org.apache.hadoop:hadoop-aws:2.7.3) to spark-defaults.conf did the trick.
  • Shrikant Prabhu, over 6 years ago
    Adding the following to my SparkContext solved my problem: sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") and sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem").
  • Shrikant Prabhu, over 6 years ago
    Note that you shouldn't check code containing your secret key and access key into your code repository. The ideal way is to let your cluster environment assume an IAM role that has access to S3. I removed the access and secret key code from my program but forgot to remove the following piece of code when running on Amazon EMR: sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem"), and the program started to fail again with the above error.
  • Uri Goren, over 6 years ago
    In Python: AttributeError: 'SparkContext' object has no attribute 'hadoopConfiguration'
  • vaer-k, over 5 years ago
    @UriGoren In Python, the Hadoop configuration can be accessed through the Java implementation: sc._jsc.hadoopConfiguration()
  • Z.Wei, over 4 years ago
    Is there any update as of today (Aug 2019)? Thanks.
  • stevel, over 4 years ago
    Yes. The spark-hadoop-cloud dependency pulls in what you need. It's not included in ASF releases, though: mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud