Accessing S3 from Spark 2.0

Solution 1

Actually, all Spark operations run on the workers, but you are setting this configuration on the master. You could try applying the S3 configuration inside mapPartitions { } so it is set on the executors themselves, as in the sketch below.
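
A rough sketch of that idea, reusing the s3n settings from Solution 2 below. The bucket name, the RDD of object keys, and the accessKey/secretKey values are placeholders, not something from the original post:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// rdd is assumed to be an RDD[String] of S3 object keys (hypothetical)
val contents = rdd.mapPartitions { keys =>
  // Build the S3 configuration on the executor itself instead of the driver
  val conf = new Configuration()
  conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  conf.set("fs.s3n.awsAccessKeyId", accessKey)     // captured from the driver's closure
  conf.set("fs.s3n.awsSecretAccessKey", secretKey)
  val fs = FileSystem.get(new URI("s3n://my-bucket"), conf)

  keys.map { key =>
    val in = fs.open(new Path(s"s3n://my-bucket/$key"))
    try scala.io.Source.fromInputStream(in).mkString
    finally in.close()
  }
}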

Solution 2

If you want to use s3n:

sc.hadoopConfiguration.set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)

Now, regarding the exception: you need to make sure both JARs are on the driver and worker classpaths, and, if you're running in client mode, distribute them to the worker nodes via the --jars flag:

spark-submit \
--conf "spark.driver.extraClassPath=/location/to/aws-java-sdk.jar:/location/to/hadoop-aws.jar" \
--jars /location/to/aws-java-sdk.jar,/location/to/hadoop-aws.jar \

Also, if you're building your uber JAR with aws-java-sdk and hadoop-aws included, there's no reason to also use the --packages flag.
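
For instance, a submit with the assembled JAR could look like this (the main class is a placeholder, and the JAR path assumes the default sbt-assembly output for the build file in the question):

# the AWS classes ship inside the assembly, so no --packages / --jars needed for them
spark-submit \
--class com.example.Main \
target/scala-2.11/test-assembly-0.2.0.jar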

Author: elldekaa (World Traveller, Open-Source Addict, Researcher in Parallel Computing)

Updated on July 03, 2022

Comments

  • elldekaa, almost 2 years ago:

    I'm trying to access an S3 file from a SparkSQL job. I already tried solutions from several posts, but nothing seems to work, maybe because my EC2 cluster runs the new Spark 2.0 for Hadoop 2.7.

    I set up the Hadoop configuration this way:

    sc.hadoopConfiguration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", accessKey)
    sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", secretKey)
    

    I build an uber-jar with sbt assembly using:

    name := "test"
    version := "0.2.0"
    scalaVersion := "2.11.8"
    
    libraryDependencies += "com.amazonaws" % "aws-java-sdk" %   "1.7.4"
    libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3" excludeAll(
        ExclusionRule("com.amazonaws", "aws-java-sdk"),
        ExclusionRule("commons-beanutils")
    )
    
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"
    

    When I submit my job to the cluster, I always get the following error:

    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 172.31.7.246): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1726)
        at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:662)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:446)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:476)

    It seems that the driver is able to read from S3 without a problem, but not the workers/executors. I do not understand why my uber-jar is not sufficient.

    I also tried, without success, configuring spark-submit with:

    --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3

    PS: If I switch to the s3n protocol, I get the following exception:

    java.io.IOException: No FileSystem for scheme: s3n