Unable to load AWS credentials from any provider in the chain - error - when trying to load model from S3

amazon-web-services apache-spark amazon-s3 pyspark apache-spark-mllib

16,615

Solution 1

You are using the wrong property names for the s3a connector.

see https://hadoop.apache.org/docs/current3/hadoop-aws/tools/hadoop-aws/#Authentication_properties

Specifically:

fs.s3a.access.key your access key
fs.s3a.secret.key your secret key

Note in particular

it's lower case
there are dots/periods between access and key, secret and key

The mixedCaseOptions are from the s3n connector which is obsolete and has long been deleted from the hadoop codebase. the s3a connector will simply ignore them

Solution 2

The AWS Java SDK has a credential resolution logic/chain to properly resolve the AWS credentials to use when interfacing with AWS services.

See http://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html

This error means the SDK could not find credentials in any of the places the SDK looks at. Make sure the credentials exist in at least one of the places mentioned in the above link.

As a starting point, populate environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The AWS SDK for Java uses the EnvironmentVariableCredentialsProvider class to load these credentials.

16,615

yguw

Updated on June 04, 2022

Comments

yguw almost 2 years

I have an MLLib model saved in a folder on S3, say bucket-name/test-model. Now, I have a spark cluster (let's say on a single machine for now). I am running the following commands to load the model:

pyspark --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3

Then,

sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.awsAccessKeyId", AWS_ACCESS_KEY)
hadoopConf.set("fs.s3a.awsSecretAccessKey", AWS_SECRET_KEY)
hadoopConf.set("fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")
hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
m1 = RandomForestClassificationModel.load('s3a://test-bucket/test-model')

and I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.local/lib/python3.6/site-packages/pyspark/ml/util.py", line 362, in load
    return cls.read().load(path)
  File "/home/user/.local/lib/python3.6/site-packages/pyspark/ml/util.py", line 300, in load
    java_obj = self._jread.load(path)
  File "/home/user/.local/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/user/.local/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/user/.local/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.load.
: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
    at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1343)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1337)
    at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1378)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.first(RDD.scala:1377)
    at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:615)
    at org.apache.spark.ml.tree.EnsembleModelReadWrite$.loadImpl(treeModels.scala:427)
    at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:316)
    at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:306)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Honestly, these lines of code are taken from the web and I have no idea about storing and loading MLLib models on to S3. Any help here will be appreciated and also the next step for me is to do the same on a cluster of machines. So any heads up will also be appreciated.

Harsh Bafna over 4 years

try adding export AWS_ACCESS_KEY_ID="XXX" export AWS_SECRET_ACCESS_KEY="YYYYY" in spark-env.sh

Gerardo Figueroa almost 4 years

And what are the correct property names for the s3a connector?
stevel almost 4 years

they're the ones in the documentation I linked to. copying them here only ensures that stack overflow posts end up full of out of date facts
stevel almost 2 years

no, this doesn't work. you are setting the wrong key names. it only appears to work because spark launcher picks up those same env vars and uses them correctly. please read the hadoop s3a docs. thanks