PySpark AWS credentials
Solution 1
For PySpark, you can set the credentials as shown below:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET_KEY)
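Note that the s3n connector is deprecated (it was removed entirely in Hadoop 3.x). If your cluster has the s3a connector, the equivalent s3a properties, which the solutions below also use, would be set the same way:
# s3a equivalents of the s3n properties above (same placeholder variables)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY)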
Solution 2
Setting spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in spark-defaults.conf before establishing a Spark session is a nice way to do it.
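As a sketch, assuming a standard layout where the file lives at $SPARK_HOME/conf/spark-defaults.conf, the entries are whitespace-separated key/value pairs (the placeholder values are yours to fill in):
# $SPARK_HOME/conf/spark-defaults.conf
spark.hadoop.fs.s3a.access.key   <your-access-key-id>
spark.hadoop.fs.s3a.secret.key   <your-secret-access-key>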
But I also had success with Spark 2.3.2 and a PySpark shell, setting these dynamically from within a Spark session as follows:
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)
You can then read from and write to S3 using s3a:
documents = spark.sparkContext.textFile('s3a://bucket_name/key')
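The same configuration carries over to the DataFrame API as well; for example (bucket_name and the key paths are placeholders, as above):
# read a CSV from S3 and write results back as Parquet
df = spark.read.csv('s3a://bucket_name/key', header=True)
df.write.parquet('s3a://bucket_name/output/')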
Solution 3
I'm not sure if this was true at the time, but as of PySpark 2.4.5 you don't need to access the private _jsc object to set Hadoop properties; you can set them using SparkConf.set(). For example:
import pyspark

conf = (
    pyspark.SparkConf()
    .setAppName('app_name')
    .setMaster(SPARK_MASTER)
    .set('spark.hadoop.fs.s3a.access.key', AWS_ACCESS_KEY)
    .set('spark.hadoop.fs.s3a.secret.key', AWS_SECRET_KEY)
)
sc = pyspark.SparkContext(conf=conf)
See https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration
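The same spark.hadoop.* approach also works with the newer SparkSession entry point, if that is how you build your application. A minimal sketch, assuming the same placeholder credential variables as above:
from pyspark.sql import SparkSession

# spark.hadoop.* properties are copied into the underlying Hadoop configuration
spark = (
    SparkSession.builder
    .appName('app_name')
    .config('spark.hadoop.fs.s3a.access.key', AWS_ACCESS_KEY)
    .config('spark.hadoop.fs.s3a.secret.key', AWS_SECRET_KEY)
    .getOrCreate()
)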
Roee N
Updated on July 22, 2022

Comments
-
Roee N almost 2 years
I'm trying to run a PySpark script that works fine when I run it on my local machine. The issue is that I want to fetch the input files from S3.
No matter what I try, though, I can't seem to find where to set the ID and secret. I found some answers that cover specific files, e.g. Locally reading S3 files through Spark (or better: pyspark), but I want to set the credentials for the whole SparkContext, since I reuse the SQL context all over my code.
So the question is: how do I set the AWS access key and secret in Spark?
P.S. I tried the $SPARK_HOME/conf/hdfs-site.xml and environment variable options; neither worked.
Thank you
-
Roee N over 6 years: Thanks, this seems to do it.
-
Roee N over 6 years: I was looking for the answer in PySpark... thanks though, it looks like a correct answer.
-
AlexM over 6 years: Sorry about that; I'm so used to working in Java that I completely forgot you asked for PySpark :)
-
Roee N over 6 years: Just for future people looking for this, keep in mind that sc is the SparkContext: sc = SparkContext.getOrCreate(conf)
-
Lucian Thorr about 5 years: The keys "fs.s3a.access.key" and "fs.s3a.secret.key" worked for me where using "fs.s3a.awsAccessKeyId" and "fs.s3a.awsSecretAccessKey" did not.