Running EMR Spark With Multiple S3 Accounts


Solution 1

The solution is actually quite simple.

Firstly, EMR clusters have two roles:

- A service role used by the Amazon EMR service itself (EMR_DefaultRole by default)
- A role assigned to the EC2 instances in the cluster (EMR_EC2_DefaultRole by default)

These roles are explained in: Default IAM Roles for Amazon EMR

Therefore, each EC2 instance launched in the cluster is assigned the EMR_EC2_DefaultRole role, which makes temporary credentials available via the Instance Metadata service. (For an explanation of how this works, see: IAM Roles for Amazon EC2.) Amazon EMR nodes use these credentials to access AWS services such as S3, SNS, SQS, CloudWatch and DynamoDB.
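
To see this in action, here is a small Scala snippet (assuming IMDSv1 is enabled; IMDSv2 additionally requires a session-token header) that reads the temporary credentials straight from the Instance Metadata service on a cluster node:

import scala.io.Source

// Name of the instance-profile role attached to this node (e.g. EMR_EC2_DefaultRole).
val role = Source.fromURL("http://169.254.169.254/latest/meta-data/iam/security-credentials/").mkString.trim
// Temporary credentials as JSON: AccessKeyId, SecretAccessKey, Token, Expiration.
val creds = Source.fromURL(s"http://169.254.169.254/latest/meta-data/iam/security-credentials/$role").mkString
println(s"Role: $role")
println(creds)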

Secondly, you will need to add permissions to the Amazon S3 bucket in the other account to permit access via the EMR_EC2_DefaultRole role. This can be done by adding a bucket policy to the S3 bucket (here named other-account-bucket) like this:

{
    "Id": "Policy1",
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1",
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::other-account-bucket",
                "arn:aws:s3:::other-account-bucket/*"
            ],
            "Principal": {
                "AWS": [
                    "arn:aws:iam::ACCOUNT-NUMBER:role/EMR_EC2_DefaultRole"
                ]
            }
        }
    ]
}

This policy grants all S3 permissions (s3:*) to the EMR_EC2_DefaultRole role belonging to the account matching the ACCOUNT-NUMBER in the policy, which should be the account in which the EMR cluster was launched. Be careful when granting such broad permissions -- you might want to grant only s3:GetObject rather than all S3 permissions.

That's all! The bucket in the other account will now accept requests from the EMR nodes because they are using the EMR_EC2_DefaultRole role.
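
For illustration, here is a minimal Spark (Scala) sketch of such a job; the bucket names and paths are placeholders. The point is that the code needs no extra credential handling, because the EMRFS/S3A connector picks up the instance-profile credentials of EMR_EC2_DefaultRole automatically:

import org.apache.spark.sql.SparkSession

object CrossAccountCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cross-account-copy").getOrCreate()
    // Read from a bucket owned by the cluster's own account.
    val df = spark.read.parquet("s3://my-account-bucket/input/")
    // Write to the bucket in the other account, permitted by its bucket policy.
    df.write.mode("overwrite").parquet("s3://other-account-bucket/output/")
    spark.stop()
  }
}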

Disclaimer: I tested the above by creating a bucket in Account-A and assigning permissions (as shown above) to a role in Account-B. An EC2 instance was launched in Account-B with that role. I was able to access the bucket from the EC2 instance via the AWS Command-Line Interface (CLI). I did not test it within EMR; however, it should work the same way.

Solution 2

With Spark you can also use AssumeRole to access an S3 bucket in another account, this time via an IAM role that lives in that other account. This makes it easier for the other account's owner to manage the permissions granted to the Spark job. Managing access via S3 bucket policies can be a pain, because access rights end up spread across multiple locations rather than being contained within a single IAM role.

Here is the hadoopConfiguration:

"fs.s3a.credentialsType" -> "AssumeRole",
"fs.s3a.stsAssumeRole.arn" -> "arn:aws:iam::<<AWSAccount>>:role/<<crossaccount-role>>",
"fs.s3a.impl" -> "com.databricks.s3a.S3AFileSystem",
"spark.hadoop.fs.s3a.server-side-encryption-algorithm" -> "aws:kms",
"spark.hadoop.fs.s3a.server-side-encryption-kms-master-key-id" -> "arn:aws:kms:ap-southeast-2:<<AWSAccount>>:key/<<KMS Key ID>>"

External IDs can also be used as a passphrase:

"spark.hadoop.fs.s3a.stsAssumeRole.externalId" -> "GUID created by other account owner"

We were using Databricks for the above; we have not tried it on EMR yet.
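
As a sketch only (the fs.s3a.credentialsType and fs.s3a.stsAssumeRole.* keys and com.databricks.s3a.S3AFileSystem are Databricks-specific, and this has not been tried on EMR), the settings above could be applied from Scala as follows. Note that the spark.hadoop. prefix is dropped when setting keys directly on the Hadoop configuration, and all ARNs and the external ID are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("assume-role-s3a").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration
// Databricks S3A assume-role settings; replace the placeholders with real values.
hadoopConf.set("fs.s3a.impl", "com.databricks.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.credentialsType", "AssumeRole")
hadoopConf.set("fs.s3a.stsAssumeRole.arn", "arn:aws:iam::<<AWSAccount>>:role/<<crossaccount-role>>")
hadoopConf.set("fs.s3a.stsAssumeRole.externalId", "<<GUID created by other account owner>>")
hadoopConf.set("fs.s3a.server-side-encryption-algorithm", "aws:kms")
hadoopConf.set("fs.s3a.server-side-encryption-kms-master-key-id", "arn:aws:kms:ap-southeast-2:<<AWSAccount>>:key/<<KMS Key ID>>")
// With these in place, s3a:// paths in the other account resolve through the assumed role.
val df = spark.read.parquet("s3a://other-account-bucket/some/prefix/")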

Solution 3

I believe you need to assign an IAM role to your compute nodes (you probably already have done this), then grant cross-account access to that role via IAM on the "Remote" account. See http://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html for the details.
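
One way to act on this from a plain (non-Databricks) Spark job, sketched here with placeholder ARNs and bucket names and not verified on EMR, is to assume the remote account's role with the AWS SDK on the driver and hand the temporary credentials to the S3A connector:

import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder
import com.amazonaws.services.securitytoken.model.AssumeRoleRequest
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cross-account-sts").getOrCreate()
// Assume the cross-account role that the "Remote" account owner created.
val sts = AWSSecurityTokenServiceClientBuilder.defaultClient()
val creds = sts.assumeRole(
  new AssumeRoleRequest()
    .withRoleArn("arn:aws:iam::REMOTE-ACCOUNT-NUMBER:role/cross-account-role")
    .withRoleSessionName("emr-spark-cross-account")
).getCredentials
// Session credentials require S3A's temporary-credentials provider.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoopConf.set("fs.s3a.access.key", creds.getAccessKeyId)
hadoopConf.set("fs.s3a.secret.key", creds.getSecretAccessKey)
hadoopConf.set("fs.s3a.session.token", creds.getSessionToken)
spark.read.parquet("s3a://remote-account-bucket/some/prefix/").show()

Keep in mind that the temporary credentials expire (one hour by default), so a long-running job would need to refresh them or use a credential provider that does so.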

Author: blakkheartt12

Over the last fourteen years, I have worked across a broad spectrum of the technology field. At Veoh, we pioneered online video and advertising. I've built and maintained four world-class software engineering teams at Verve and Active. I have worked as a go-between, serving as the integral link between company executives and the technical side of the team, and translating the company's vision and the required technological expertise into functioning, successful end products.

Updated on June 27, 2022

Comments

  • blakkheartt12, almost 2 years ago

    I have an EMR Spark Job that needs to read data from S3 on one account and write to another.
    I split my job into two steps.

    1. Read data from S3 (no credentials required because my EMR cluster is in the same account) and write it to the cluster's local HDFS.

    2. Read the data in the local HDFS created by step 1 and write it to an S3 bucket in the other account.

    I've attempted setting the hadoopConfiguration:

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<your access key>")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","<your secretkey>")
    

    And exporting the keys on the cluster:

    $ export AWS_SECRET_ACCESS_KEY=
    $ export AWS_ACCESS_KEY_ID=
    

    I've tried both cluster and client mode as well as spark-shell with no luck.

    Each of them returns an error:

    ERROR ApplicationMaster: User class threw exception: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: 
    Access Denied
    
    • John Rotenstein, over 7 years ago
      Be careful -- it looks like fs.s3n.awsAccessKeyId only applies to files accessed via s3n://bucket/file. That might not be how your system is reading from S3. See: Hadoop-AWS module: Integration with Amazon Web Services.
    • John Rotenstein, over 7 years ago
      Your statement "read data from S3 (no credentials required because my EMR cluster is in the same account)" is not quite accurate. The EMR cluster does require credentials to access Amazon S3 content; they are passed in via the role associated with the cluster nodes, and that role is what gives access to your own S3 buckets.
  • Tim Ludwinski, almost 5 years ago
    In addition to s3:GetObject permissions, you probably need to grant s3:ListBucket permissions on the bucket you wish to use. (And if your bucket is KMS-encrypted, don't forget KMS permissions.)
  • JQ., about 4 years ago
    mate, you saved me