Spark submit to YARN as another user


Solution 1

For a non-kerberized cluster: exporting HADOOP_USER_NAME=zorro before submitting the Spark job will do the trick.
Make sure to unset HADOOP_USER_NAME afterwards, if you want to revert to your default credentials in the rest of the shell script (or in your interactive shell session).
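Putting those steps together, a minimal sketch might look like this (the class and jar names are placeholders):

```shell
#!/bin/sh
# On a non-kerberized cluster, Hadoop trusts the HADOOP_USER_NAME
# environment variable, so the job is submitted as "zorro".
export HADOOP_USER_NAME=zorro

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar   # placeholder class and jar

# Revert to the default credentials for the rest of the script.
unset HADOOP_USER_NAME
```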

For a kerberized cluster, the clean way to impersonate another account without trashing your other jobs/sessions (which probably depend on your default ticket) would be something along these lines:

# Use a private, per-process ticket cache so the default one is untouched
export KRB5CCNAME=FILE:/tmp/krb5cc_$(id -u)_temp_$$
# Obtain a ticket as zorro from a keytab (no password prompt)
kinit -kt ~/.protectedDir/zorro.keytab [email protected]
spark-submit ...........
# Discard the temporary ticket cache when done
kdestroy

Solution 2

For a non-kerberized cluster you can add a Spark conf as:

--conf spark.yarn.appMasterEnv.HADOOP_USER_NAME=<user_name>
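For example, a full command might look like this (the class and jar names are placeholders); note that this conf only sets the variable in the Application Master's environment on the cluster, not for the local client process:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.HADOOP_USER_NAME=zorro \
  --class com.example.MyApp \
  my-app.jar   # placeholder class and jar
```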

Solution 3

Another (much safer) approach is to use proxy authentication - basically you create a service account and then allow it to impersonate other users.

$ spark-submit --help 2>&1 | grep proxy
  --proxy-user NAME           User to impersonate when submitting the application.

This assumes a Kerberized / secured cluster.

I mentioned it's much safer because you don't need to store (and manage) keytabs for all the users you will have to impersonate.

To enable impersonation, there are several settings you'd need to enable on the Hadoop side that tell which account(s) can impersonate which users or groups, and from which servers. Let's say you have created a svc_spark_prd service account/user.

hadoop.proxyuser.svc_spark_prd.hosts - list of fully-qualified domain names of the servers that are allowed to submit impersonated Spark applications. * is allowed but not recommended, as it permits any host.

Also specify either hadoop.proxyuser.svc_spark_prd.users or hadoop.proxyuser.svc_spark_prd.groups to list users or groups that svc_spark_prd is allowed to impersonate. * is allowed but not recommended for any user/group.
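On the Hadoop side, the corresponding core-site.xml entries might look like this (the host and group names are illustrative):

```xml
<!-- Allow svc_spark_prd to impersonate members of the spark_users
     group, but only from two named edge nodes -->
<property>
  <name>hadoop.proxyuser.svc_spark_prd.hosts</name>
  <value>edge1.mycompany.com,edge2.mycompany.com</value>
</property>
<property>
  <name>hadoop.proxyuser.svc_spark_prd.groups</name>
  <value>spark_users</value>
</property>
```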

Also, check out documentation on proxy authentication.

Apache Livy for example uses this approach to submit Spark jobs on behalf of other end users.
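A submission in this style might look like the sketch below (the keytab path, realm, end-user name, class, and jar are all illustrative):

```shell
# Authenticate as the service account from its keytab...
kinit -kt /etc/security/keytabs/svc_spark_prd.keytab svc_spark_prd@MYCOMPANY.COM
# ...then submit on behalf of the end user, who becomes the job owner.
spark-submit \
  --master yarn \
  --proxy-user alice \
  --class com.example.MyApp \
  my-app.jar   # placeholder class and jar
```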

Solution 4

If your user exists, you can still launch your spark-submit with su $my_user -c "spark-submit [...]"

I am not sure about the Kerberos keytab, but if you do a kinit as this user it should be fine.

If you can't use su because you don't want to provide the password, see this Stack Overflow answer: how to run a script as another user without a password.
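The usual password-less route is a sudoers rule restricted to spark-submit; a sketch, assuming the calling account is www-data and the target user is my_user (both the rule and the command are illustrative):

```shell
# /etc/sudoers.d/spark (illustrative rule): let www-data run
# spark-submit as my_user without a password:
#   www-data ALL=(my_user) NOPASSWD: /usr/bin/spark-submit
sudo -u my_user spark-submit --master yarn --class com.example.MyApp my-app.jar
```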

Author: Benjamin

Updated on July 18, 2022

Comments

  • Benjamin (almost 2 years)

    Is it possible to submit a Spark job to a YARN cluster and choose, either on the command line or inside the jar, which user will "own" the job?

    The spark-submit will be launched from a script containing the user.

    PS: is it still possible if the cluster has a Kerberos configuration (and the script a keytab)?

  • Benjamin (over 7 years)
    I will not be able to su to another user. The user that launches spark-submit will be something like www-data, so su will not be possible, and it will not be able to do a kinit as that requires the final user's password.
  • kulssaka (over 7 years)
    The one who launches the spark job is the owner. su -c will not change your user; it will only run the job as the user you selected. Edit: OK, I modified my post.
  • Samson Scharfrichter (about 7 years)
    Did you test that in both yarn-client and yarn-cluster modes?
  • Tagar (over 5 years)
    This answer is okay, although it assumes that someone has access to the keytabs of all users you might have to run a Spark job as... that's not always feasible and not always secure. Also, managing keytabs might be a nightmare (e.g. when a user changes a password, that keytab has to be changed, etc.). A better way is to use proxy authentication - see my answer below on this.
  • Samson Scharfrichter (over 5 years)
    This answer is okay, although it assumes that the privileged "proxy" account is not usable by anyone at any time -- otherwise you have no authentication.
  • Tagar (over 5 years)
    That's exactly right. Such special service accounts in our organization are pretty much locked down (e.g. they can't open a Unix session, etc.), and we only use them to authenticate such an impersonation service for Spark jobs in Kerberos. Your approach, by contrast, assumes that someone has access to the keytabs (or perhaps even the passwords) of all users you might have to run a Spark job as... that's not always feasible and not always secure. Also, managing keytabs might be a nightmare (e.g. when a user changes a password, that keytab has to be changed, etc.). Proxy authentication is more secure.
  • Samson Scharfrichter (over 5 years)
    Oh, please, stop calling that "my" approach. It's the "solve your own problem all by yourself without admin privileges" approach. Plus, the question is 2 years old -- are you sure Spark supported proxy accounts back then?
  • Tagar (over 5 years)
    proxy-user was added to Spark in release 1.3.0, almost 4 years ago, by this commit - github.com/apache/spark/commit/… on Feb/2015 :-)
  • Felipe Gonzalez (over 4 years)
    This is not working for me. Could you add a link to some documentation where this is stated? I can't find it anywhere.
  • Tharaka (almost 4 years)
    export HADOOP_USER_NAME=hadoop worked with AWS EMR.