Passing in Kerberos keytab/principal via SparkLauncher


Solution 1

The --principal arg is described as "Principal to be used to login to KDC, while running on secure HDFS".

So it is specific to the Hadoop integration. I'm not sure you are aware of that, because your post does not mention Hadoop, YARN, or HDFS.

Now, the Hadoop-specific Spark properties are described on the manual page Running on YARN. Surprise! Some of these properties sound familiar, such as spark.yarn.principal and spark.yarn.keytab.

Bottom line: the --blahblah command-line arguments are just shortcuts for properties that you can otherwise set in your code, or in the spark-defaults conf file.
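
For illustration, here's a minimal sketch of setting those properties programmatically instead of using the command-line shortcuts (the master, jar path, main class, principal, and keytab are placeholders for your environment):

    import org.apache.spark.launcher.SparkLauncher;

    public class KerberosLaunchViaConf {
        public static void main(String[] args) throws Exception {
            Process spark = new SparkLauncher()
                // Same effect as spark-submit's --principal/--keytab shortcuts;
                // all values here are placeholders.
                .setConf("spark.yarn.principal", "user@EXAMPLE.COM")
                .setConf("spark.yarn.keytab", "/etc/security/keytabs/user.keytab")
                .setMaster("yarn-cluster")
                .setAppResource("/path/to/app.jar")
                .setMainClass("com.example.MyApp")
                .launch();
            spark.waitFor();
        }
    }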

Solution 2

Following up on Samson's answer, I thought I'd add what I've experienced with Spark 1.6.1:

  1. You could use SparkLauncher.addSparkArg("--proxy-user", userName) to send in proxy-user info.
  2. You could use SparkLauncher.addSparkArg("--principal", kerbPrincipal) and SparkLauncher.addSparkArg("--keytab", kerbKeytab) (see the sketch after this list).
  3. So, you can only use either (1) or (2), but not both together - see https://github.com/apache/spark/pull/11358/commits/0159499a55591f25c690bfdfeecfa406142be02b
  4. In other words, either the launched process triggers a Spark job on YARN as itself, using its own Kerberos credentials, or the launched process impersonates an end user to trigger the Spark job on a cluster without Kerberos. On YARN, in the former case the job is owned by self, while in the latter case it is owned by the proxied user.
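
As promised above, a minimal sketch of option (2), using the two-argument addSparkArg overload (jar path, main class, principal, and keytab are placeholders):

    import org.apache.spark.launcher.SparkLauncher;

    public class KerberosLaunchViaArgs {
        public static void main(String[] args) throws Exception {
            Process spark = new SparkLauncher()
                .setMaster("yarn-cluster")
                .setAppResource("/path/to/app.jar")    // placeholder
                .setMainClass("com.example.MyApp")     // placeholder
                // Option (2): run as yourself with Kerberos credentials.
                // Do NOT also pass --proxy-user; the two are mutually exclusive.
                .addSparkArg("--principal", "user@EXAMPLE.COM")
                .addSparkArg("--keytab", "/etc/security/keytabs/user.keytab")
                .launch();
            spark.waitFor();
        }
    }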
Author: Sudarshan Thitte. Updated on July 15, 2022.

Comments

  • Sudarshan Thitte (almost 2 years)

    spark-submit allows us to pass in Kerberos credentials via the --keytab and --principal options. If I try to add these via addSparkArg("--keytab", keytab), I get a '--keytab' does not expect a value error - I presume this is due to lack of support as of v1.6.0.

    Is there another way I can submit my Spark job using this SparkLauncher class with Kerberos credentials? I'm using YARN with secured HDFS.

  • Samson Scharfrichter (over 8 years)
    Well, some of these arguments are used to "digest" the actual Java command line that will run the driver. So they cannot be set in the conf file (which is read by the driver; too late to change some JVM settings).
  • Sudarshan Thitte (over 8 years)
    I agree - I just wish SparkLauncher had been released at feature parity with spark-submit. The other option is to use Process or commons-exec to manually invoke the spark-submit script within a JVM (see the sketch below). Some have also recommended the use of Oozie, but that brings with it the additional overhead of dependency and workflow specification/management.
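
    As a rough sketch of that fallback (the spark-submit path and all arguments are placeholders), ProcessBuilder can invoke the script and inherit its streams, which avoids hand-rolling the buffer handling:

        import java.io.IOException;

        public class SparkSubmitViaProcess {
            public static void main(String[] args) throws IOException, InterruptedException {
                // All paths and identities below are placeholders.
                ProcessBuilder pb = new ProcessBuilder(
                    "/opt/spark/bin/spark-submit",
                    "--master", "yarn-cluster",
                    "--principal", "user@EXAMPLE.COM",
                    "--keytab", "/etc/security/keytabs/user.keytab",
                    "--class", "com.example.MyApp",
                    "/path/to/app.jar");
                pb.inheritIO();  // forward stdout/stderr instead of buffering manually
                int exit = pb.start().waitFor();
                System.out.println("spark-submit exited with code " + exit);
            }
        }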
  • Samson Scharfrichter (over 8 years)
    So, yes, I stand corrected. Most command-line arguments are just shortcuts to properties. For the rest, you can try to trace the exact Java command-line that is produced, analyze the args, and re-assemble your own command-line. I tried once... and soon stopped (quite tricky, not a priority at the time).
  • Sudarshan Thitte (over 8 years)
    I see... yep, it is tricky. Using Process is not exactly convenient, given that I would have to handle the stream buffers and waits myself. I think I'll fall back to using SparkSubmit directly for now (a sketch follows below). Worst case, I'll look into commons-exec.
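
    A minimal sketch of that approach, assuming the Spark assembly is on the classpath (note that SparkSubmit is an internal entry point and may call System.exit on failure; all arguments are placeholders):

        import org.apache.spark.deploy.SparkSubmit;

        public class DirectSparkSubmit {
            public static void main(String[] args) {
                // Same arguments the spark-submit script would forward; placeholders.
                SparkSubmit.main(new String[] {
                    "--master", "yarn-cluster",
                    "--principal", "user@EXAMPLE.COM",
                    "--keytab", "/etc/security/keytabs/user.keytab",
                    "--class", "com.example.MyApp",
                    "/path/to/app.jar"
                });
            }
        }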
  • Sudarshan Thitte (over 8 years)
    I stand corrected regarding my comment about parity of SparkLauncher. https://issues.apache.org/jira/browse/SPARK-9074 - I'll update this post if I see this PR in action as I understand it.
  • Samson Scharfrichter (about 8 years)
    AFAIK the only account that has this kind of "proxy" privilege in the Hadoop ecosystem is oozie (just as the CRON daemon does "su" on Linux, Oozie does on YARN). Do you think this Spark "proxy-user" property relates to Oozie, or to something completely different?
  • Sudarshan Thitte (about 8 years)
    Right - it doesn't relate to Oozie; rather, it relates to how Oozie impersonates the end user (akin to su on Linux). Essentially, your HDFS service is set up to allow the account invoking SparkLauncher with this proxy-user property to impersonate the target user when talking to services on your cluster (HDFS for data access, YARN for job submission, etc.). See User impersonation when using Hadoop for how to enable user impersonation on your HDFS service; a sketch of the relevant configuration follows below.
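
    For reference, a minimal sketch of the core-site.xml proxy-user entries (the "launcher" account name, host, and group are hypothetical):

        <property>
          <!-- Hosts from which the hypothetical "launcher" account may impersonate -->
          <name>hadoop.proxyuser.launcher.hosts</name>
          <value>launcher-host.example.com</value>
        </property>
        <property>
          <!-- Groups whose members "launcher" may impersonate -->
          <name>hadoop.proxyuser.launcher.groups</name>
          <value>end-users</value>
        </property>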