Passing in Kerberos keytab/principal via SparkLauncher
Solution 1
The `--principal` arg is described as "Principal to be used to login to KDC, while running on secure HDFS". So it is specific to Hadoop integration. I'm not sure you are aware of that, because your post does not mention Hadoop, YARN, or HDFS.
Now, Spark properties that are Hadoop-specific are described on the "Running on YARN" manual page. Surprise! Some of these properties sound familiar, like `spark.yarn.principal` and `spark.yarn.keytab`.
Bottom line: the `--blahblah` command-line arguments are just shortcuts to properties that you can otherwise set in your code, or in the "spark-defaults" conf file.
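As an illustration, instead of the command-line shortcuts you could set the Hadoop-specific properties directly on the launcher. This is a sketch, not a definitive recipe; the principal, keytab path, app jar, and main class are all placeholders:

```java
import org.apache.spark.launcher.SparkLauncher;

public class KerberosLaunch {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
                // equivalent of --principal / --keytab for secure HDFS
                .setConf("spark.yarn.principal", "alice@EXAMPLE.COM")        // placeholder principal
                .setConf("spark.yarn.keytab", "/etc/security/alice.keytab")  // placeholder keytab path
                .setAppResource("/path/to/app.jar")   // placeholder jar
                .setMainClass("com.example.MyJob")    // placeholder main class
                .setMaster("yarn-cluster")
                .launch();
        spark.waitFor();
    }
}
```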
Solution 2
Following up on Samson's answer, I thought I'd add what I've experienced with Spark 1.6.1:
- (a) You could use `SparkLauncher.addSparkArg("--proxy-user", userName)` to send in proxy-user info.
- (b) You could use `SparkLauncher.addSparkArg("--principal", kerbPrincipal)` and `SparkLauncher.addSparkArg("--keytab", kerbKeytab)` to send in Kerberos credentials.
- So, you can only use either (a) OR (b), but not both together - see https://github.com/apache/spark/pull/11358/commits/0159499a55591f25c690bfdfeecfa406142be02b
- In other words, either the launched process triggers a Spark job on YARN as itself, using its Kerberos credentials (b), OR it impersonates an end user to trigger the Spark job on a cluster without Kerberos (a). On YARN, in the former case the job is owned by the launched process itself, while in the latter case it is owned by the proxied user.
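The two mutually exclusive modes above can be sketched as follows, assuming Spark 1.6.1; the jar path, main class, user name, principal, and keytab are placeholders (in practice you would pick one mode, not run both):

```java
import org.apache.spark.launcher.SparkLauncher;

public class LaunchModes {
    public static void main(String[] args) throws Exception {
        // (a) impersonate an end user -- no Kerberos credentials passed
        Process asProxy = new SparkLauncher()
                .setAppResource("/path/to/app.jar")   // placeholder jar
                .setMainClass("com.example.MyJob")    // placeholder main class
                .setMaster("yarn-cluster")
                .addSparkArg("--proxy-user", "bob")   // placeholder user name
                .launch();
        asProxy.waitFor();

        // (b) authenticate as self with a keytab -- do NOT combine with (a)
        Process asSelf = new SparkLauncher()
                .setAppResource("/path/to/app.jar")
                .setMainClass("com.example.MyJob")
                .setMaster("yarn-cluster")
                .addSparkArg("--principal", "alice@EXAMPLE.COM")       // placeholder
                .addSparkArg("--keytab", "/etc/security/alice.keytab") // placeholder
                .launch();
        asSelf.waitFor();
    }
}
```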
Sudarshan Thitte

Updated on July 15, 2022

Comments
- Sudarshan Thitte almost 2 years: `spark-submit` allows us to pass in Kerberos credentials via the `--keytab` and `--principal` options. If I try to add these via `addSparkArg("--keytab", keytab)`, I get a `'--keytab' does not expect a value` error - I presume this is due to lack of support as of v1.6.0. Is there another way by which I can submit my Spark job using this SparkLauncher class, with Kerberos credentials? I'm using YARN with secured HDFS.
- Samson Scharfrichter over 8 years: Well, some of these arguments are used to "digest" the actual Java command line that will run the driver. So they cannot be set in the conf file (which is read by the driver - too late to change some JVM settings).
- Sudarshan Thitte over 8 years: I agree - I just wish SparkLauncher had been released at feature parity with `spark-submit`. The other option is to use Process or commons-exec to manually invoke the `spark-submit` script within a JVM. Some have also recommended the use of Oozie, but that brings with it the additional overhead of dependency/workflow spec/mgmt.
- Samson Scharfrichter over 8 years: So, yes, I stand corrected. Most command-line arguments are just shortcuts to properties. For the rest, you can try to trace the exact Java command line that is produced, analyze the args, and re-assemble your own command line. I tried once... and soon stopped (quite tricky, not a priority at the time).
- Sudarshan Thitte over 8 years: I see... yep, it is tricky. Using Process is not exactly convenient, given I would have to handle the stream buffers and waits myself. I think I'll fall back to using SparkSubmit directly for now. Worst case, I'll look into commons-exec.
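The Process fallback discussed in these comments can be sketched in plain Java with `ProcessBuilder`, assuming `spark-submit` is on the PATH; all paths and the principal are placeholders. Assembling the command is separated from launching it so the command line can be inspected first:

```java
import java.util.Arrays;
import java.util.List;

public class ManualSubmit {
    // Assemble the spark-submit command line; paths and principal are placeholders.
    static List<String> buildCommand(String principal, String keytab, String appJar) {
        return Arrays.asList(
                "spark-submit",            // assumes spark-submit is on the PATH
                "--master", "yarn-cluster",
                "--principal", principal,
                "--keytab", keytab,
                appJar);
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand("alice@EXAMPLE.COM",
                "/etc/security/alice.keytab", "/path/to/app.jar");
        System.out.println(String.join(" ", cmd));
        // To actually run it (and then drain stdout/stderr yourself, as noted above):
        // Process p = new ProcessBuilder(cmd).inheritIO().start();
        // int exit = p.waitFor();
    }
}
```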
- Sudarshan Thitte over 8 years: I stand corrected regarding my comment about parity of SparkLauncher: https://issues.apache.org/jira/browse/SPARK-9074 - I'll update this post if I see this PR in action as I understand it.
- Samson Scharfrichter about 8 years: AFAIK the only account that has this kind of "proxy" privilege in the Hadoop ecosystem is `oozie` (just like the CRON daemon does "su" on Linux, Oozie does on YARN). Do you think this Spark "proxy-user" property relates to Oozie, or to something completely different?
- Sudarshan Thitte about 8 years: Right - it doesn't relate to Oozie itself; rather, it relates to how Oozie impersonates the end user (akin to su on Linux). Essentially, your HDFS service is set up to allow the account invoking SparkLauncher with this proxy-user property to impersonate the user it names when talking to services on your cluster (HDFS for data access, YARN for job submission, etc.). See "User impersonation when using Hadoop" for how to enable user impersonation on your HDFS service.
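On the Hadoop side, enabling that impersonation is typically done via proxy-user properties in core-site.xml. A sketch, where the "launcher" service account and the "analysts" group are placeholder names:

```xml
<!-- core-site.xml: allow the "launcher" service account (placeholder name)
     to impersonate members of the "analysts" group from any host -->
<property>
  <name>hadoop.proxyuser.launcher.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.launcher.groups</name>
  <value>analysts</value>
</property>
```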