Running Spark jobs on a YARN cluster with additional files


Solution 1

I don't use Python myself, but I found some clues that may be useful for you in the source code of Spark 1.3's SparkSubmitArguments:

  • --py-files PY_FILES, Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.

  • --files FILES, Comma-separated list of files to be placed in the working directory of each executor.

  • --archives ARCHIVES, Comma-separated list of archives to be extracted into the working directory of each executor.

Also note that your arguments to spark-submit should follow this style:

Usage: spark-submit [options] <app jar | python file> [app arguments]
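
For example, for a job like the one in the question, a submission that follows this ordering might look like the command below (the jar, class, and paths are placeholders for illustration):

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --files /absolute/path/to/local/test.py \
    --class somepackage.PythonLauncher \
    path/to/driver.jar \
    path/to/input/part-* test.py path/to/output

Everything before the application jar is an option to spark-submit itself; everything after the jar is passed to your main class as app arguments.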

Solution 2

To understand why, you need to be familiar with the differences between Spark's three running modes: standalone, yarn-client, and yarn-cluster.

With standalone and yarn-client, the driver program runs on your local machine while the worker programs run somewhere else (in standalone mode possibly another temp directory under $SPARK_HOME, in yarn-client mode on a random node in the cluster), so a local path specified in the driver program can be accessed by the driver but not by the workers.

However, when you run in yarn-cluster mode, both the driver and the worker programs run on random cluster nodes, and local paths are resolved relative to each machine's own working directory, so a file-not-found exception is thrown. You need to ship such files with --files or --archives when submitting, package them into a .egg or .jar yourself before submitting, or use the addFile API in your driver program, as sketched below.
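
A minimal driver-side sketch of the addFile approach could look like the following (the object name, HDFS path, and file name are placeholders; the important parts are that the path passed to addFile must be readable from wherever the driver runs, and that each task resolves the shipped file by name via SparkFiles.get rather than by its original local path):

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object PipeLauncher {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipe-example"))

    // Distribute the script to every node. In yarn-cluster mode the driver
    // runs on a cluster node, so use a path it can actually read (e.g. HDFS).
    sc.addFile("hdfs:///user/someuser/test.py")

    sc.textFile(args(0))
      // Resolve the local copy of the script on whichever node runs the task.
      .pipe(Seq("python2", SparkFiles.get("test.py")))
      .saveAsTextFile(args(1))

    sc.stop()
  }
}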

Solution 3

You may want to try using local:// and the $SPARK_YARN_STAGING_DIR environment variable.

For example, the following should work:

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --files /absolute/path/to/local/test.py \
    --class somepackage.PythonLauncher \
    local://$SPARK_YARN_STAGING_DIR/test.py


Comments

  • Alexander Tokarev
    Alexander Tokarev over 1 year

    I'm writing a simple spark application that uses some input RDD, sends it to an external script via pipe, and writes an output of that script to a file. Driver code looks like this:

    import org.apache.spark.SparkFiles

    val input = args(0)
    val scriptPath = args(1)
    val output = args(2)
    val sc = getSparkContext
    if (args.length == 4) {
      //Here I pass an additional argument which contains an absolute path to a script on my local machine, only for local testing
      sc.addFile(args(3))
    }
    
    sc.textFile(input).pipe(Seq("python2", SparkFiles.get(scriptPath))).saveAsTextFile(output)
    

    When I run it on my local machine it works fine. But when I submit it to a YARN cluster via

    spark-submit --master yarn --deploy-mode cluster --files /absolute/path/to/local/test.py --class somepackage.PythonLauncher path/to/driver.jar path/to/input/part-* test.py path/to/output
    

    it fails with an exception.

    Lost task 1.0 in stage 0.0 (TID 1, rwds2.1dmp.ru): java.lang.Exception: Subprocess exited with status 2
    

    I've tried different variations of the pipe command. For instance, .pipe("cat") works fine and behaves as expected, but .pipe(Seq("cat", scriptPath)) fails with error code 1, so it seems that Spark can't figure out the path to the script on a cluster node.

    Any suggestions?

    • Irene
      Irene over 8 years
      Any updates on this?
  • Mikel Urkia
    Mikel Urkia almost 9 years
    Agreed. I'd say --files FILES is what he actually needs to send the file to each executor.
  • Alexander Tokarev
    Alexander Tokarev almost 9 years
    That's not what I'm trying to do. A file that I pass with the --files parameter is successfully uploaded to the .sparkStaging directory on HDFS. All I want is to access this file via SparkFiles.get() from every cluster node while my job is running on the cluster.
  • Anuja Khemka
    Anuja Khemka over 6 years
    @AlexanderTokarev any updates on this? I am trying the same but it fails.
  • Anuja Khemka
    Anuja Khemka over 6 years
    I want to be able to send my --files and have them in the same working dir
  • Brad
    Brad over 6 years
    I posted an answer to a similar question on how to send multiple files that may be of interest. It's Java-centric and uses the --files argument to send multiple properties files.