Upload zip file using --archives option of spark-submit on yarn


Found the answer myself.

YARN does extract the archive, but it adds an extra folder with the same name as the archive. To be clear: if I put models/model1 and models/model2 in models.zip, then I have to access my models as models.zip/models/model1 and models.zip/models/model2.
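You can see the resulting layout with a quick local simulation of the extraction (this is plain `zipfile` code, not Spark itself; all paths here are made up for illustration):

```python
import os
import tempfile
import zipfile

# Simulate YARN's behavior: the archive is extracted into a directory
# named after the archive itself, so the zip's internal paths appear
# one level down.
workdir = tempfile.mkdtemp()

# Build models.zip containing models/model1 and models/model2.
zip_path = os.path.join(workdir, "models.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("models/model1", "weights-1")
    zf.writestr("models/model2", "weights-2")

# YARN extracts into a directory with the same name as the archive
# inside the container working directory.
extract_dir = os.path.join(workdir, "container", "models.zip")
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(extract_dir)

# So the files must be accessed as models.zip/models/model1, etc.
print(open(os.path.join(extract_dir, "models", "model1")).read())
```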

Moreover, we can make this cleaner using the # syntax.

The --files and --archives options support specifying file names with the #, similar to Hadoop. For example, you can specify --files localtest.txt#appSees.txt: this will upload the file you have locally named localtest.txt into HDFS, but it will be linked to by the name appSees.txt, and your application should use the name appSees.txt to reference it when running on YARN.
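The effect of the alias can also be shown with a small local simulation (hypothetical names throughout: with something like `--archives /path/models.zip#ml`, YARN extracts the archive and exposes it under the link name `ml` instead of `models.zip`):

```python
import os
import tempfile
import zipfile

# Hypothetical setup mimicking: --archives /path/models.zip#ml
workdir = tempfile.mkdtemp()

# Build a models.zip containing models/model1, as in the question.
zip_path = os.path.join(workdir, "models.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("models/model1", "weights-1")

# YARN extracts the archive elsewhere, then links it into the
# container working directory under the alias name given after '#'.
container = os.path.join(workdir, "container")
extracted = os.path.join(workdir, "cache", "models.zip")
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(extracted)
os.makedirs(container)
os.symlink(extracted, os.path.join(container, "ml"))  # the '#ml' alias

# The application now reads the file through the alias path.
print(open(os.path.join(container, "ml", "models", "model1")).read())
```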

Edit:

This answer was tested on Spark 2.0.0; I'm not sure about the behavior in other versions.

Author: Mo Tao

Updated on June 12, 2022

Comments

  • Mo Tao
    Mo Tao almost 2 years

I have a directory with some model files, and my application has to access these model files on the local file system for some reason.

Of course I know that the --files option of spark-submit can upload files to the working directory of each executor, and it does work.

However, I want to keep the directory structure of my files, so I came across the --archives option, which is documented as:

    YARN-only:
    ......
    --archives ARCHIVES         Comma separated list of archives to be extracted into the working directory of each executor.
    ......
    

But when I actually use it to upload models.zip, I find that YARN just puts it there without extraction, like what it does with --files. Have I misunderstood "to be extracted", or misused this option?

  • Little Bobby Tables
    Little Bobby Tables almost 7 years
    This has just been a life saver. Was it documented anywhere?!
  • Mo Tao
    Mo Tao almost 7 years
Glad it helped. I found no documentation about this, and I think it should appear in spark-submit -h.
  • Brad Hunter
    Brad Hunter almost 7 years
This saved me too. Best answer on Stack Overflow. By the way, from what I could tell, it didn't extract the file unless I added # and an alias. Maybe it was the version of Spark or something strange, but I recommend just adding the # alias for anyone struggling with this.
  • Rakesh SKadam
    Rakesh SKadam almost 6 years
Is there any way of extracting the zip without adding #?
  • Penumbra
    Penumbra over 5 years
@RakeshSKadam The # is just there to create an alias, making it easier to reference the files in your jobs/scripts. Without a #, it will just extract the zip into a folder with the same name as the zip file, like the OP indicates.
  • Valli69
    Valli69 about 5 years
Hi @MoTao - I have a similar problem. I have a spark-submit command like 'spark-submit --master yarn-client --driver-memory 4g --py-files /home/valli/pyFiles.zip --archives /home/valli/sql.zip#sqls /home/valli/main.py --sqls-path /home/valli/sqls'. But I'm still getting a 'FileNotFound' exception when I try to access the sql files in the zip folder. Please help me with this. Thanks in advance.
  • hopeIsTheonlyWeapon
    hopeIsTheonlyWeapon almost 4 years
@MoTao does this #appSees.txt option work with s3 as the source? I am trying to spark-submit as "spark-submit","--master","yarn","--jars","s3://xxx.jar","--py-files","s3://xxx.py","--archives","s3://xxxutils.zip#utils","s3://xxx.py","--deploy-mode","cluster"