Using spark-submit with a Python main
One way is to have a main driver program for your Spark application as a Python file (.py) that gets passed to spark-submit. This primary script contains the main method that identifies the entry point for the driver. It is also where you customize configuration properties and initialize the SparkContext.
The dependencies bundled in the egg are shipped to the executor nodes and can be imported inside the driver program.
You can write a small file as the main driver and execute:
./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg driver.py arg1 arg
The driver program would look something like:
import sys

from pyspark import SparkContext, SparkConf
from my_module import myfn  # note: hyphens are not valid in Python module names

if __name__ == '__main__':
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)
    myfn(sys.argv[1:], sc)  # pass the command-line arguments and the context
Pass the SparkContext object as an argument wherever necessary.
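For illustration, the imported module might look like the following. This is only a sketch under assumptions: the module name my_module, the function myfn, and the word-count logic are hypothetical, not from the original answer.

```python
# my_module.py -- hypothetical module shipped to executors via --py-files

def parse_line(line):
    # Pure helper; Spark serializes this function and runs it on the executors.
    return [w.lower() for w in line.split()]

def myfn(args, sc):
    # Receives the SparkContext created by the driver script,
    # plus the command-line arguments (here assumed to be input paths).
    rdd = sc.textFile(",".join(args))
    return rdd.flatMap(parse_line).countByValue()
```

Keeping the heavy lifting in a pure helper like parse_line also makes the module easy to unit-test without a running SparkContext.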
XapaJIaMnu
Updated on July 09, 2022

Comments
-
XapaJIaMnu almost 2 years
Reading this and this makes me think it is possible to have a Python file executed by spark-submit; however, I couldn't get it to work. My setup is a bit complicated: I require several different jars to be submitted together with my Python files in order for everything to function. My pyspark command, which works, is the following:

IPYTHON=1 ./pyspark --jars jar1.jar,/home/local/ANT/bogoyche/dev/rhine_workspace/env/Scala210-1.0/runtime/Scala2.10/scala-library.jar,jar2.jar --driver-class-path jar1.jar:jar2.jar

from sys import path
path.append('my-module')
from my-module import myfn
myfn(myargs)
I have packaged my Python files inside an egg, and the egg contains the main file, which makes the egg executable by calling:

python myegg.egg

I am now trying to form my spark-submit command and I can't seem to get it right. Here's where I am:

./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg arg1 arg

Error: Cannot load main class from JAR file:/path/to/pyspark/directory/arg1
Run with --help for usage help or --verbose for debug output

Instead of executing my .egg file, it is taking the first argument after the egg, treating it as a jar file, and trying to load a class from it. What am I doing wrong?
-
XapaJIaMnu almost 8 years
Thanks. I just about figured that out. However, it only works on my local machine and not on a MapReduce cluster ;/ For some reason, as soon as I provide the --driver-class-path option with extra jars, I override the default one (I think) and as a result I get an error:

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.NoSuchMethodError: org.apache.http.impl.client.DefaultHttpClient.execute(Lorg/apache/http/client/methods/HttpUriRequest;)Lorg/apache/http/client/methods/CloseableHttpResponse;

Ideas?
-
Shantanu Alshi almost 8 years
I'm afraid it is hard to analyse the exact issue in this case without looking at your actual program and environment variables.
-
XapaJIaMnu almost 8 years
The problem was a version conflict between the versions from my fat jar and the Spark jars. Thanks for your help!
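For later readers hitting the same NoSuchMethodError: one approach (a sketch only; the jar names are placeholders) is to set the classpath through Spark configuration properties and ask Spark to prefer your own jar versions with spark.driver.userClassPathFirst, which is the usual knob for dependency-version conflicts:

```shell
# Placeholder jar names; userClassPathFirst tells Spark to prefer the
# versions bundled in your jars over its own (marked experimental in
# the Spark docs, so test carefully).
./spark-submit \
  --jars jar1.jar,jar2.jar \
  --conf spark.driver.extraClassPath=jar1.jar:jar2.jar \
  --conf spark.driver.userClassPathFirst=true \
  --py-files path/to/my/egg.egg \
  driver.py arg1 arg
```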
-
thebeancounter over 6 years
Does this code have to run on the driver itself? Can it run on another machine and execute something on the driver in any way?
-
Rohit almost 5 years
I have created a .egg file and my main file (driver.py), and now I'm trying to submit my Spark application locally using the command:

spark-submit --py-files local:///C:/git_local/sparkETL/dist/sparkETL-0.1-py3.6.egg driver.py

But I'm getting the error "File not found c:\spark\bin\driver.py". Now I'm confused: I created the egg file, so shouldn't it look for driver.py inside the .egg? Why is it looking in the local path? Please help.
-
mj3c about 4 years
As @Rohit mentioned, it is annoying that the main .py file cannot be inside the .egg as well, or at least I have not managed to make that work. It is odd that we have to package everything but leave one file out of the package in order to execute the command.
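A common workaround for this is to keep only a thin stub outside the egg and put everything else inside it. A minimal sketch, assuming a hypothetical package my_package with an entry function run inside the egg:

```python
# driver.py -- thin stub passed to spark-submit; all real code lives in
# the .egg supplied via --py-files. Package and function names here are
# hypothetical.
import sys

def main(argv):
    try:
        # Deferred import: under spark-submit, the egg listed in
        # --py-files is already on sys.path by the time this runs.
        from my_package.job import run
    except ImportError:
        print("my_package not found -- was the egg passed via --py-files?")
        return None
    return run(argv)

if __name__ == "__main__":
    main(sys.argv[1:])
```

Since the stub contains no logic of its own, it rarely changes, and everything worth versioning and testing stays inside the packaged egg.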