Using spark-submit with a Python main
One way is to have a main driver program for your Spark application as a Python file (.py) that gets passed to spark-submit. This primary script contains the main method that identifies the entry point for the driver. It is also where you customize configuration properties and initialize the SparkContext.
The dependencies bundled in the egg are shipped to the executor nodes and can be imported inside the driver program.
You can write a small file as the main driver and execute:
./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg driver.py arg1 arg
The driver program would look something like:
import sys

from pyspark import SparkContext, SparkConf
from my_module import myfn  # note: hyphens are not valid in Python module names

if __name__ == '__main__':
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)
    myfn(sys.argv[1:], sc)  # pass the command-line arguments and the context
Pass the SparkContext object as an argument wherever necessary.
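For illustration, the imported module might look like the following. This is only a sketch under assumptions: the module name my_module, the function myfn, and the word-count logic are hypothetical, not from the original answer.

```python
# my_module.py -- hypothetical module shipped to executors via --py-files

def parse_line(line):
    # Pure helper; Spark serializes this function and runs it on the executors.
    return [w.lower() for w in line.split()]

def myfn(args, sc):
    # Receives the SparkContext created by the driver script,
    # plus the command-line arguments (here assumed to be input paths).
    rdd = sc.textFile(",".join(args))
    return rdd.flatMap(parse_line).countByValue()
```

Keeping the heavy lifting in a pure helper like parse_line also makes the module easy to unit-test without a running SparkContext.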
XapaJIaMnu
Updated on July 09, 2022

Comments
-
XapaJIaMnu almost 2 years
Reading this and this makes me think it is possible to have a Python file executed by spark-submit; however, I couldn't get it to work. My setup is a bit complicated: I require several different jars to be submitted together with my Python files in order for everything to function. My pyspark command, which works, is the following:

IPYTHON=1 ./pyspark --jars jar1.jar,/home/local/ANT/bogoyche/dev/rhine_workspace/env/Scala210-1.0/runtime/Scala2.10/scala-library.jar,jar2.jar --driver-class-path jar1.jar:jar2.jar

from sys import path
path.append('my-module')
from my-module import myfn
myfn(myargs)
I have packaged my Python files inside an egg, and the egg contains the main file, which makes the egg executable by calling:

python myegg.egg

I am now trying to form my spark-submit command and I can't seem to get it right. Here's where I am:

./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg arg1 arg

Error: Cannot load main class from JAR file:/path/to/pyspark/directory/arg1
Run with --help for usage help or --verbose for debug output

Instead of executing my .egg file, it is taking the first argument after the egg, treating it as a jar file, and trying to load a class from it. What am I doing wrong?
-
XapaJIaMnu almost 8 years
Thanks. I just about figured that out. However, it only works on my local machine and not on a MapReduce cluster ;/ For some reason, as soon as I provide the --driver-class-path option with extra jars, I override the default one (I think) and as a result I get an error:

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.NoSuchMethodError: org.apache.http.impl.client.DefaultHttpClient.execute(Lorg/apache/http/client/methods/HttpUriRequest;)Lorg/apache/http/client/methods/CloseableHttpResponse;

Ideas?
-
Shantanu Alshi almost 8 years
I'm afraid it is hard to analyse the exact issue in this case without looking at your actual program and environment variables.
-
XapaJIaMnu almost 8 years
The problem was a version conflict between the versions from my fat jar and the Spark jars. Thanks for your help!
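For later readers hitting the same NoSuchMethodError: one approach (a sketch only; the jar names are placeholders) is to set the classpath through Spark configuration properties and ask Spark to prefer your own jar versions with spark.driver.userClassPathFirst, which is the usual knob for dependency-version conflicts:

```shell
# Placeholder jar names; userClassPathFirst tells Spark to prefer the
# versions bundled in your jars over its own (marked experimental in
# the Spark docs, so test carefully).
./spark-submit \
  --jars jar1.jar,jar2.jar \
  --conf spark.driver.extraClassPath=jar1.jar:jar2.jar \
  --conf spark.driver.userClassPathFirst=true \
  --py-files path/to/my/egg.egg \
  driver.py arg1 arg
```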
-
thebeancounter over 6 years
Does this code have to run on the driver itself? Can it run on another machine and execute something on the driver in any way?
-
Rohit almost 5 years
I have created a .egg file and my main file (driver.py), and now I'm trying to submit my Spark application locally using the command:

spark-submit --py-files local:///C:/git_local/sparkETL/dist/sparkETL-0.1-py3.6.egg driver.py

But I'm getting the error "File not found c:\spark\bin\driver.py". Now I'm confused: I created the egg file, so shouldn't it look for driver.py inside the .egg? Why is it looking in the local path? Please help.
-
mj3c about 4 years
As @Rohit mentioned, it is annoying that the main .py file cannot be inside the .egg as well, or at least I have not managed to make that work. It is odd that we have to package everything but leave one file out of the package in order to execute the command.
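A common workaround for this is to keep only a thin stub outside the egg and put everything else inside it. A minimal sketch, assuming a hypothetical package my_package with an entry function run inside the egg:

```python
# driver.py -- thin stub passed to spark-submit; all real code lives in
# the .egg supplied via --py-files. Package and function names here are
# hypothetical.
import sys

def main(argv):
    try:
        # Deferred import: under spark-submit, the egg listed in
        # --py-files is already on sys.path by the time this runs.
        from my_package.job import run
    except ImportError:
        print("my_package not found -- was the egg passed via --py-files?")
        return None
    return run(argv)

if __name__ == "__main__":
    main(sys.argv[1:])
```

Since the stub contains no logic of its own, it rarely changes, and everything worth versioning and testing stays inside the packaged egg.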