How to add third-party Java JAR files for use in PySpark
Solution 1
You can add external JARs as arguments to pyspark. Note that there must be no spaces after the commas, or the command will fail:
pyspark --jars file1.jar,file2.jar
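Once the shell starts, the classes in those JARs are on the JVM classpath and can be reached through the Py4J gateway. A minimal sketch, using the hypothetical org.mydatabase.MyDBClient class from the question below:

```python
# Inside the pyspark shell started with: pyspark --jars file1.jar,file2.jar
from py4j.java_gateway import java_import

# Make the client class visible on the gateway JVM
# (org.mydatabase.MyDBClient is the hypothetical class from the question)
java_import(sc._jvm, "org.mydatabase.MyDBClient")
client = sc._jvm.org.mydatabase.MyDBClient()  # assumes a no-arg constructor
```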
Solution 2
You can add the path to the JAR file using the Spark configuration at runtime. Here is an example:
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
sc = SparkContext(conf=conf)
Refer to the documentation for more information.
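On Spark 2.x the same spark.jars setting can go through the SparkSession builder instead. A minimal sketch reusing the JAR path from the example above (the app name is a placeholder):

```python
from pyspark.sql import SparkSession

# spark.jars must be set before the session (and its JVM) is created;
# it has no effect on an already-running session
spark = (SparkSession.builder
         .appName("jar-demo")  # placeholder name
         .config("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
         .getOrCreate())
```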
Solution 3
You can pass --jars xxx.jar when using spark-submit:
./bin/spark-submit --jars xxx.jar your_spark_script.py
or set the environment variable SPARK_CLASSPATH:
SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' your_spark_script.py
where your_spark_script.py was written using the PySpark API.
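Note that SPARK_CLASSPATH was deprecated after Spark 1.0; on newer releases the equivalent settings are spark.driver.extraClassPath and spark.executor.extraClassPath. A sketch, assuming the JAR exists at the same local path on every node:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        # Paths are not shipped automatically: the JAR must already exist
        # at this location on the driver and on every executor node
        .set("spark.driver.extraClassPath", "/path/xxx.jar")
        .set("spark.executor.extraClassPath", "/path/xxx.jar"))
sc = SparkContext(conf=conf)
```

Since the driver JVM is already running by the time an in-process SparkConf is applied, the driver-side entry is more reliably passed via spark-submit or spark-defaults.conf.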
Solution 4
None of the above answers worked for me. What I had to do with pyspark was:
pyspark --py-files /path/to/jar/xxxx.jar
For Jupyter Notebook:
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("Spark_Test")
         .master('yarn-client')
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .config("spark.executor.cores", "4")
         .config("spark.executor.instances", "2")
         .config("spark.sql.shuffle.partitions", "8")
         .enableHiveSupport()
         .getOrCreate())
# Add the JAR to the already-running session
spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")
Link to the source where I found it: https://github.com/graphframes/graphframes/issues/104
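As a commenter notes below, addPyFile is documented for Python dependencies rather than JARs, so if the classes still aren't visible in your setup, a common alternative is to hand the JAR to spark.jars when the session is built. A sketch based on the builder above:

```python
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("Spark_Test")
         .config("spark.jars", "/path/to/jar/xxxx.jar")  # shipped to driver and executors
         .enableHiveSupport()
         .getOrCreate())
```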
Solution 5
- Extract the downloaded JAR file.
- Edit the system environment variables.
- Add a variable named SPARK_CLASSPATH and set its value to \path\to\the\extracted\jar\file. E.g.: if you extracted the JAR file on the C drive into a folder named sparkts, its value should be C:\sparkts (see the per-process sketch after this list).
- Restart your cluster.
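If you would rather not change the system-wide settings, the same variable can be set per process before PySpark launches its JVM. A minimal sketch using the C:\sparkts path from the example above, assuming this route works on your Spark version (SPARK_CLASSPATH is deprecated in newer releases):

```python
import os

# Must be set before the SparkContext launches the JVM subprocess
os.environ["SPARK_CLASSPATH"] = r"C:\sparkts"

from pyspark import SparkContext
sc = SparkContext(appName="classpath-demo")  # placeholder app name
```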
Updated on March 13, 2021

Comments
-
WestCoastProjects about 3 years
I have some third-party database client libraries in Java. I want to access them through java_gateway.py.
E.g.: to make the client class (not a JDBC driver!) available to the Python client via the Java gateway:
java_import(gateway.jvm, "org.mydatabase.MyDBClient")
It is not clear where to add the third-party libraries to the JVM classpath. I tried adding them to compute-classpath.sh, but that did not seem to work. I get:
Py4jError: Trying to call a package
Also, comparing to Hive: the Hive JAR files are not loaded via compute-classpath.sh, which makes me suspicious. There seems to be some other mechanism for setting up the JVM-side classpath.
-
WestCoastProjects about 9 years: Not in a position to check at this moment - but that sounds correct. The errors we were having actually had nothing to do with this, but in any case that does invalidate your answer.
-
Tristan Reid about 8 years: Note that there are no spaces after the commas! It will fail if you put spaces in there.
-
Ryan Chou about 8 years: @stanislav Thanks for your modification.
-
Michael almost 6 years: I have spark-1.6.1-bin-hadoop2.6 and --jars doesn't work for me. The second option (setting SPARK_CLASSPATH) works. Anyone have any idea why the first option doesn't work?
-
iggy over 4 years: addPyFile is for Python dependencies, not JARs: spark.apache.org/docs/0.7.2/api/pyspark/…
-
justin cress about 4 years: Does this require uploading and deploying the JARs to the driver and workers? Is the "/path-to-jar/.." the path on the driver node?
-
AAB about 4 years: @justincress Hi, I ran it as a standalone cluster, but I believe the driver is where the JAR files need to be present, as the workers/executors do as told by the driver.
-
jamiet about 4 years: I stumbled in here after googling for "add jar to existing sparksession" so if this works I shall be delighted. Will try it out later today.
-
jamiet almost 4 years: Yep, adding the JAR to the jars directory worked. I was then able to call a function in my JAR that takes an org.apache.spark.sql.DataFrame like this: spark._sc._jvm.com.mypackage.MyObject.myFunction(myPySparkDataFrame._jdf)