How to add third-party Java JAR files for use in PySpark
Solution 1
You can add external JARs as arguments to pyspark. Note that there must be no spaces after the commas, or the command will fail:
pyspark --jars file1.jar,file2.jar
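Once the shell starts, the classes in those JARs are on the JVM classpath and can be reached through the Py4J gateway. A minimal sketch, using the hypothetical org.mydatabase.MyDBClient class from the question below:

```python
# Inside the pyspark shell started with: pyspark --jars file1.jar,file2.jar
from py4j.java_gateway import java_import

# Make the client class visible on the gateway JVM
# (org.mydatabase.MyDBClient is the hypothetical class from the question)
java_import(sc._jvm, "org.mydatabase.MyDBClient")
client = sc._jvm.org.mydatabase.MyDBClient()  # assumes a no-arg constructor
```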
Solution 2
You can add the path to the JAR file using the Spark configuration at runtime. Here is an example:
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
sc = SparkContext(conf=conf)
Refer to the documentation for more information.
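On Spark 2.x the same spark.jars setting can go through the SparkSession builder instead. A minimal sketch reusing the JAR path from the example above (the app name is a placeholder):

```python
from pyspark.sql import SparkSession

# spark.jars must be set before the session (and its JVM) is created;
# it has no effect on an already-running session
spark = (SparkSession.builder
         .appName("jar-demo")  # placeholder name
         .config("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
         .getOrCreate())
```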
Solution 3
You can pass --jars xxx.jar when using spark-submit:
./bin/spark-submit --jars xxx.jar your_spark_script.py
or set the environment variable SPARK_CLASSPATH:
SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' your_spark_script.py
where your_spark_script.py was written using the PySpark API.
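Note that SPARK_CLASSPATH was deprecated after Spark 1.0; on newer releases the equivalent settings are spark.driver.extraClassPath and spark.executor.extraClassPath. A sketch, assuming the JAR exists at the same local path on every node:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        # Paths are not shipped automatically: the JAR must already exist
        # at this location on the driver and on every executor node
        .set("spark.driver.extraClassPath", "/path/xxx.jar")
        .set("spark.executor.extraClassPath", "/path/xxx.jar"))
sc = SparkContext(conf=conf)
```

Since the driver JVM is already running by the time an in-process SparkConf is applied, the driver-side entry is more reliably passed via spark-submit or spark-defaults.conf.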
Solution 4
None of the above answers worked for me. What I had to do with pyspark was:
pyspark --py-files /path/to/jar/xxxx.jar
For Jupyter Notebook:
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("Spark_Test")
         .master('yarn-client')
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .config("spark.executor.cores", "4")
         .config("spark.executor.instances", "2")
         .config("spark.sql.shuffle.partitions", "8")
         .enableHiveSupport()
         .getOrCreate())
# Add the JAR to the already-running session
spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")
Link to the source where I found it: https://github.com/graphframes/graphframes/issues/104
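As a commenter notes below, addPyFile is documented for Python dependencies rather than JARs, so if the classes still aren't visible in your setup, a common alternative is to hand the JAR to spark.jars when the session is built. A sketch based on the builder above:

```python
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("Spark_Test")
         .config("spark.jars", "/path/to/jar/xxxx.jar")  # shipped to driver and executors
         .enableHiveSupport()
         .getOrCreate())
```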
Solution 5
- Extract the downloaded JAR file.
- Edit the system environment variables.
- Add a variable named SPARK_CLASSPATH and set its value to \path\to\the\extracted\jar\file. E.g.: if you extracted the JAR file on the C drive into a folder named sparkts, its value should be C:\sparkts (see the per-process sketch after this list).
- Restart your cluster.
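If you would rather not change the system-wide settings, the same variable can be set per process before PySpark launches its JVM. A minimal sketch using the C:\sparkts path from the example above, assuming this route works on your Spark version (SPARK_CLASSPATH is deprecated in newer releases):

```python
import os

# Must be set before the SparkContext launches the JVM subprocess
os.environ["SPARK_CLASSPATH"] = r"C:\sparkts"

from pyspark import SparkContext
sc = SparkContext(appName="classpath-demo")  # placeholder app name
```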
Updated on March 13, 2021

Comments
-
WestCoastProjects about 3 years
I have some third-party database client libraries in Java. I want to access them through java_gateway.py.
E.g.: to make the client class (not a JDBC driver!) available to the Python client via the Java gateway:
java_import(gateway.jvm, "org.mydatabase.MyDBClient")
It is not clear where to add the third-party libraries to the JVM classpath. I tried adding them to compute-classpath.sh, but that did not seem to work. I get:
Py4jError: Trying to call a package
Also, comparing to Hive: the Hive JAR files are not loaded via compute-classpath.sh, which makes me suspicious. There seems to be some other mechanism for setting up the JVM-side classpath.
-
WestCoastProjects about 9 years: Not in a position to check at this moment - but that sounds correct. The errors we were having actually had nothing to do with this, but in any case that does invalidate your answer.
-
Tristan Reid about 8 years: Note that there are no spaces after the commas! It will fail if you put spaces in there.
-
Ryan Chou about 8 years: @stanislav Thanks for your modification.
-
Michael almost 6 years: I have spark-1.6.1-bin-hadoop2.6 and --jars doesn't work for me. The second option (setting SPARK_CLASSPATH) works. Anyone have any idea why the first option doesn't work?
-
iggy over 4 years: addPyFile is for Python dependencies, not JARs: spark.apache.org/docs/0.7.2/api/pyspark/…
-
justin cress about 4 years: Does this require uploading and deploying the JARs to the driver and workers? Is the "/path-to-jar/.." the path on the driver node?
-
AAB about 4 years: @justincress Hi, I ran it as a standalone cluster, but I believe the driver is where the JAR files need to be present, as the workers/executors do as told by the driver.
-
jamiet about 4 years: I stumbled in here after googling for "add jar to existing sparksession" so if this works I shall be delighted. Will try it out later today.
-
jamiet almost 4 years: Yep, adding the JAR to the jars directory worked. I was then able to call a function in my JAR that takes an org.apache.spark.sql.DataFrame like this: spark._sc._jvm.com.mypackage.MyObject.myFunction(myPySparkDataFrame._jdf)