How to add third-party Java JAR files for use in PySpark

Solution 1

You can add external jars as arguments when launching pyspark; the list is comma-separated, with no spaces after the commas:

pyspark --jars file1.jar,file2.jar
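
Once the shell is up, classes from those jars can be reached through the Py4J gateway. A minimal sketch, assuming Spark 2.x (which creates the spark session in the shell) and that file1.jar contains the org.mydatabase.MyDBClient class mentioned in the question:

from py4j.java_gateway import java_import

# `spark` is created by the pyspark shell; its _jvm attribute is the Py4J JVM view.
# org.mydatabase.MyDBClient is the placeholder class name from the question.
java_import(spark._jvm, "org.mydatabase.MyDBClient")
client = spark._jvm.MyDBClient()  # assumes a public no-argument constructor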

Solution 2

You can add the path to the jar file via the Spark configuration at runtime.

Here is an example:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
sc = SparkContext(conf=conf)
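
With Spark 2.x the same property can also be set on a SparkSession builder (a minimal sketch, using the same placeholder jar path):

from pyspark.sql import SparkSession

# spark.jars takes a comma-separated list of jars to include on the driver and executor classpaths
spark = (SparkSession.builder
    .appName("jar_example")
    .config("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
    .getOrCreate())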

Refer to the Spark configuration documentation for more information.

Solution 3

You can pass --jars xxx.jar when using spark-submit:

./bin/spark-submit --jars xxx.jar your_spark_script.py

or set the environment variable SPARK_CLASSPATH:

SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' your_spark_script.py

Here, your_spark_script.py is a script written with the PySpark API.
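
For reference, your_spark_script.py could be as simple as the following sketch (org.mydatabase.MyDBClient is the placeholder class from the question; the jar supplied via --jars or SPARK_CLASSPATH is assumed to contain it):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jar_example").getOrCreate()

# Classes shipped in xxx.jar are now visible to the driver JVM
client = spark._jvm.org.mydatabase.MyDBClient()  # assumes a no-argument constructor

spark.stop()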

Solution 4

None of the above answers worked for me.

What I had to do with pyspark was:

pyspark --py-files /path/to/jar/xxxx.jar

For Jupyter Notebook:

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName("Spark_Test")
    .master('yarn-client')
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("spark.executor.cores", "4")
    .config("spark.executor.instances", "2")
    .config("spark.sql.shuffle.partitions","8")
    .enableHiveSupport()
    .getOrCreate())

# Then add the jar to the session:
spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")

Link to the source where I found it: https://github.com/graphframes/graphframes/issues/104

Solution 5

  1. Extract the downloaded jar file.
  2. Edit the system environment variables:
    • Add a variable named SPARK_CLASSPATH and set its value to \path\to\the\extracted\jar\file.

E.g., if you extracted the jar file to a folder named sparkts on the C: drive, its value should be: C:\sparkts

  3. Restart your cluster.
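
If you prefer not to edit the system settings dialog, the same variable can also be set in a Windows command prompt before launching pyspark (a sketch, using the example path above):

set SPARK_CLASSPATH=C:\sparkts
pyspark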

Comments

  • WestCoastProjects
    WestCoastProjects about 3 years

    I have some third-party database client libraries in Java. I want to access them through

    java_gateway.py
    

    E.g.: to make the client class (not a JDBC driver!) available to the Python client via the Java gateway:

    java_import(gateway.jvm, "org.mydatabase.MyDBClient")
    

    It is not clear where to add the third-party libraries to the JVM classpath. I tried adding them to compute-classpath.sh, but that did not seem to work. I get:

    Py4jError: Trying to call a package

    Also, when comparing to Hive: the Hive JAR files are not loaded via compute-classpath.sh, which makes me suspicious. There seems to be some other mechanism for setting up the JVM-side classpath.

  • WestCoastProjects
    WestCoastProjects about 9 years
    not in a position to check at this moment - but that sounds correct. The errors we were having actually had nothing to do with this, but in any case that does not invalidate your answer.
  • Tristan Reid
    Tristan Reid about 8 years
    Note that there are no spaces after the commas! It will fail if you put spaces in there.
  • Ryan Chou
    Ryan Chou about 8 years
    @stanislav Thanks for your modification.
  • Michael
    Michael almost 6 years
    I have spark-1.6.1-bin-hadoop2.6 and --jars doesn't work for me. The second option (setting SPARK_CLASSPATH) works. Anyone have any idea why first option doesn't work?
  • iggy
    iggy over 4 years
    addPyFile is for Python dependencies, not jars: spark.apache.org/docs/0.7.2/api/pyspark/…
  • justin cress
    justin cress about 4 years
    Does this require uploading and deploying the jars to the driver and workers? Is the "/path-to-jar/.." the path on the driver node?
  • AAB
    AAB about 4 years
    @justincress Hi, I ran it as a standalone cluster, but I feel the driver is where the jar files need to be present, since the workers/executors do as told by the driver.
  • jamiet
    jamiet about 4 years
    I stumbled in here after googling for “add jar to existing sparksession” so if this works I shall be delighted. Will try it out later today.
  • jamiet
    jamiet almost 4 years
    yep. adding the jar to the jars directory worked. I was then able to call a function in my jar that takes a org.apache.spark.sql.DataFrame like this: spark._sc._jvm.com.mypackage.MyObject.myFunction(myPySparkDataFrame._jdf)
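
A sketch of the pattern from that last comment, with placeholder names (com.mypackage.MyObject.myFunction stands in for whatever Scala/Java function the jar exposes; the jar must already be on the driver classpath via one of the solutions above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# Pass the underlying Java DataFrame (_jdf) to the JVM-side function
result_jdf = spark._sc._jvm.com.mypackage.MyObject.myFunction(df._jdf)

If the function returns a DataFrame, the Java object that comes back can be wrapped into a PySpark DataFrame with the pyspark.sql.DataFrame constructor.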