How to use a Scala class inside Pyspark
Yes, it is possible, although it can be far from trivial. Typically you want a Java-friendly wrapper so you don't have to deal with Scala features that cannot be easily expressed in plain Java and, as a result, don't play well with the Py4J gateway.
Assuming your class is in the package com.example and you have a Python DataFrame called df:

    df = ...  # Python DataFrame

you'll have to:
- Build a jar using your favorite build tool.
- Include it in the driver classpath, for example using the --driver-class-path argument for the PySpark shell / spark-submit. Depending on the exact code you may have to pass it using --jars as well (a quick way to check that the class is actually visible from Python is sketched right after this list).
- Extract the JVM instance from a Python SparkContext instance:

      jvm = sc._jvm

- Extract the Scala SQLContext from a SQLContext instance:

      ssqlContext = sqlContext._ssql_ctx

- Extract the Java DataFrame from the df:

      jdf = df._jdf

- Create a new instance of SimpleClass:

      simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, "v")

- Call the exe method and wrap the result using a Python DataFrame:

      from pyspark.sql import DataFrame
      DataFrame(simpleObject.exe(), ssqlContext)
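If the constructor call in the steps above fails with "TypeError: 'JavaPackage' object is not callable", the jar is most likely missing from the driver classpath. As a quick sanity check you can run on the driver (a sketch, assuming the com.example.SimpleClass name used above), look at what the JVM view resolves the name to:

    from py4j.java_gateway import JavaClass

    # Resolves to a JavaClass when the jar is visible to the driver JVM,
    # and to a plain JavaPackage when it is not.
    cls = sc._jvm.com.example.SimpleClass
    print(isinstance(cls, JavaClass))  # True means the class was found on the classpath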
The result of the last step (the wrapped DataFrame) should be a valid PySpark DataFrame. You can of course combine all the steps into a single call.
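For instance, here is a minimal sketch of such a combined call wrapped in a small helper. It assumes the jar has already been made available to the driver as described above, that sc, sqlContext and df exist as before, and that the helper name, the column name "v" and the jar path in the comment are just placeholders:

    # Launched with something along the lines of (placeholder paths):
    #   pyspark --driver-class-path /path/to/simple-class.jar --jars /path/to/simple-class.jar
    from pyspark.sql import DataFrame

    def scala_simple_select(sqlContext, df, column):
        # JVM gateway of the SparkContext backing the SQLContext
        jvm = sqlContext._sc._jvm
        # Underlying Scala SQLContext and Java DataFrame
        ssqlContext = sqlContext._ssql_ctx
        jdf = df._jdf
        # Instantiate the Scala class through Py4J and call its exe method
        simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, column)
        # Wrap the returned Java DataFrame back into a PySpark DataFrame
        return DataFrame(simpleObject.exe(), sqlContext)

    result = scala_simple_select(sqlContext, df, "v")
    result.show()

Everything in this helper runs on the driver, which is exactly the constraint pointed out in the note below.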
Important: This approach is possible only if the Python code is executed solely on the driver. It cannot be used inside a Python action or transformation. See How to use Java/Scala function from an action or a transformation? for details.
Comments
Alberto Bonsanto, about 1 year ago

I've been searching for a while for a way to use a Scala class in Pyspark, and I haven't found any documentation or guide on the subject.

Let's say I create a simple class in Scala that uses some libraries of apache-spark, something like:

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.functions.col

    class SimpleClass(sqlContext: SQLContext, df: DataFrame, column: String) {
      def exe(): DataFrame = {
        import sqlContext.implicits._

        df.select(col(column))
      }
    }

- Is there any possible way to use this class in Pyspark?
- Is it too tough?
- Do I have to create a .py file?
- Is there any guide that shows how to do that?

By the way, I also looked at the spark code and I felt a bit lost, and I was incapable of replicating their functionality for my own purpose.
se7entyse7en, over 5 years ago

What happens if the Scala class also has alternative constructors? Is it supposed to work?