Spark / Scala: Passing RDD to Function


Solution 1

In Scala nothing gets copied (in the sense of pass-by-value as in C/C++) when passed around. Most of the basic types — Int, String, Double, etc. — are immutable, so passing them by reference is perfectly safe. (Note: if you pass a mutable object and change it, then anyone holding a reference to that object will see the change.)
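A minimal sketch of that note about mutable objects, using plain Scala (the object name and helper are illustrative, not from the original):

```scala
import scala.collection.mutable.ArrayBuffer

object PassByRef {
  // Only the reference is copied into the parameter, not the object itself,
  // so a mutation made inside the function is visible to the caller.
  def addItem(buf: ArrayBuffer[Int]): Unit = buf += 42

  def main(args: Array[String]): Unit = {
    val shared = ArrayBuffer(1, 2, 3)
    addItem(shared)
    println(shared.mkString(",")) // 1,2,3,42 — the caller sees the change
  }
}
```

With an immutable value (`List`, `String`, …) no such surprise is possible, which is why passing them around is safe.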

On top of that, RDDs are lazy, distributed, immutable collections. Passing an RDD into a function and applying transformations to it (map, filter, etc.) doesn't actually transfer any data or trigger any computation.

All chained transformations are "remembered" and will automatically be triggered in the right order when you invoke an action on the RDD, such as persisting it or collecting it locally at the driver (through collect(), take(n), etc.)
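The same lazy-until-forced behaviour can be illustrated without a Spark cluster using Scala's lazy collection views as an analogy (the object name and counter are made up for this sketch; an RDD works analogously, not identically):

```scala
object LazyDemo {
  // Returns (evaluations before forcing, evaluations after, the result).
  def run(): (Int, Int, List[Int]) = {
    var evals = 0
    // Like an RDD transformation: building the pipeline computes nothing yet.
    val pipeline = (1 to 5).view.map { n => evals += 1; n * 2 }
    val before = evals          // still 0 — the map has not run
    val forced = pipeline.toList // like an action: forces the whole chain
    (before, evals, forced)
  }
}
```

Calling `LazyDemo.run()` shows zero evaluations before `toList` and five after, mirroring how an RDD's transformations only execute once an action such as `collect()` is called.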

Solution 2

Spark implements the principle "send the code to the data" rather than sending the data to the code. So here quite the opposite happens: it is your function that gets serialized and distributed to the nodes holding the RDD's partitions.

RDDs are immutable, so your function will either create a new RDD as its result (a transformation) or produce a plain value (an action).
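The transformation-vs-action distinction can be sketched with immutable Scala collections as a stand-in (this is an analogy, not actual Spark code; the object name is illustrative):

```scala
object ImmutableOps {
  // "Transformations" build a new collection; "actions" return a plain value.
  // In both cases the original collection is left untouched.
  def demo(words: List[String]): (List[String], List[String], Int) = {
    val upper = words.map(_.toUpperCase) // like a transformation: new value
    val n     = words.size               // like an action: a concrete result
    (words, upper, n)                    // words itself is unchanged
  }
}
```

With a real RDD the shape is the same: `rdd.map(...)` yields a new `RDD`, while `rdd.count()` yields a `Long`, and the input RDD is never modified.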

The interesting question here is: if you define a function, what exactly gets serialized and sent to the executors (and distributed among the different nodes, with its transfer cost)? There is a nice explanation here:

http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark

Author by Jes
Updated on June 17, 2022

Comments

  • Jes, almost 2 years ago

    I am curious what exactly passing a RDD to a function does in Spark.

    def my_func(x: RDD[String]): RDD[String] = {
      do_something_here
    }
    

    Suppose we define a function as above. When we call the function and pass an existing RDD[String] object as the input parameter, does my_func make a "copy" of this RDD for the function parameter? In other words, is it call-by-reference or call-by-value?