Concatenating datasets of different RDDs in Apache Spark using Scala


Solution 1

I think you are looking for RDD.union

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)

Example (in the Spark shell):

val rdd1 = sc.parallelize(Seq((1, "Aug", 30),(1, "Sep", 31),(2, "Aug", 15),(2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10),(1, "Nov", 12),(2, "Oct", 5),(2, "Nov", 15)))
rdd1.union(rdd2).collect

res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), (2,Sep,10), (1,Oct,10), (1,Nov,12), (2,Oct,5), (2,Nov,15))

Solution 2

I had the same problem. To combine by row rather than by column, use unionAll:

val rddPart1= ???
val rddPart2= ???
val rddAll = rddPart1.unionAll(rddPart2)

I found it while reading the method summary for DataFrame. More information at: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html
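For reference, a minimal sketch of row-wise union on DataFrames. This assumes Spark 2.x, where unionAll was deprecated in favour of union; the session setup, column names, and sample data below are illustrative, not from the original question:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session, for illustration only.
val spark = SparkSession.builder
  .appName("union-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df1 = Seq((1, "Aug", 30), (1, "Sep", 31)).toDF("id", "month", "days")
val df2 = Seq((1, "Oct", 10), (1, "Nov", 12)).toDF("id", "month", "days")

// Appends the rows of df2 after the rows of df1.
// Note: columns are matched by position, not by name, and
// duplicates are kept despite the SQL-flavoured name unionAll.
val all = df1.union(df2)  // in Spark 1.x: df1.unionAll(df2)
```

As with RDD.union, this adds rows, not columns; deduplication would require a separate distinct() call.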

Author by

Atom
Updated on November 12, 2020

Comments

  • Atom
    Atom over 3 years

    Is there a way to concatenate datasets of two different RDDs in Spark?

    The requirement is: I create two intermediate RDDs using Scala which have the same column names. I need to combine the results of both RDDs and cache the result for access from the UI. How do I combine the datasets here?

    Both RDDs are of type spark.sql.SchemaRDD.

  • Atom
    Atom over 9 years
    rddPart1.union(rddPart2) will add columns of rddPart2 to rddPart1. I need to add rows of rddPart2 to rddPart1. FYI, both RDDs in this case have the same column names and types.
  • Atom
    Atom over 9 years
    It is more like inserting records into an already existing RDD, not adding new columns to it.
  • maasg
    maasg over 9 years
    Added an example. There are no new columns in a union RDD.
  • jwd
    jwd almost 7 years
    While the example makes it look like concatenation takes place (rdd1 is followed by rdd2 in the output), I don't believe union makes any guarantees about the ordering of the data. The elements could get mixed up with each other. Real concatenation is not so easy, because it implies an order dependency in your data, which works against the distributed nature of Spark.
  • Kartoch
    Kartoch over 5 years
    Not sure this is the right answer; the question was about RDDs, not about how to do it with DataFrames.
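Following up on the ordering concern raised in the comments: if a guaranteed global order is needed after combining, one option is to attach explicit indices and sort on them. A sketch, assuming a SparkContext named sc; the sample data is illustrative:

```scala
val rdd1 = sc.parallelize(Seq("a", "b", "c"))
val rdd2 = sc.parallelize(Seq("d", "e"))

// Tag each element with a global position: rdd1's elements keep their
// zipWithIndex positions, rdd2's are shifted past the end of rdd1.
val offset = rdd1.count()
val indexed1 = rdd1.zipWithIndex.map { case (v, i) => (i, v) }
val indexed2 = rdd2.zipWithIndex.map { case (v, i) => (i + offset, v) }

// sortByKey imposes the intended concatenation order regardless of
// how the union's partitions happen to be scheduled.
val ordered = indexed1.union(indexed2).sortByKey().values
```

This costs an extra pass (count) plus a shuffle for the sort, which is the price of forcing an order onto distributed data.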