Concatenating datasets of different RDDs in Apache Spark using Scala
Solution 1
I think you are looking for RDD.union
val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)
Example (in the spark-shell)
val rdd1 = sc.parallelize(Seq((1, "Aug", 30),(1, "Sep", 31),(2, "Aug", 15),(2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10),(1, "Nov", 12),(2, "Oct", 5),(2, "Nov", 15)))
rdd1.union(rdd2).collect
res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), (2,Sep,10), (1,Oct,10), (1,Nov,12), (2,Oct,5), (2,Nov,15))
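One caveat worth adding to this answer: union simply appends the rows of the second RDD, so duplicate rows survive. A minimal sketch, assuming a spark-shell session where sc is already defined (the values a and b are illustrative):

```scala
// Assumes spark-shell, so the SparkContext `sc` is already in scope.
// union appends the second RDD's rows; duplicates are kept.
val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4))

val all   = a.union(b)          // 5 elements; the 3 appears twice
val dedup = a.union(b).distinct // 4 distinct elements
```

Chaining .distinct gives set semantics, at the cost of a shuffle.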
Solution 2
I had the same problem. To combine by row instead of by column, use unionAll:
val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.unionAll(rddPart2)
I found it after reading the method summary for DataFrame. More information at: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html
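For completeness, a hedged DataFrame sketch (column names here are illustrative, not from the question): in Spark 2.x and later, unionAll is deprecated in favour of union, which likewise appends rows and resolves columns by position rather than by name.

```scala
// Assumes spark-shell, so `spark` (a SparkSession) is predefined.
import spark.implicits._

val df1 = Seq((1, "Aug", 30), (1, "Sep", 31)).toDF("id", "month", "days")
val df2 = Seq((1, "Oct", 10), (1, "Nov", 12)).toDF("id", "month", "days")

// Appends the rows of df2 after those of df1; the schemas must line up by position.
val dfAll = df1.union(df2)
```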
Author: Atom
Updated on November 12, 2020
Comments
-
Atom over 3 years: Is there a way to concatenate the datasets of two different RDDs in Spark? The requirement is: I create two intermediate RDDs in Scala that have the same column names, and I need to combine the results of both RDDs and cache the result for access from the UI. How do I combine the datasets here? The RDDs are of type spark.sql.SchemaRDD.
-
Atom over 9 years: rddPart1.union(rddPart2) will add the columns of rddPart2 to rddPart1. I need to add the rows of rddPart2 to rddPart1. FYI, both RDDs in this case have the same column names and types.
-
Atom over 9 years: It is more like inserting records into an already existing RDD, not adding new columns to the RDD.
-
maasg over 9 years: @example added an example. There are no new columns in a union of RDDs.
-
jwd almost 7 years: While the example makes it look like concatenation takes place (rdd1 is followed by rdd2 in the output), I don't believe union makes any guarantees about the ordering of the data. The two could get mixed up with each other. Real concatenation is not so easy, because it implies an order dependency in your data, which works against the distributed nature of Spark.
-
Kartoch over 5 years: Not sure this is the right answer; the question was about RDDs, not how to do it with DataFrames.
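On jwd's ordering caveat: if the "all rdd1 rows before all rdd2 rows" grouping must survive later shuffles, one option (a sketch with illustrative names, not from the answers above) is to tag each row with a source index before the union, then sort on that tag when the grouping is needed:

```scala
// Assumes spark-shell (`sc` predefined). Tag rows with 0/1 so the
// source grouping can be re-established with a sort after any shuffle.
// Note: order *within* each source RDD is still not guaranteed.
val rdd1 = sc.parallelize(Seq((1, "Aug", 30), (1, "Sep", 31)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10), (1, "Nov", 12)))

val ordered = rdd1.map(row => (0, row))
  .union(rdd2.map(row => (1, row)))
  .sortByKey()
  .values
```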