Concatenating datasets of different RDDs in Apache Spark using Scala


Solution 1

I think you are looking for RDD.union

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)

Example (in the Spark shell):

val rdd1 = sc.parallelize(Seq((1, "Aug", 30),(1, "Sep", 31),(2, "Aug", 15),(2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10),(1, "Nov", 12),(2, "Oct", 5),(2, "Nov", 15)))
rdd1.union(rdd2).collect

res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), (2,Sep,10), (1,Oct,10), (1,Nov,12), (2,Oct,5), (2,Nov,15))

Solution 2

I had the same problem. To combine by row rather than by column, use unionAll:

val rddPart1= ???
val rddPart2= ???
val rddAll = rddPart1.unionAll(rddPart2)

I found it while reading the method summary for DataFrame. More information at: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html
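For reference, a minimal sketch of row-wise union on DataFrames. This assumes Spark 2.x, where unionAll was deprecated in favour of union; the session setup, column names, and sample data below are illustrative, not from the original question:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session, for illustration only.
val spark = SparkSession.builder
  .appName("union-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df1 = Seq((1, "Aug", 30), (1, "Sep", 31)).toDF("id", "month", "days")
val df2 = Seq((1, "Oct", 10), (1, "Nov", 12)).toDF("id", "month", "days")

// Appends the rows of df2 after the rows of df1.
// Note: columns are matched by position, not by name, and
// duplicates are kept despite the SQL-flavoured name unionAll.
val all = df1.union(df2)  // in Spark 1.x: df1.unionAll(df2)
```

As with RDD.union, this adds rows, not columns; deduplication would require a separate distinct() call.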

Author by

Atom
Updated on November 12, 2020

Comments

  • Atom
    Atom over 3 years

    Is there a way to concatenate datasets of two different RDDs in Spark?

    The requirement is: I create two intermediate RDDs using Scala which have the same column names. I need to combine the results of both RDDs and cache the result for access from the UI. How do I combine the datasets here?

    Both RDDs are of type spark.sql.SchemaRDD.

  • Atom
    Atom over 9 years
    rddPart1.union(rddPart2) will add columns of rddPart2 to rddPart1. I need to add rows of rddPart2 to rddPart1. FYI, both RDDs in this case have the same column names and types.
  • Atom
    Atom over 9 years
    It is more like inserting records into an already existing RDD, not adding new columns to it.
  • maasg
    maasg over 9 years
    Added an example. There are no new columns in a union RDD.
  • jwd
    jwd almost 7 years
    While the example makes it look like concatenation takes place (rdd1 is followed by rdd2 in the output), I don't believe union makes any guarantees about the ordering of the data. The elements could get mixed up with each other. Real concatenation is not so easy, because it implies an order dependency in your data, which works against the distributed nature of Spark.
  • Kartoch
    Kartoch over 5 years
    Not sure this is the right answer; the question was about RDDs, not about how to do it with DataFrames.
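Following up on the ordering concern raised in the comments: if a guaranteed global order is needed after combining, one option is to attach explicit indices and sort on them. A sketch, assuming a SparkContext named sc; the sample data is illustrative:

```scala
val rdd1 = sc.parallelize(Seq("a", "b", "c"))
val rdd2 = sc.parallelize(Seq("d", "e"))

// Tag each element with a global position: rdd1's elements keep their
// zipWithIndex positions, rdd2's are shifted past the end of rdd1.
val offset = rdd1.count()
val indexed1 = rdd1.zipWithIndex.map { case (v, i) => (i, v) }
val indexed2 = rdd2.zipWithIndex.map { case (v, i) => (i + offset, v) }

// sortByKey imposes the intended concatenation order regardless of
// how the union's partitions happen to be scheduled.
val ordered = indexed1.union(indexed2).sortByKey().values
```

This costs an extra pass (count) plus a shuffle for the sort, which is the price of forcing an order onto distributed data.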