How to add a new column to a Spark RDD?
Solution 1
You do not have to use Tuple* objects at all to add a new column to an RDD.
It can be done by mapping each row, taking its original contents plus the elements you want to append, for example:
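import org.apache.spark.sql.Row

val rdd = ...
val withAppendedColumnsRdd = rdd.map(row => {
  // Copy the existing columns of this Row into a List
  val originalColumns = row.toSeq.toList
  // Columns are zero-indexed: (1) is the second column, (2) is the third
  val secondColValue = originalColumns(1).asInstanceOf[Int]
  val thirdColValue = originalColumns(2).asInstanceOf[Int]
  val newColumnValue = secondColValue + thirdColValue
  // Build a new Row with the extra column appended at the end
  Row.fromSeq(originalColumns :+ newColumnValue)
  // Row.fromSeq(originalColumns ++ List(newColumnValue1, newColumnValue2, ...)) // or add several new columns
})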
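To try the per-row logic without a Spark cluster, here is a minimal sketch that treats each row as a plain `Seq[Any]`, which is the type `row.toSeq` returns. The helper name `appendSumOfSecondAndThird` and the sample rows are illustrative assumptions, not part of the original answer.

```scala
// Hypothetical helper: appends the sum of the second and third columns
// to a row represented as a plain Seq[Any] (the type Row.toSeq returns).
def appendSumOfSecondAndThird(columns: Seq[Any]): Seq[Any] = {
  require(columns.length >= 3, "need at least three columns")
  // Assumes both columns hold Int values; the cast fails otherwise
  val second = columns(1).asInstanceOf[Int]
  val third = columns(2).asInstanceOf[Int]
  columns :+ (second + third)
}

// Local stand-in for the rows of the RDD (sample values only)
val sampleRows: Seq[Seq[Any]] = Seq(
  Seq(123, 523, 534, 893),
  Seq(536, 98, 1623, 98472)
)

val extendedRows = sampleRows.map(appendSumOfSecondAndThird)
// First row becomes: 123, 523, 534, 893, 1057
```

With an RDD of `Row` objects, the same helper could plausibly be applied as `rdd.map(row => Row.fromSeq(appendSumOfSecondAndThird(row.toSeq)))`. Because the row is handled as a sequence, the number of columns does not matter.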
Solution 2
If you have an RDD of Tuple4, apply map to convert each element to a Tuple5:
val rddTuple4RDD = ...........
val rddTuple5RDD = rddTuple4RDD.map(r => Tuple5(r._1, r._2, r._3, r._4, r._2 + r._3))
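The tuple approach does not scale to hundreds of columns, since Scala 2's TupleN classes stop at Tuple22. One alternative is to keep each row in a collection. The sketch below assumes the rows arrive as comma-separated text lines of integers, as the question's sample suggests; the helper name `parseAndAppend` is hypothetical.

```scala
// Hypothetical helper: parses one comma-separated line into a Vector[Int]
// and appends the sum of the second and third values.
// Assumes every field is an integer and the line has at least three fields.
def parseAndAppend(line: String): Vector[Int] = {
  val values = line.split(",").map(_.trim.toInt).toVector
  values :+ (values(1) + values(2))
}

// Local stand-in for the text lines (sample values only)
val sampleLines = Seq("123, 523, 534, 893", "536, 98, 1623, 98472")

val parsedWithSum = sampleLines.map(parseAndAppend)
```

If the lines were held in an `RDD[String]`, the same function could be passed to `map` to get an `RDD[Vector[Int]]` with the extra column at the end.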
Author by
Carter
Updated on September 14, 2022
Comments
-
Carter over 1 year
I have an RDD with MANY columns (e.g., hundreds). How do I add one more column at the end of this RDD?
For example, if my RDD is like below:
123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
......
29, 94, 956, ..., 758
How can I add a column whose value is the sum of the second and third columns?
Thank you very much.
-
Carter almost 9 years
Thanks sb'. One problem is that my real data has many columns (e.g., hundreds), so it is not easy to enumerate the values of all the columns. Is there a way to handle many columns?