How do I split a Spark rdd Array[(String, Array[String])]?


Solution 1

What you did with rdd2 = rdd.map(c => (c(7),c)) is map each record to a tuple, which is exactly what the type says: rdd2: org.apache.spark.rdd.RDD[(String, Array[String])]. Now, if you want the record back, you need to extract it from that tuple. You can map again, taking only the second part of the tuple (the Array[String]), like so: rdd3.map(_._2)
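A minimal sketch of the map-to-tuple and extract-back steps, using a plain Scala Array of hypothetical rows in place of the RDD (the RDD's map behaves analogously):

```scala
// Hypothetical rows: each record is an Array[String] with the sort key at index 7,
// mirroring the question's rdd.map(c => (c(7), c))
val rows = Array(
  Array("a", "b", "c", "d", "e", "f", "g", "k2"),
  Array("h", "i", "j", "k", "l", "m", "n", "k1")
)

// Pair each record with its key, as in rdd2 = rdd.map(c => (c(7), c))
val paired = rows.map(c => (c(7), c))

// Recover just the records by taking the second part of each tuple
val rowsBack = paired.map(_._2)
```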

But I would strongly suggest trying rdd.sortBy(_(7)) or something along those lines. That way you don't need to bother with tuples at all.
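Sketched on a plain Scala collection with made-up rows (on an RDD the call is spelled the same way, rdd.sortBy(_(7))):

```scala
// Hypothetical rows with the sort key at index 7
val rows = Array(
  Array("a", "b", "c", "d", "e", "f", "g", "zz"),
  Array("h", "i", "j", "k", "l", "m", "n", "aa")
)

// Sort directly on the element at index 7 -- no intermediate tuples needed
val sorted = rows.sortBy(_(7))
```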

Solution 2

If you want to sort the rdd using the 7th string in the array, you can do it directly by

rdd.sortBy(_(6)) // array starts at 0 not 1

or

rdd.sortBy(arr => arr(6))

That will save you all the hassle of doing multiple transformations. The reason why rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is because that's not how you access an element inside an Array. To access the 7th element of an array, say arr, you should do arr(6).
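The distinction can be shown in two lines of plain Scala: _._7 is tuple syntax, while arrays use apply with a zero-based index.

```scala
// Arrays are indexed with apply, starting at 0: the 7th element is arr(6)
val arr = Array("a", "b", "c", "d", "e", "f", "g")
val seventhFromArray = arr(6)

// _7 only exists on tuples, e.g. a Tuple7 -- not on Array
val tup = ("a", "b", "c", "d", "e", "f", "g")
val seventhFromTuple = tup._7
```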

To test this, I did the following:

val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))

// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))

Here's the result:

Array(Array("asg", "qtw", "hasd"), Array("csg", "dip", "hwd"), Array("ard", "bas", "wer"))

Solution 3

Just do this:

val rdd4 = rdd3.map(_._2)
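The whole pipeline from the question can be sketched on a plain Scala collection; on a real RDD the calls are analogous (sortByKey replaces sortBy(_._1), and sc.parallelize builds the RDD). The row values here are made up for illustration.

```scala
// Stand-in for the original RDD[Array[String]]
val rdd = Array(
  Array("x1", "x2", "x3", "x4", "x5", "x6", "x7", "beta"),
  Array("y1", "y2", "y3", "y4", "y5", "y6", "y7", "alpha")
)

val rdd2 = rdd.map(c => (c(7), c)) // pair each row with its sort key
val rdd3 = rdd2.sortBy(_._1)       // sortByKey() on a real Spark RDD
val rdd4 = rdd3.map(_._2)          // drop the key, keeping the sorted rows
```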
Author: KDC

Updated on August 21, 2022

Comments

  • KDC
    KDC over 1 year

    I'm practicing on doing sorts in the Spark shell. I have an rdd with about 10 columns/variables. I want to sort the whole rdd on the values of column 7.

    rdd
    org.apache.spark.rdd.RDD[Array[String]] = ...
    

    From what I gather, the way to do that is by using sortByKey, which in turn only works on pairs. So I mapped it so I'd have a pair consisting of column 7 (the string values) and the full original record (the array of strings):

    rdd2 = rdd.map(c => (c(7),c))
    rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
    

    I then apply sortByKey, still no problem...

    rdd3 = rdd2.sortByKey()
    rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
    

    But now how do I split off, collect and save that sorted original rdd from rdd3 (Array[String])? Whenever I try a split on rdd3 it gives me an error:

    val rdd4 = rdd3.map(_.split(',')(2))
    <console>:33: error: value split is not a member of (String, Array[String])
    

    What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?

  • KDC
    KDC almost 8 years
    I tried your suggestion of rdd.sortBy(_=>_._7), but that puts out "error: identifier expected but '=>' found". Can you edit that so I can accept your answer? As you suggest, rdd3.map(_._2) does the job as well but requires a bit more work.
  • jtitusj
    jtitusj almost 8 years
    .sortBy(c => c._7) won't work, and neither will .sortBy(_._7), since the elements in the rdd have an Array structure, not a tuple structure. @KoenDeCouck, I've posted my answer. You might want to check it out. :)
  • KDC
    KDC almost 8 years
    It didn't; however, @JohnTitusJungao's answer has the solution: rdd.sortBy(_(7)). I'll accept this answer since the question focused on the split after all, and you gave some info on why that didn't work.
  • KDC
    KDC almost 8 years
    Thank you, John! This solution looks like the better way to sort. I'll accept Zahiro's answer, however, because of the way the question was phrased, now appended with your solution. (Upvoted this.)
  • Zahiro Mor
    Zahiro Mor almost 8 years
    Yes, of course. Now I see my mistake. With @JohnTitusJungao's permission I'll edit it for future reference.
  • jtitusj
    jtitusj almost 8 years
    @ZahiroMor, sure! go ahead. glad that helped. :)