Spark sort by key and then group by to get ordered iterable?

sorting apache-spark

13,272

The answer from Matei, who I consider authoritative on this topic, is quite clear:

The order is not guaranteed actually, only which keys end up in each partition. Reducers may fetch data from map tasks in an arbitrary order, depending on which ones are available first. If you’d like a specific order, you should sort each partition. Here you might be getting it because each partition only ends up having one element, and collect() does return the partitions in order.

In that context, a better option would be to apply the sorting to the resulting collections per key:

rdd.groupByKey().mapValues(_.sorted)

13,272

Author by

Ben

Updated on June 15, 2022

Comments

Ben about 5 hours

I have a Pair RDD (K, V) with the key containing a time and an ID. I would like to get a Pair RDD of the form (K, Iterable<V>) where the keys are groupped by id and the iterable is ordered by time.

I'm currently using sortByKey().groupByKey() and my tests seem to prove it works, however I'm reading that it may not always be the case, as discussed in this question with diverging answers ( Does groupByKey in Spark preserve the original order? ).

Is it correct or not?

Thanks!
Marko Bonaci about 7 years

Right, depends on the dataset (number of duplicate keys), but it's better to do the sorting on less "rows", after they have already been collapsed by grouping.
maasg about 7 years

@MarkoBonaci that's what's going on here. After groupByKey the resulting grouping is sorted to meet the requirements in the question. I'm not sure what the comment is about. Could you clarify?
Marko Bonaci about 7 years

I was just confirming your last sentence and trying to explain the reason why that's better. We're cool :)
Sohaib about 7 years

@maasg It is written in spark docs to avoid groupByKey if possible. However it seems to me that to sort within a group this seems to be the only alternative. Is there some other way in which we can achieve the same?