Spark get collection sorted by value


Solution 1

Doing it in a more Pythonic way:

# The first parameter is the number of elements to return.
# In descending order:
data.takeOrdered(10, key=lambda x: -x[1])
# In ascending order:
data.takeOrdered(10, key=lambda x: x[1])
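Outside Spark, `takeOrdered(n, key=...)` behaves like Python's `heapq.nsmallest`, so the key trick above can be checked locally. A minimal sketch, with hypothetical `(word, count)` pairs standing in for the RDD's contents:

```python
import heapq

# Hypothetical (word, count) pairs standing in for the RDD's contents.
data = [("spark", 5), ("rdd", 2), ("sort", 9), ("value", 1)]

# Descending by count: negate the value, exactly as in the answer above.
top_desc = heapq.nsmallest(2, data, key=lambda x: -x[1])

# Ascending by count.
top_asc = heapq.nsmallest(2, data, key=lambda x: x[1])

print(top_desc)  # [('sort', 9), ('spark', 5)]
print(top_asc)   # [('value', 1), ('rdd', 2)]
```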

Solution 2

For those looking to get the top N elements ordered by value:

theRDD.takeOrdered(N, key=lambda kv: -1 * len(kv[1]))  # tuple-unpacking lambdas were removed in Python 3

if you wish to order by string length.

On the other hand, if the values are already in a form suitable for your desired ordering, then:

theRDD.takeOrdered(N, key=lambda kv: -1 * kv[1])

would suffice.
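The same two key functions can be exercised on plain Python pairs with the built-in `sorted`; the sample data below is hypothetical:

```python
# Hypothetical pairs whose values are strings.
pairs = [("a", "xxx"), ("b", "x"), ("c", "xx")]

# Order by string length, longest first (mirrors -1 * len(value)).
by_len = sorted(pairs, key=lambda kv: -len(kv[1]))

# Hypothetical numeric values, largest first (mirrors -1 * value).
nums = [("a", 3), ("b", 7), ("c", 1)]
by_val = sorted(nums, key=lambda kv: -kv[1])

print(by_len)  # [('a', 'xxx'), ('c', 'xx'), ('b', 'x')]
print(by_val)  # [('b', 7), ('a', 3), ('c', 1)]
```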

Solution 3

I think you can use the generic sortBy transformation (not an action, i.e., it returns an RDD rather than an array), documented here.

So in your case, you could do

wordCounts.sortBy(lambda wc: wc[1])  # tuple-unpacking lambdas were removed in Python 3
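sortBy also accepts an ascending flag for descending output. Locally, the same orderings can be checked with `sorted`; the counts below are hypothetical:

```python
# Hypothetical word counts, as the RDD would hold them.
word_counts = [("the", 12), ("spark", 4), ("a", 7)]

# Local analogue of wordCounts.sortBy(lambda wc: wc[1]).
ascending = sorted(word_counts, key=lambda wc: wc[1])

# Local analogue of wordCounts.sortBy(lambda wc: wc[1], ascending=False).
descending = sorted(word_counts, key=lambda wc: wc[1], reverse=True)

print(ascending)   # [('spark', 4), ('a', 7), ('the', 12)]
print(descending)  # [('the', 12), ('a', 7), ('spark', 4)]
```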

Solution 4

You can do it this way:

// Implicit ordering that reverses Int comparison, giving a descending (reverse) sort.
implicit val sortIntegersByString = new Ordering[Int] {
    override def compare(a: Int, b: Int) = -a.compare(b)
}

counts.collect.toSeq.sortBy(_._2)

So basically you convert your RDD to a Seq and use its sortBy method to sort it.

The implicit above changes the Int ordering in scope, which is what makes the sort descending.
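In Python, the same collect-then-sort approach needs no implicit Ordering: passing reverse=True to `sorted` gives the descending order directly. A minimal sketch with hypothetical collected data:

```python
# Hypothetical result of counts.collect(): (word, count) pairs on the driver.
counts = [("b", 2), ("a", 5), ("c", 1)]

# Sort the collected pairs by value, descending, on the driver.
result = sorted(counts, key=lambda kv: kv[1], reverse=True)

print(result)  # [('a', 5), ('b', 2), ('c', 1)]
```

As the comments below note, sorting on the driver only works when the result set is small enough to collect.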

Solution 5

The simplest way to sort the output by values: after the reduceByKey, swap each pair (key becomes value, value becomes key) and then apply the sortByKey method, where false sorts in descending order. By default it sorts in ascending order.

 val test = textFile.flatMap(line => line.split(" "))
   .map(word => (word, 1))
   .reduceByKey(_ + _)
   .map(item => item.swap)
   .sortByKey(false)
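The swap-then-sortByKey trick translates to plain Python as swapping each pair and sorting on the new key; the sample data here is hypothetical:

```python
# Hypothetical (word, count) pairs after reduceByKey.
counts = [("hello", 3), ("world", 1), ("spark", 5)]

# Swap to (count, word), then sort by the new key descending,
# as sortByKey(false) does in the Scala answer above.
swapped = [(v, k) for k, v in counts]
result = sorted(swapped, reverse=True)

print(result)  # [(5, 'spark'), (3, 'hello'), (1, 'world')]
```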
Author: user3702916

Updated on May 17, 2021

Comments

  • user3702916, almost 3 years ago

    I was trying this tutorial http://spark.apache.org/docs/latest/quick-start.html I first created a collection from a file

    textFile = sc.textFile("README.md")
    

    Then I tried a command to count the words:

    wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
    

    To print the collection:

     wordCounts.collect()
    

    I found how to sort it by word using the command sortByKey. I was wondering how to do the same thing sorting by value, which in this case is the number of times a word occurs in the document.

  • Nick Chammas, almost 10 years ago
    You are collecting all the results back to the driver and sorting there. It will work, but only if your result set is relatively small. For a solution that works at scale, see eliasah's solution.
  • Shirish Kumar, over 9 years ago
    This does not address the problem for large data. If the data is small, why do you need Spark at all?
  • ted, over 7 years ago
    This answer needs some explanation... You can't just drop some code: stackoverflow.com/help/how-to-answer
  • daphtdazz, over 7 years ago
    Just posting an equation isn't very helpful unless you explain what it's doing.
  • Nico Haase, about 3 years ago
    Please add some explanation to your answer so that others can learn from it.
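For reference, the word-count pipeline in the question maps onto plain Python as a Counter followed by a value sort (the text here is a hypothetical stand-in for the README contents):

```python
from collections import Counter

# Hypothetical stand-in for the file's contents.
text = "to be or not to be"

# flatMap(split) + map(word -> 1) + reduceByKey(+) collapses to a Counter.
word_counts = Counter(text.split())

# Sort by value (occurrence count), descending.
result = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)

print(result)  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```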