Spark: get collection sorted by value
Solution 1
Doing it in a more Pythonic way:
# In descending order.
# The first parameter is the number of elements
# to include in the output.
data.takeOrdered(10, key=lambda x: -x[1])
# In ascending order.
data.takeOrdered(10, key=lambda x: x[1])
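The ordering logic above can be sketched in plain Python, without a Spark cluster, using heapq.nsmallest, which applies the same "smallest N by key" idea that takeOrdered uses; the (word, count) pairs below are made-up sample data:

```python
import heapq

# Hypothetical (word, count) pairs standing in for the RDD's contents.
data = [("a", 3), ("b", 1), ("c", 5), ("d", 2)]

# Descending by count: negate the value, exactly as in the snippet above.
top_desc = heapq.nsmallest(2, data, key=lambda x: -x[1])

# Ascending by count.
top_asc = heapq.nsmallest(2, data, key=lambda x: x[1])

print(top_desc)  # [('c', 5), ('a', 3)]
print(top_asc)   # [('b', 1), ('d', 2)]
```

Negating the value inside the key function is what flips the order, since takeOrdered (like nsmallest) always returns the smallest elements according to the key.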
Solution 2
For those looking to get top N elements ordered by value:
theRDD.takeOrdered(N, key=lambda kv: -len(kv[1]))
if you wish to order by string length. (Note that tuple-unpacking lambdas like lambda (key, value): ... are Python 2 only; in Python 3 you index into the pair instead.)
On the other hand, if the values are already in a form suitable for your desired ordering, then:
theRDD.takeOrdered(N, key=lambda kv: -kv[1])
would suffice.
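The length-based ordering can be checked in plain Python the same way, again with heapq.nsmallest standing in for takeOrdered and made-up (key, value) pairs:

```python
import heapq

# Hypothetical (key, value) pairs where the values are strings.
pairs = [("a", "xx"), ("b", "xxxx"), ("c", "x")]

# Top 2 by descending string length, mirroring key=lambda kv: -len(kv[1]).
top_by_len = heapq.nsmallest(2, pairs, key=lambda kv: -len(kv[1]))

print(top_by_len)  # [('b', 'xxxx'), ('a', 'xx')]
```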
Solution 3
I think you can use the generic sortBy transformation (not an action: it returns an RDD, not an array), documented here.
So in your case, you could do
wordCounts.sortBy(lambda wc: wc[1])
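For descending order, RDD.sortBy also accepts an ascending=False argument. In plain Python the same transformation is just sorted with a key function; the word counts below are toy data standing in for wordCounts:

```python
# Toy (word, count) pairs standing in for the wordCounts RDD.
word_counts = [("spark", 4), ("the", 9), ("rdd", 2)]

# Ascending by count, mirroring wordCounts.sortBy(lambda wc: wc[1]).
by_count = sorted(word_counts, key=lambda wc: wc[1])

# Descending by count; with RDD.sortBy you would pass ascending=False,
# with sorted() the equivalent is reverse=True.
by_count_desc = sorted(word_counts, key=lambda wc: wc[1], reverse=True)

print(by_count)       # [('rdd', 2), ('spark', 4), ('the', 9)]
print(by_count_desc)  # [('the', 9), ('spark', 4), ('rdd', 2)]
```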
Solution 4
You can do it this way (in Scala):
// for reverse order
implicit val sortIntegersByString = new Ordering[Int] {
override def compare(a: Int, b: Int) = a.compare(b)*(-1)
}
counts.collect.toSeq.sortBy(_._2)
So basically, you convert your RDD to a sequence and use its sort method to sort it. The implicit Ordering above globally changes the sort behaviour for Int in this scope, which yields a descending sort order.
Solution 5
The simplest way to sort the output by value: after the reduceByKey, swap each pair so the value becomes the key and the key becomes the value, then apply the sortByKey method, where false sorts in descending order. By default it sorts in ascending order.
val test = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .map(item => item.swap)
  .sortByKey(false)
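The swap-then-sortByKey trick can be sketched in plain Python: swap each pair so the count becomes the key, then sort in descending order (the pairs below are toy data, not the actual README word counts):

```python
# Toy (word, count) pairs, as produced by a word count.
pairs = [("spark", 4), ("the", 9), ("rdd", 2)]

# item.swap: make the count the key.
swapped = [(count, word) for word, count in pairs]

# sortByKey(false) ~ descending sort on the (count, word) key.
result = sorted(swapped, reverse=True)

print(result)  # [(9, 'the'), (4, 'spark'), (2, 'rdd')]
```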
user3702916
Updated on May 17, 2021
Comments
-
user3702916 almost 3 years
I was following this tutorial: http://spark.apache.org/docs/latest/quick-start.html. I first created a collection from a file:
textFile = sc.textFile("README.md")
Then I tried a command to count the words:
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
To print the collection:
wordCounts.collect()
I found out how to sort it by word using the sortByKey command. I was wondering how to do the same thing sorting by the value, which in this case is the number of times a word occurs in the document.
-
Nick Chammas almost 10 years: You are collecting all the results back to the driver and sorting there. It will work, but only if your result set is relatively small. For a solution that works at scale, see eliasah's solution.
-
Shirish Kumar over 9 years: This does not address the problem for large data. If the data is small, why do you need Spark at all?
-
ted over 7 years: This answer would need some explanation... You can't just drop some code: stackoverflow.com/help/how-to-answer
-
daphtdazz over 7 years: Just posting an equation isn't very helpful unless you explain what it's doing.
-
Nico Haase about 3 years: Please add some explanation to your answer such that others can learn from it.