Spark get collection sorted by value


Solution 1

Doing it in a more Pythonic way:

# The first parameter is the number of elements to return.
# In descending order:
data.takeOrdered(10, key=lambda x: -x[1])
# In ascending order:
data.takeOrdered(10, key=lambda x: x[1])
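Outside Spark, `takeOrdered(n, key=...)` behaves like Python's `heapq.nsmallest`, so the key trick above can be checked locally. A minimal sketch, with hypothetical `(word, count)` pairs standing in for the RDD's contents:

```python
import heapq

# Hypothetical (word, count) pairs standing in for the RDD's contents.
data = [("spark", 5), ("rdd", 2), ("sort", 9), ("value", 1)]

# Descending by count: negate the value, exactly as in the answer above.
top_desc = heapq.nsmallest(2, data, key=lambda x: -x[1])

# Ascending by count.
top_asc = heapq.nsmallest(2, data, key=lambda x: x[1])

print(top_desc)  # [('sort', 9), ('spark', 5)]
print(top_asc)   # [('value', 1), ('rdd', 2)]
```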

Solution 2

For those looking to get the top N elements ordered by value:

theRDD.takeOrdered(N, key=lambda kv: -1 * len(kv[1]))  # tuple-unpacking lambdas were removed in Python 3

if you wish to order by string length.

On the other hand, if the values are already in a form suitable for your desired ordering, then:

theRDD.takeOrdered(N, key=lambda kv: -1 * kv[1])

would suffice.
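The same two key functions can be exercised on plain Python pairs with the built-in `sorted`; the sample data below is hypothetical:

```python
# Hypothetical pairs whose values are strings.
pairs = [("a", "xxx"), ("b", "x"), ("c", "xx")]

# Order by string length, longest first (mirrors -1 * len(value)).
by_len = sorted(pairs, key=lambda kv: -len(kv[1]))

# Hypothetical numeric values, largest first (mirrors -1 * value).
nums = [("a", 3), ("b", 7), ("c", 1)]
by_val = sorted(nums, key=lambda kv: -kv[1])

print(by_len)  # [('a', 'xxx'), ('c', 'xx'), ('b', 'x')]
print(by_val)  # [('b', 7), ('a', 3), ('c', 1)]
```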

Solution 3

I think you can use the generic sortBy transformation (not an action, i.e., it returns an RDD rather than an array), documented here.

So in your case, you could do

wordCounts.sortBy(lambda wc: wc[1])  # tuple-unpacking lambdas were removed in Python 3
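sortBy also accepts an ascending flag for descending output. Locally, the same orderings can be checked with `sorted`; the counts below are hypothetical:

```python
# Hypothetical word counts, as the RDD would hold them.
word_counts = [("the", 12), ("spark", 4), ("a", 7)]

# Local analogue of wordCounts.sortBy(lambda wc: wc[1]).
ascending = sorted(word_counts, key=lambda wc: wc[1])

# Local analogue of wordCounts.sortBy(lambda wc: wc[1], ascending=False).
descending = sorted(word_counts, key=lambda wc: wc[1], reverse=True)

print(ascending)   # [('spark', 4), ('a', 7), ('the', 12)]
print(descending)  # [('the', 12), ('a', 7), ('spark', 4)]
```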

Solution 4

You can do it this way:

// Implicit ordering that reverses Int comparison, giving a descending (reverse) sort.
implicit val sortIntegersByString = new Ordering[Int] {
    override def compare(a: Int, b: Int) = -a.compare(b)
}

counts.collect.toSeq.sortBy(_._2)

So basically you convert your RDD to a Seq and use its sortBy method to sort it.

The implicit above changes the Int ordering in scope, which is what makes the sort descending.
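In Python, the same collect-then-sort approach needs no implicit Ordering: passing reverse=True to `sorted` gives the descending order directly. A minimal sketch with hypothetical collected data:

```python
# Hypothetical result of counts.collect(): (word, count) pairs on the driver.
counts = [("b", 2), ("a", 5), ("c", 1)]

# Sort the collected pairs by value, descending, on the driver.
result = sorted(counts, key=lambda kv: kv[1], reverse=True)

print(result)  # [('a', 5), ('b', 2), ('c', 1)]
```

As the comments below note, sorting on the driver only works when the result set is small enough to collect.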

Solution 5

The simplest way to sort the output by values: after the reduceByKey, swap each pair (key becomes value, value becomes key) and then apply the sortByKey method, where false sorts in descending order. By default it sorts in ascending order.

 val test = textFile.flatMap(line => line.split(" "))
   .map(word => (word, 1))
   .reduceByKey(_ + _)
   .map(item => item.swap)
   .sortByKey(false)
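The swap-then-sortByKey trick translates to plain Python as swapping each pair and sorting on the new key; the sample data here is hypothetical:

```python
# Hypothetical (word, count) pairs after reduceByKey.
counts = [("hello", 3), ("world", 1), ("spark", 5)]

# Swap to (count, word), then sort by the new key descending,
# as sortByKey(false) does in the Scala answer above.
swapped = [(v, k) for k, v in counts]
result = sorted(swapped, reverse=True)

print(result)  # [(5, 'spark'), (3, 'hello'), (1, 'world')]
```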
Author: user3702916

Updated on May 17, 2021

Comments

  • user3702916, almost 3 years ago

    I was trying this tutorial http://spark.apache.org/docs/latest/quick-start.html I first created a collection from a file

    textFile = sc.textFile("README.md")
    

    Then I tried a command to count the words:

    wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
    

    To print the collection:

     wordCounts.collect()
    

    I found how to sort it by word using the command sortByKey. I was wondering how to do the same thing sorting by value, which in this case is the number of times a word occurs in the document.

  • Nick Chammas, almost 10 years ago
    You are collecting all the results back to the driver and sorting there. It will work, but only if your result set is relatively small. For a solution that works at scale, see eliasah's solution.
  • Shirish Kumar, over 9 years ago
    This does not address the problem for large data. If the data is small, why do you need Spark at all?
  • ted, over 7 years ago
    This answer needs some explanation... You can't just drop some code: stackoverflow.com/help/how-to-answer
  • daphtdazz, over 7 years ago
    Just posting an equation isn't very helpful unless you explain what it's doing.
  • Nico Haase, about 3 years ago
    Please add some explanation to your answer so that others can learn from it.
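For reference, the word-count pipeline in the question maps onto plain Python as a Counter followed by a value sort (the text here is a hypothetical stand-in for the README contents):

```python
from collections import Counter

# Hypothetical stand-in for the file's contents.
text = "to be or not to be"

# flatMap(split) + map(word -> 1) + reduceByKey(+) collapses to a Counter.
word_counts = Counter(text.split())

# Sort by value (occurrence count), descending.
result = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)

print(result)  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```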