How to sort by value efficiently in PySpark?


Solution 1

I think sortBy() is more concise:

b = sc.parallelize([('t', 3), ('b', 4), ('c', 1)])
bSorted = b.sortBy(lambda a: a[1])
bSorted.collect()
# [('c', 1), ('t', 3), ('b', 4)]

It isn't actually any more efficient, since under the hood it still involves keying by the values, sorting by the keys, and then grabbing the values, but it reads better than your latter solution. In terms of efficiency, I don't think you'll find anything better: one way or another you need to transform your data so that the values become your keys (and then eventually transform it back to the original schema).
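
For what it's worth, sortBy is essentially shorthand for exactly that key-swap pattern; here is a rough sketch of the equivalence (an illustration, not the actual Spark implementation):

b = sc.parallelize([('t', 3), ('b', 4), ('c', 1)])

# sortBy(keyfunc) behaves roughly like keyBy + sortByKey + values
manual = b.keyBy(lambda a: a[1]).sortByKey().values()
concise = b.sortBy(lambda a: a[1])

manual.collect()   # [('c', 1), ('t', 3), ('b', 4)]
concise.collect()  # [('c', 1), ('t', 3), ('b', 4)]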

Solution 2

Just wanted to add this tip, which helped me out a lot.

Ascending:

bSorted = b.sortBy(lambda a: a[1])

Descending:

bSorted = b.sortBy(lambda a: -a[1])
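
Negating the key only works for numeric values; sortBy also takes an ascending flag, which covers non-numeric keys as well (a minimal sketch, using the same b as above):

bSorted = b.sortBy(lambda a: a[1], ascending=False)
bSorted.collect()
# [('b', 4), ('t', 3), ('c', 1)]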

Original question (makansij)

    I want to sort my (K, V) tuples by V, i.e. by the value. I know that takeOrdered is good for this if you know how many you need:

    b = sc.parallelize([('t',3),('b',4),('c',1)])
    

    Using takeOrdered:

    b.takeOrdered(3,lambda atuple: atuple[1])
    

    Using map and sortByKey:

    b.map(lambda aTuple: (aTuple[1], aTuple[0])).sortByKey().map(
        lambda aTuple: (aTuple[1], aTuple[0])).collect()  # swap back to (key, value)
    

    I've checked out the question here, which suggests the latter. I find it hard to believe that takeOrdered is so succinct and yet requires the same number of operations as the map/sortByKey solution.

    Does anyone know of a simpler, more concise transformation in Spark to sort by value?
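
    One difference worth keeping in mind when weighing the two (a small sketch, using the same b as above): takeOrdered is an action that returns a plain Python list to the driver, whereas the sortByKey route keeps the data as an RDD until you call collect(), so only the latter can feed further distributed transformations.

    # takeOrdered is an action: the sorted result comes back as a list on the driver
    b.takeOrdered(3, key=lambda aTuple: aTuple[1])       # [('c', 1), ('t', 3), ('b', 4)]

    # sortByKey is a transformation: the result stays an RDD for further distributed work
    bByValue = b.map(lambda aTuple: (aTuple[1], aTuple[0])).sortByKey()
    bByValue.values().collect()                          # ['c', 't', 'b']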