PySpark - sortByKey() method to return values from k,v pairs in their original order


If by "original order" you mean order of the keys then all you have to do is add map after the sort:

myRDD.sortByKey(ascending=True).map(lambda kv: kv[1]).collect()

or call the values method:

myRDD.sortByKey(ascending=True).values().collect()

If you mean the order of the values in the structure used to create the initial RDD, then it is impossible without storing additional information. RDDs are unordered, unless you explicitly apply transformations like sortBy.
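Put together, a minimal runnable sketch (assuming an existing SparkContext named sc and the sample data from the question) looks like this:

# Assumes `sc` is an already-created SparkContext.
myRDD = sc.parallelize([(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)])

# Sort by key, keep only the values, and bring them back to the driver.
values = myRDD.sortByKey(ascending=True).values().collect()
print(values)
# [2582, 3222, 4190, 2502, 2537]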


Comments

  • lagunazul almost 2 years

    I need to be able to return a list of values from (key,value) pairs from an RDD while maintaining original order.

    I've included my workaround below but I'd like to be able to do it all in one go.

    Something like:

    myRDD = [(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)]
    values = myRDD.<insert PySpark method(s)>
    print values
    >>>[2582, 3222, 4190, 2502, 2537]
    

    My workaround:

    myRDD = [(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)]
    
    values = []
    for item in myRDD.sortByKey(True).collect():
        values.append(item[1])
    print values
    >>>[2582, 3222, 4190, 2502, 2537]
    

    Thanks!