How to use map() to convert (key, value) pairs to values only in PySpark


Solution 1

Finally I got the answer; it's like this:

(wordCounts
 .map(lambda x: x[1])          # keep only the value from each (key, value) pair
 .reduce(lambda x, y: x + y))  # sum those values
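
For context, a minimal end-to-end sketch of the whole pipeline (assuming a SparkContext already bound to sc, as in the question):

    wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
    wordsRDD = sc.parallelize(wordsList, 4)

    # Pair each word with 1, then sum the 1s per key.
    wordPairs = wordsRDD.map(lambda w: (w, 1))
    wordCounts = wordPairs.reduceByKey(lambda x, y: x + y)

    # map() keeps only the value from each (key, value) pair;
    # reduce() then sums those values across the RDD.
    totalCount = (wordCounts
                  .map(lambda x: x[1])
                  .reduce(lambda x, y: x + y))
    print(totalCount)  # 5

The average the question asks for would then be totalCount / float(wordCounts.count()), since wordCounts.count() is the number of unique words.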

Solution 2

Yes, your lambda function in .map takes a tuple x as its argument and returns the second element via x[1] (index 1 of the tuple). You could also unpack the tuple directly in the argument list and return the second element, as follows:

.map(lambda (x, y): y)
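
Note that tuple unpacking in lambda parameters works only in Python 2; Python 3 removed it (PEP 3113). Two equivalent Python 3 forms, sketched for reference:

    from operator import itemgetter

    wordCounts.map(lambda kv: kv[1])  # index into the (key, value) tuple
    wordCounts.map(itemgetter(1))     # same result without a lambda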
Author by user2090166

Updated on July 09, 2022

Comments

  • user2090166 almost 2 years

    I have this code in PySpark to count words.

    wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
    wordsRDD = sc.parallelize(wordsList, 4)

    wordPairs = wordsRDD.map(lambda w: (w, 1))  # pair each word with 1
    wordCounts = wordPairs.reduceByKey(lambda x, y: x + y)
    print wordCounts.collect()
    
    #PRINTS-->  [('rat', 2), ('elephant', 1), ('cat', 2)]
    
    from operator import add
    totalCount = (wordCounts
                  .map(<< FILL IN >>)
                  .reduce(<< FILL IN >>))
    
    #SHOULD PRINT 5
    
    # (wordCounts.values().sum()) does the trick, but I want to do this with map() and reduce()
    
    
    I need to use a reduce() action to sum the counts in wordCounts and then divide by the number of unique words.
    

    * But first I need to map() the pair RDD wordCounts, which consists of (key, value) pairs, to an RDD of values.

    This is where I am stuck. I tried the approaches below, but neither works:

    .map(lambda x:x.values())
    .reduce(lambda x:sum(x)))
    
    AND,
    
    .map(lambda d:d[k] for k in d)
    .reduce(lambda x:sum(x)))
    

    Any help in this would be highly appreciated!
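
Both attempts fail for the same reason: each element of wordCounts is a plain tuple, which has no .values() method, and reduce() expects a two-argument function rather than a one-argument one. With the add already imported from operator, a working fill-in (matching Solution 1 above) would be:

    totalCount = (wordCounts
                  .map(lambda x: x[1])  # tuple indexing, not .values()
                  .reduce(add))         # add(x, y) returns x + y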