How to sum values in an iterator in a PySpark groupByKey()


Solution 1

You can simply use mapValues with sum:

example.groupByKey().mapValues(sum)
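If you don't have a Spark cluster handy, the semantics of `groupByKey().mapValues(sum)` can be sketched in plain Python on the question's data. The `group_by_key` helper below is a hypothetical stand-in for the RDD operation, not part of PySpark:

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect all values per key, mimicking RDD.groupByKey() (hypothetical helper)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

example = [('x', 1), ('x', 1), ('y', 1), ('z', 1)]

# mapValues(sum): apply sum to each key's collected values
sums = {key: sum(values) for key, values in group_by_key(example).items()}
print(sums)  # {'x': 2, 'y': 1, 'z': 1}
```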

although in this particular case reduceByKey is much more efficient:

example.reduceByKey(lambda x, y: x + y)

or

from operator import add

example.reduceByKey(add)
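The efficiency difference comes from where the summing happens: groupByKey ships every individual value across the network before anything is summed, while reduceByKey combines values within each partition first and only shuffles the partial sums. A rough plain-Python sketch of that local pre-aggregation, with a hypothetical two-partition layout of the question's data:

```python
from collections import defaultdict
from operator import add

# Hypothetical partition layout of the key/value pairs
partitions = [
    [('x', 1), ('x', 1)],
    [('y', 1), ('z', 1), ('x', 1)],
]

def combine_locally(partition):
    """Per-partition pre-aggregation, as reduceByKey does before the shuffle."""
    acc = defaultdict(int)
    for key, value in partition:
        acc[key] = add(acc[key], value)
    return dict(acc)

# Merge the (much smaller) per-partition results -- the "shuffle" step
merged = defaultdict(int)
for local in map(combine_locally, partitions):
    for key, value in local.items():
        merged[key] = add(merged[key], value)

print(dict(merged))  # {'x': 3, 'y': 1, 'z': 1}
```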

Solution 2

You can also do it this way:

wordCountsGrouped = wordsGrouped.groupByKey().map(lambda kv: (kv[0], sum(kv[1])))

A bit late, but I just found this solution.


Author: Leonida Gianfagna

Updated on June 05, 2022

Comments

  • Leonida Gianfagna
    Leonida Gianfagna about 2 years

I'm taking my first steps with Spark (Python) and I'm struggling with the iterator inside a groupByKey(). I'm not able to sum the values. My code looks like this:

    example = sc.parallelize([('x',1), ('x',1), ('y', 1), ('z', 1)])
    
    example.groupByKey()
    
    x [1,1]
    y [1]
    z [1]
    

How can I sum over the iterator? I tried the attempts below, but they don't work:

    example.groupByKey().map(lambda (x,iterator): (x, sum(iterator)))
    example.groupByKey().map(lambda (x,iterator): (x, list(sum(iterator))))
    
  • Kent Wong
    Kent Wong over 4 years
    Are you sure this works as intended? From my understanding, you're simply creating a lambda function that pairs X with the length of the iterator; it doesn't sum the values. I'm learning Spark and going through a tutorial at the moment.
  • PyWalker2797
    PyWalker2797 almost 4 years
    Can you elaborate on your solution?
