How to sum values in an iterator in a PySpark groupByKey()
Solution 1
You can simply mapValues with sum:

example.groupByKey().mapValues(sum)

although in this particular case reduceByKey is much more efficient:

example.reduceByKey(lambda x, y: x + y)

or

from operator import add

example.reduceByKey(add)
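To see why both forms give the same totals, here is a minimal pure-Python sketch of what the two operations compute on the example RDD (illustration only; real Spark distributes the work across partitions, and reduceByKey combines values on each partition before shuffling, which is why it is more efficient):

```python
from collections import defaultdict
from operator import add

pairs = [('x', 1), ('x', 1), ('y', 1), ('z', 1)]

# groupByKey + mapValues(sum): collect every value per key, then sum the list
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
grouped_then_summed = {k: sum(vs) for k, vs in groups.items()}

# reduceByKey(add): fold values pairwise, never materializing the full list
reduced = {}
for k, v in pairs:
    reduced[k] = add(reduced[k], v) if k in reduced else v

print(grouped_then_summed)  # {'x': 2, 'y': 1, 'z': 1}
print(reduced)              # {'x': 2, 'y': 1, 'z': 1}
```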
Solution 2
It is a bit late, but I just found this solution. You can also do it this way (note that tuple-unpacking lambdas like lambda (x,y): ... only work in Python 2; in Python 3 you must index into the pair):

wordCountsGrouped = wordsGrouped.groupByKey().map(lambda kv: (kv[0], sum(kv[1])))
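Python 3 removed tuple-parameter unpacking (PEP 3113), so a lambda written as lambda (x, y): ... raises a SyntaxError there. A small sketch of the equivalent indexing rewrite, using a hypothetical pair_sum helper:

```python
# Python 2 allowed `lambda (x, y): (x, sum(y))`; Python 3 (PEP 3113) does not.
# Index into the key-value tuple instead:
pair_sum = lambda kv: (kv[0], sum(kv[1]))

print(pair_sum(('x', [1, 1])))  # ('x', 2)
```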
Author: Leonida Gianfagna
Updated on June 05, 2022

Comments
-
Leonida Gianfagna about 2 years
I'm taking my first steps with Spark (Python) and I'm struggling with the iterator inside a groupByKey(). I'm not able to sum the values. My code looks like this:

example = sc.parallelize([('x', 1), ('x', 1), ('y', 1), ('z', 1)])
example.groupByKey()

x [1,1] y [1] z [1]

How can I sum over the Iterator? I tried something like the following, but it does not work:

example.groupByKey().map(lambda (x, iterator): (x, sum(iterator)))
example.groupByKey().map(lambda (x, iterator): (x, list(sum(iterator))))
-
Kent Wong over 4 yearsAre you sure this works as intended? From my understanding, you're simply creating a lambda function that pairs X with the length of the iterator; it doesn't sum the iterator. I'm learning Spark and going through a tutorial at the moment.
-
PyWalker2797 almost 4 yearsCan you elaborate on your solution?