Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)?


Solution 1

reduce vs foldLeft

A big, big difference, not clearly mentioned in any other Stack Overflow answer on this topic, is that reduce should be given a commutative monoid, i.e. an operation that is both commutative and associative. This means the operation can be parallelized.

This distinction is very important for Big Data / MPP / distributed computing, and the entire reason why reduce even exists. The collection can be chopped up and the reduce can operate on each chunk, then the reduce can operate on the results of each chunk - in fact the level of chunking need not stop one level deep. We could chop up each chunk too. This is why summing integers in a list is O(log N) if given an infinite number of CPUs.
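Here is a minimal sketch of that divide-and-combine idea in plain Scala (chunkedReduce and the chunk size are purely illustrative, not any library's API):

// Illustrative only: reduce a collection by chunking it, reducing each chunk
// (potentially on a different CPU / machine), then reducing the chunk results.
// Associativity guarantees the same answer as a sequential reduce; commutativity
// lets the chunk results be combined in any order.
def chunkedReduce[A](xs: Vector[A], chunkSize: Int)(op: (A, A) => A): A =
  if (xs.size <= chunkSize) xs.reduce(op)
  else {
    val chunkResults = xs.grouped(chunkSize).map(_.reduce(op)).toVector
    chunkedReduce(chunkResults, chunkSize)(op) // keep chunking the results
  }

// e.g. chunkedReduce((1 to 1000000).toVector, 1000)(_ + _)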

If you just look at the signatures there is no reason for reduce to exist, because you can achieve everything you can with reduce with a foldLeft. The functionality of foldLeft is greater than the functionality of reduce.
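Indeed, ignoring empty collections, reduce can be written as a foldLeft. A sketch (not how the standard library actually implements it):

// reduce expressed via foldLeft: seed with the head, fold over the tail.
// Throws on an empty list, just as reduce does.
def reduceViaFoldLeft[A](xs: List[A])(op: (A, A) => A): A = xs match {
  case head :: tail => tail.foldLeft(head)(op)
  case Nil          => throw new UnsupportedOperationException("empty.reduce")
}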

But you cannot parallelize a foldLeft, so its runtime is always O(N) (even if you feed in a commutative monoid). This is because it's assumed the operation is not a commutative monoid and so the cumulated value will be computed by a series of sequential aggregations.

foldLeft assumes neither commutativity nor associativity. It's associativity that gives the ability to chop up the collection, and it's commutativity that makes cumulating easy, because order is not important (so it doesn't matter in which order the results from each of the chunks are aggregated). Strictly speaking, commutativity is not necessary for parallelization (consider, for example, distributed sorting algorithms); it just makes the logic easier because you don't need to give your chunks an ordering.
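For example, string concatenation is associative but not commutative, yet Scala's parallel collections still reduce it deterministically, because they combine chunk results in the chunks' original order (only associativity is required there):

// Associative but NOT commutative operation, still deterministic under .par.reduce,
// because parallel collections combine chunk results in order.
// (On Scala 2.13+ this needs the scala-parallel-collections module and
// import scala.collection.parallel.CollectionConverters._)
val letters = ('a' to 'z').map(_.toString)
println(letters.par.reduce(_ + _)) // always "abcdefghijklmnopqrstuvwxyz"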

If you have a look at the Spark documentation for reduce it specifically says "... commutative and associative binary operator"

http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD

Here is proof that reduce is NOT just a special case of foldLeft

scala> import scala.collection.parallel.immutable.ParSeq

scala> val intParList: ParSeq[Int] = (1 to 100000).map(_ => scala.util.Random.nextInt()).par

scala> timeMany(1000, intParList.reduce(_ + _))
Took 462.395867 milli seconds

scala> timeMany(1000, intParList.foldLeft(0)(_ + _))
Took 2589.363031 milli seconds
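(timeMany is not defined in the post; a rough sketch of such a timing helper, assumed rather than taken from the original, could be:)

// Hypothetical helper: evaluate a by-name block n times and print the total wall-clock time.
def timeMany[A](n: Int, block: => A): Unit = {
  val start = System.nanoTime()
  (1 to n).foreach(_ => block)
  println(s"Took ${(System.nanoTime() - start) / 1e6} milli seconds")
}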

reduce vs fold

Now this is where it gets a little closer to the FP / mathematical roots, and a little trickier to explain. Reduce is defined formally as part of the MapReduce paradigm, which deals with orderless collections (multisets); fold is formally defined in terms of recursion (see catamorphism) and thus assumes a structure / sequence to the collections.
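To make the "defined in terms of recursion" point concrete, here is the textbook list catamorphism written out as structural recursion (essentially what foldRight does; a sketch, not the standard library source):

// The list catamorphism: replace Nil with z and every cons cell with f.
// The recursion follows the structure of the list, so a sequence / ordering
// of the elements is assumed.
def cata[A, B](xs: List[A])(z: B)(f: (A, B) => B): B = xs match {
  case Nil          => z
  case head :: tail => f(head, cata(tail)(z)(f))
}

// cata(List(1, 2, 3))(0)(_ + _) == 1 + (2 + (3 + 0)) == 6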

There is no fold method in Scalding because, under the (strict) MapReduce programming model, we cannot define fold: chunks do not have an ordering, and fold only requires associativity, not commutativity.

Put simply, reduce works without an order of cumulation, whereas fold requires an order of cumulation, and it is that order of cumulation (which necessitates a zero value), NOT the existence of the zero value itself, that distinguishes them. Strictly speaking, reduce should work on an empty collection, because its zero value can be deduced by taking an arbitrary value x and then solving x op y = x, but that doesn't work with a non-commutative operation, as there can exist a left and a right zero value that are distinct (i.e. x op y != y op x). Of course Scala doesn't bother to work out what this zero value is, as that would require doing some mathematics (which is probably uncomputable), so it just throws an exception.
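Concretely, in Scala:

// fold has a zero value, so it is happy with an empty collection;
// reduce has no zero value to return, so it throws.
List.empty[Int].fold(0)(_ + _)     // 0
List.empty[Int].foldLeft(0)(_ + _) // 0
// List.empty[Int].reduce(_ + _)   // throws java.lang.UnsupportedOperationException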

It seems (as is often the case in etymology) that this original mathematical meaning has been lost, since the only obvious difference in programming is the signature. The result is that reduce has become a synonym for fold, rather than preserving its original meaning from MapReduce. Now these terms are often used interchangeably and behave the same in most implementations (ignoring empty collections). Weirdness is exacerbated by peculiarities, like in Spark, that we shall now address.

So Spark does have a fold, but the order in which sub results (one for each partition) are combined (at the time of writing) is the same order in which tasks are completed - and thus non-deterministic. Thanks to @CafeFeed for pointing out that fold uses runJob; after reading through the code I realised that it's non-deterministic. Further confusion is created by Spark having a treeReduce but no treeFold.

Conclusion

There is a difference between reduce and fold even when applied to non-empty sequences. The former is defined as part of the MapReduce programming paradigm on collections with arbitrary order (http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf), and one ought to assume operators are commutative, in addition to being associative, to give deterministic results. The latter is defined in terms of catamorphisms and requires that the collections have a notion of sequence (or are defined recursively, like linked lists), and thus does not require commutative operators.

In practice, due to the unmathematical nature of programming, reduce and fold tend to behave in the same way, either correctly (as in Scala) or incorrectly (as in Spark).

Extra: My Opinion On the Spark API

My opinion is that confusion would be avoided if use of the term fold was completely dropped in Spark. At least Spark does have a note in its documentation:

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala.

Solution 2

If I am not mistaken, even though the Spark API does not require it, fold also requires f to be commutative, because the order in which the partitions will be aggregated is not assured. For example, in the following code only the first print out is sorted:

import org.apache.spark.{SparkConf, SparkContext}

object FoldExample extends App {

  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("Simple Application")
  implicit val sc = new SparkContext(conf)

  val range = ('a' to 'z').map(_.toString)
  val rdd = sc.parallelize(range)

  println(range.reduce(_ + _))   // local collection: deterministic, in order
  println(rdd.reduce(_ + _))     // RDD: partition results combined as tasks finish
  println(rdd.fold("")(_ + _))   // same non-deterministic combination order
}

Print out:

abcdefghijklmnopqrstuvwxyz

abcghituvjklmwxyzqrsdefnop

defghinopjklmqrstuvabcwxyz

Solution 3

One other difference for Scalding is the use of combiners in Hadoop.

Imagine your operation is a commutative monoid: with reduce it will also be applied on the map side, instead of shuffling/sorting all data to the reducers. With foldLeft this is not the case.

pipe.groupBy('product) {
   _.reduce('price -> 'total){ (sum: Double, price: Double) => sum + price }
   // reduce is .mapReduceMap in disguise
}

pipe.groupBy('product) {
   _.foldLeft('price -> 'total)(0.0){ (sum: Double, price: Double) => sum + price }
}

It is always good practice to define your operations as a monoid in Scalding.
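What "defining your operation as a monoid" can look like, sketched here with Algebird's Monoid (the Total type and this particular instance are made up for illustration; check the Scalding / Algebird docs for how a given operation picks the instance up):

import com.twitter.algebird.Monoid

// Hypothetical value type with an explicit zero and an associative,
// commutative plus - i.e. a commutative monoid.
case class Total(value: Double)

implicit val totalMonoid: Monoid[Total] = new Monoid[Total] {
  def zero: Total = Total(0.0)
  def plus(l: Total, r: Total): Total = Total(l.value + r.value)
}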


Comments

  • samthebest
    samthebest almost 2 years

    Why do Scala and frameworks like Spark and Scalding have both reduce and foldLeft? So then what's the difference between reduce and fold?

  • kiritsuku
    kiritsuku over 9 years
    That is why foldLeft contains the Left in its name and why there is also a method called fold.
  • samthebest
    samthebest over 9 years
@sschaef true, but fold doesn't exist in, say, Scalding, because unlike reduce it doesn't require commutativity. I've updated my answer to explain this. Basically the point I'm trying to make is that the difference between fold* and reduce* is very much related to the roots of FP in Category Theory.
  • Xiaohe Dong
    Xiaohe Dong over 9 years
Do you have an example of distinct (i.e. x op y != y op x) using reduce and fold to see the difference? I don't think it is a correct statement for reduce. For example, I use the non-commutative operation "divide" to do the reduce: (List(1000000.0) ::: List.tabulate(100)(_ + 0.001)).reduce(_ / _). For the statement "(i.e. x op y != y op x)" it should give me a random result, but it gives me the same result every time, so I think reduce still keeps the order of cumulation.
  • samthebest
    samthebest over 9 years
@Cloudtech That is a coincidence of its single-threaded implementation, not within its specification. On my 4-core machine, if I try adding .par, so (List(1000000.0) ::: List.tabulate(100)(_ + 0.001)).par.reduce(_ / _), I get different results each time.
  • Alex Dean
    Alex Dean over 9 years
    Great answer. I think that "reduce should be given a commutative monoid, i.e. an operation that is both commutative and associative" should read: "reduce should be given a commutative semigroup, i.e. a data structure which supports a combining operation that is both commutative and associative". I don't believe that reduce needs an identity element.
  • samthebest
    samthebest over 9 years
    @AlexDean in the context of computer science, no it doesn't really need an identity as empty collections tend to just throw exceptions. But it's mathematically more elegant (and would be more elegant if collections did this) if the identity element is returned when the collection is empty. In mathematics "throw an exception" doesn't exist.
  • Alex Dean
    Alex Dean over 9 years
    @samthebest - I see what you're saying, thanks. This is why you have to work with sumOption for Semigroups in Algebird...
  • Make42
    Make42 almost 8 years
@samthebest: Are you sure about the commutativity? github.com/apache/spark/blob/… says "For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection."
  • zero323
    zero323 almost 8 years
To be fair, the contract of fold on an RDD is exactly the same as the contract of Scala's ParSeq.fold. Although ParSeqs typically (always?) keep the order, it is not really a part of the contract. One can easily implement fold (or not so easily foldByKey) which guarantees the order of merging, and the main reason to do it as it is done now is performance / flexibility of scheduling.
  • samthebest
    samthebest almost 8 years
@Make42 I've updated my answer; indeed I assumed that the order of combining followed the order of partitions (which would be nice, and the logical way to implement it), but it turns out it's based more on a "first come first serve" basis, and thus non-deterministic.
  • samthebest
    samthebest almost 8 years
    After some back and forth, we believe you are correct. The order of combining is first come first serve. If you run sc.makeRDD(0 to 9, 2).mapPartitions(it => { java.lang.Thread.sleep(new java.util.Random().nextInt(1000)); it } ).map(_.toString).fold("")(_ + _) with 2+ cores several times, I think you will see it produces random (partition-wise) order. I've updated my answer accordingly.
  • Make42
    Make42 almost 8 years
@samthebest: So, in Spark, all of the combiner methods (reduce, reduceByKey, fold, foldByKey, aggregate, aggregateByKey, combineByKey) need to be given functions that are both associative and commutative after all, right?
  • samthebest
    samthebest almost 8 years
@Make42 That's correct, one could write their own reallyFold pimp though, as: rdd.mapPartitions(it => Iterator(it.fold(zero)(f))).collect().fold(zero)(f), and this wouldn't need f to commute.
  • Make42
    Make42 almost 8 years
    @samthebest: Not sure this works... The collect also gets the partitions first come, first serve. How do you know they are in the right order for the second fold? I tried this with a real example and I got my words back in the correct order with both your code and the original Spark fold. I suspect that within a partition fold in Spark will not need commutativity, but that putting the partitions together is the issue. Your solution does not improve on that, since the collect will have the same issue. What do you think?
  • samthebest
    samthebest almost 8 years
@Make42 I read the code for collect, and it does preserve the order of partitions. Step through it and you'll see (index, res) => results(index) = res as the result handler - so the result handler uses the partition index to place the result in the Array.
  • altayseyhan
    altayseyhan about 7 years
@samthebest so at this point I have a question: AFAIK string concat is non-commutative, so when I run this code multiple times: List("abc","def","ghi","jk","lmnop","qrs","tuv","wx","yz").par.reduce(_+_) I guess it should give me a random result, but I'm getting the same result every time.
  • altayseyhan
    altayseyhan about 7 years
@Cloudtech @samthebest I guess in this case (List(1000000.0) ::: List.tabulate(100)(_ + 0.001)).par.reduce(_ / _) producing different results each time is related to the division operation not being associative, I mean not related to it being non-commutative [docs.scala-lang.org/overviews/parallel-collections/… please look at the line starting with "Note: Often, it is thought that.."]
  • jrista
    jrista almost 4 years
    Truly hate to break the perfect 256 vote count, but this is an excellent answer!!