Difference between rdd.collect().toMap and rdd.collectAsMap()?
Solution 1
The implementation of collectAsMap is as follows:
def collectAsMap(): Map[K, V] = self.withScope {
  val data = self.collect()
  val map = new mutable.HashMap[K, V]
  map.sizeHint(data.length)
  data.foreach { pair => map.put(pair._1, pair._2) }
  map
}
Thus, there is no performance difference between collect().toMap and collectAsMap, because collectAsMap itself calls collect under the hood and then builds the map on the driver.
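To make the comparison concrete, here is a minimal sketch that runs both paths on a local array, with no Spark required. The array is a hypothetical stand-in for what collect() returns on the driver, and the names CollectAsMapSketch and asMapLikeSpark are made up for illustration. The helper repeats the steps of the collectAsMap body quoted above.

```scala
import scala.collection.mutable

object CollectAsMapSketch {
  // Same steps as the collectAsMap body quoted above, applied to an
  // array that has already been collected (hypothetical helper name).
  def asMapLikeSpark[K, V](data: Array[(K, V)]): scala.collection.Map[K, V] = {
    val map = new mutable.HashMap[K, V]
    map.sizeHint(data.length)
    data.foreach { pair => map.put(pair._1, pair._2) }
    map
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for the result of rdd.collect(); note the duplicate key "a".
    val collected: Array[(String, Int)] = Array("a" -> 1, "b" -> 2, "a" -> 3)

    val viaToMap = collected.toMap              // immutable Map
    val viaHashMap = asMapLikeSpark(collected)  // backed by a mutable HashMap

    // Both keep one value per key, and the later pair wins for "a".
    println(viaToMap == viaHashMap)
  }
}
```

In both cases the whole dataset is already on the driver before the map is built, so the only practical difference is the type of map you get back.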
Solution 2
No difference. Avoid collect() as much as you can: it gives up parallelism and pulls the entire dataset onto the driver.
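If the goal is only to write the cleaned values out, as in the question's pipeline, one way to follow this advice is to skip the driver-side map entirely. The sketch below is an assumption about what the asker needs, not a drop-in replacement. The names DistributedSaveSketch, stripParens and saveValuesWithoutCollect are hypothetical, and it uses only RDD.map and RDD.saveAsTextFile.

```scala
import org.apache.spark.rdd.RDD

object DistributedSaveSketch {
  // Same string cleanup as in the question: drop parentheses from the toString output.
  def stripParens(s: String): String = s.replace("(", "").replace(")", "")

  // Hypothetical helper: writes the cleaned values directly from the executors,
  // without collect(), collectAsMap() or sc.parallelize.
  def saveValuesWithoutCollect[K, V](quoteRDD: RDD[(K, V)], quotePath: String): Unit =
    quoteRDD
      .map { case (_, value) => stripParens(value.toString) }
      .saveAsTextFile(quotePath)
}
```

Note one behavioral difference: building a map keeps a single value per key, while this version writes one line per record. If duplicate keys must be collapsed, that step would have to be added explicitly.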
Author by
sri hari kali charan Tummala
Updated on July 25, 2022
Comments
- sri hari kali charan Tummala almost 2 years
Is there any performance impact when I use collectAsMap on my RDD instead of rdd.collect().toMap?
I have a key-value RDD and I want to convert it to a HashMap. As far as I know, collect() is not efficient on large data sets because it brings everything to the driver. Can I use collectAsMap instead, and is there any performance impact?
Original:
val QuoteHashMap = QuoteRDD.collect().toMap
val QuoteRDDData = QuoteHashMap.values.toSeq
val QuoteRDDSet = sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(", "").replace(")", "")))
QuoteRDDSet.saveAsTextFile(Quotepath)
Change:
val QuoteHashMap = QuoteRDD.collectAsMap()
val QuoteRDDData = QuoteHashMap.values.toSeq
val QuoteRDDSet = sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(", "").replace(")", "")))
QuoteRDDSet.saveAsTextFile(Quotepath)