Difference between rdd.collect().toMap and rdd.collectAsMap()?


Solution 1

The implementation of collectAsMap is as follows:

def collectAsMap(): Map[K, V] = self.withScope {
  val data = self.collect()
  val map = new mutable.HashMap[K, V]
  map.sizeHint(data.length)
  data.foreach { pair => map.put(pair._1, pair._2) }
  map
}

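Thus, there is no meaningful performance difference between collect().toMap and collectAsMap(): collectAsMap calls collect under the hood and then builds the map on the driver. In both cases the whole RDD is shipped to the driver first. The remaining differences are minor: collectAsMap fills a pre-sized mutable HashMap, while toMap builds an immutable Map, and in both cases a later pair overwrites an earlier one with the same key.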
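To make the comparison concrete, here is a minimal sketch in plain Scala with no Spark dependency. The array `collected` is a made-up stand-in for whatever rdd.collect() would return, and `asMapLikeSpark` is a hypothetical helper that simply mirrors the body quoted above; neither name comes from Spark itself.

```scala
import scala.collection.mutable

object CollectToMapSketch {
  // Hypothetical stand-in for the Array that rdd.collect() returns on the driver.
  val collected: Array[(String, Int)] = Array("a" -> 1, "b" -> 2, "a" -> 3)

  // Equivalent of rdd.collect().toMap: an immutable Map.
  val viaToMap: Map[String, Int] = collected.toMap

  // Mirrors the collectAsMap body quoted above: a pre-sized mutable HashMap
  // filled pair by pair, so a later pair overwrites an earlier one.
  def asMapLikeSpark[K, V](data: Array[(K, V)]): scala.collection.Map[K, V] = {
    val map = new mutable.HashMap[K, V]
    map.sizeHint(data.length)
    data.foreach { pair => map.put(pair._1, pair._2) }
    map
  }

  val viaCollectAsMap: scala.collection.Map[String, Int] = asMapLikeSpark(collected)

  def main(args: Array[String]): Unit = {
    // Both maps hold the same entries; for the duplicated key "a" the last value wins.
    println(viaToMap == viaCollectAsMap) // prints true
    println(viaToMap("a"))               // prints 3
  }
}
```

Either way the work happens on the driver after the data has been collected, so the choice between the two is about the map type you want back, not about speed.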

Solution 2

No difference. Avoid collect() wherever you can: it pulls the entire data set onto the driver, which defeats Spark's parallelism and can exhaust driver memory on large data sets.

Author by

sri hari kali charan Tummala

Updated on July 25, 2022

Comments

  • sri hari kali charan Tummala
    sri hari kali charan Tummala almost 2 years

    Is there any performance impact when I use collectAsMap on my RDD instead of rdd.collect().toMap?

    I have a key-value RDD that I want to convert to a HashMap. As far as I know, collect() is not efficient on large data sets because it brings all the data to the driver. Can I use collectAsMap instead, and is there any performance impact?

    Original:

    val QuoteHashMap = QuoteRDD.collect().toMap
    val QuoteRDDData = QuoteHashMap.values.toSeq
    val QuoteRDDSet = sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(", "").replace(")", "")))
    QuoteRDDSet.saveAsTextFile(Quotepath)

    Change:

    val QuoteHashMap = QuoteRDD.collectAsMap()
    val QuoteRDDData = QuoteHashMap.values.toSeq
    val QuoteRDDSet = sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(", "").replace(")", "")))
    QuoteRDDSet.saveAsTextFile(Quotepath)
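
    Both versions above collect the data to the driver only to parallelize it again. If the map is needed solely to drop duplicate keys before writing the values out, a rough sketch that stays distributed could look like the following. It assumes Spark is on the classpath; the object and method names are hypothetical, and the element types of QuoteRDD are not stated in the question, so they are left generic. Which value survives for a duplicated key is not guaranteed to match the driver-side versions.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object SaveValuesSketch {
  // Hypothetical helper: keeps one value per key and writes the values as text,
  // without ever moving the data to the driver.
  def saveValuesWithoutCollect[K: ClassTag, V: ClassTag](
      quotes: RDD[(K, V)],
      path: String): Unit = {
    quotes
      .reduceByKey((_, other) => other) // one value per key; the choice is arbitrary
      .values
      // Same string clean-up as in the question: strip tuple parentheses.
      .map(v => v.toString.replace("(", "").replace(")", ""))
      .saveAsTextFile(path)
  }
}
```

    With the names from the question, the call would be SaveValuesSketch.saveValuesWithoutCollect(QuoteRDD, Quotepath).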