What are the differences between sc.parallelize and sc.textFile?

(1) Variable a will be saved as an RDD variable containing the expected text file content

(Highlighting mine) Not really. The line just describes what will happen after you execute an action, i.e. the RDD variable does not contain the expected text file content.

The RDD describes the partitions that, when an action is called, become tasks that will read their parts of the input file.
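
A minimal PySpark sketch of that laziness, assuming an existing SparkContext named sc and a hypothetical input path; nothing touches the file until the action on the last line:

    # Nothing is read here; the RDD only records where the data lives
    # and how it splits into partitions.
    a = sc.textFile("hdfs:///tmp/input.txt")    # hypothetical path
    b = a.filter(lambda line: len(line) > 0)    # still lazy; just extends the plan
    c = b.collect()                             # action: tasks now read their file splits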

(2) The driver node breaks up the work into tasks and each task contains information about the split of the data it will operate on. Now these tasks are assigned to worker nodes.

Yes, but only when an action is called, which is c = b.collect() in your case.

(3) When the collect action (i.e. collect() in our case) is invoked, the results will be returned to the master from the different nodes and saved as a local variable c.

YES! That's the most dangerous operation memory-wise since all the Spark executors running somewhere in the cluster start sending data back to the driver.
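
If the entire result does not have to land on the driver, here is a sketch of some memory-friendlier alternatives (reusing the RDD b from the question; the output path is hypothetical):

    n = b.count()                        # computed on the executors; only a number comes back
    sample = b.take(10)                  # pulls just a handful of records to the driver
    b.saveAsTextFile("hdfs:///tmp/out")  # writes results from the executors; nothing is collected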

Now I want to understand what difference the code below makes

Quoting the documentation of sc.textFile:

textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
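
A hypothetical PySpark call for comparison (path and partition count are made up):

    lines = sc.textFile("hdfs:///data/events.txt", minPartitions=4)  # RDD of strings, one element per line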

Quoting the documentation of sc.parallelize:

parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
Distribute a local Scala collection to form an RDD.
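
And the hypothetical PySpark counterpart:

    nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=4)  # distributes a local Python list across 4 partitions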

The difference is in the datasets: files (for textFile) versus a local collection (for parallelize). Both do the same thing under the covers, i.e. they build a description of how to access the data that is going to be processed using transformations and an action.

The main difference is therefore the source of the data.

Comments

  • user2531569 almost 4 years

    I am new to Spark. Can someone please clear up my doubt?

    Let's assume the following is my code:

    a = sc.textFile(filename) 
    b = a.filter(lambda x: len(x)>0 and x.split("\t").count("111"))
    c = b.collect()
    

    I hope the following is what happens internally (please correct me if my understanding is wrong):

    (1) Variable a will be saved as an RDD variable containing the expected text file content

    (2) The driver node breaks up the work into tasks and each task contains information about the split of the data it will operate on. Now these tasks are assigned to worker nodes.

    (3) When the collect action (i.e. collect() in our case) is invoked, the results will be returned to the master from the different nodes and saved as a local variable c.

    Now I want to understand what difference the code below makes:

    a = sc.textFile(filename).collect() 
    b = sc.parallelize(a).filter(lambda x: len(x)>0 and x.split("\t").count("111")) 
    c = b.collect() 
    

    Could someone please clarify?

  • BARATH over 5 years
    I have an RDD with a list of file names in a folder, obtained using the parallelize function, and I am using foreach to read each file in the RDD. Will the files be read in parallel, and can I then perform transformations on each individual file in parallel?
  • Jacek Laskowski over 5 years
    @BARATH That deserves a separate question (and I think one was asked already; search Stack Overflow). Just to give you a hint: you can have parallel and distributed computation in Spark depending on the number of records (files in your case) per partition. Partitions are the parallel tasks; the more partitions, the more parallel the computation.
  • Sheel Pancholi almost 5 years
    @Jacek Laskowski - I understand sc.textFile(). How does sc.parallelize(<collection>) work? The collection is created in the driver program itself, so how are the RDD partitions created from this collection distributed to the worker nodes? Do the partitions first get distributed to the worker nodes and then tasks get scheduled on the executors on those nodes? Or do the individual tasks already have their corresponding RDD partition bundled in them when the DAGScheduler->TaskScheduler in the driver program distributes the tasks to the executors on the worker nodes?