Understanding parallelism in Spark and Scala


Solution 1

Spark adds processing overhead in order to support generality across multiple nodes. This overhead is roughly constant with respect to data size, so it may be negligible for huge data sets. On a single node, that overhead makes Spark slower than Scala's parallel collections.

Use Spark when

  1. You have more than one node
  2. You want your job to be ready to scale to multiple nodes
  3. The Spark overhead on one node is negligible because the data is huge, so you might as well choose the richer framework
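
For illustration, here is a minimal sketch of the same map written both ways. The data and the transform function are placeholders, and .par assumes Scala 2.12 or earlier, where parallel collections ship with the standard library:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismSketch {
  // Stand-in for whatever per-record processing the job really does.
  def transform(line: String): String = line.toUpperCase

  def main(args: Array[String]): Unit = {
    val lines = Vector.fill(1000000)("a,b,c")

    // Scala parallel collections: multi-core on one JVM, no scheduler overhead.
    val localResult = lines.par.map(transform).seq

    // Spark: the same logic, but every task passes through Spark's scheduler,
    // and the job can later scale out just by changing the master URL.
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))
    val sparkResult = sc.parallelize(lines).map(transform).collect()
    sc.stop()
  }
}
```

On one machine both versions compute the same result; the Spark version additionally pays for task scheduling and serialization, which is exactly the overhead that only pays off at scale.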

Solution 2

SparkContext's parallelize makes your collection suitable for processing on multiple nodes, as well as on multiple local cores of your single worker instance ( local[2] ), but then you probably get too much overhead from running Spark's task scheduler and all that machinery. Of course, Scala's parallel collections should be faster on a single machine.

http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#parallelized-collections - are your files big enough to be automatically split into multiple slices? Did you try setting the number of slices manually?
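
As a rough sketch, the slice count can be set explicitly when a collection is parallelized or a file is read (the path and counts below are made up):

```scala
// Explicit slice/partition counts; the default depends on the master setting.
val numbers = sc.parallelize(1 to 1000000, 8) // 8 slices instead of the default
val records = sc.textFile("data/big.csv", 8)  // ask for at least 8 splits of the file
```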

Did you try running the same Spark job on a single core and then on two cores?
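
One way to run that comparison is to launch the identical job twice, changing only the master URL (a sketch; the app name is arbitrary, and each context should be stopped before the next is created):

```scala
// Same job, different core counts: compare the wall-clock times.
val oneCore = new SparkContext("local[1]", "timing-test")
// ... run the job, record the time, then oneCore.stop() ...
val twoCores = new SparkContext("local[2]", "timing-test")
// ... run the same job again and compare ...
```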

Expect the best results from Spark with one really big, uniformly structured file rather than with multiple smaller files.


Comments

  • MARK almost 2 years

    I have some confusion about parallelism in Spark and Scala. I am running an experiment in which I have to read many (CSV) files from disk, change/process certain columns, and then write them back to disk.

    In my experiments, if I use only SparkContext's parallelize method, it does not seem to have any impact on performance. However, simply using Scala's parallel collections (through par) reduces the time almost by half.

    I am running my experiments in local mode with the argument local[2] for the Spark context.

    My question is: when should I use Scala's parallel collections, and when should I use SparkContext's parallelize?
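
    For concreteness, a minimal sketch of the kind of experiment described, parallelizing over the files with par; the directory name and the column transform are hypothetical, and .par assumes Scala 2.12 or earlier:

    ```scala
    import java.nio.file.{Files, Paths}
    import scala.collection.JavaConverters._
    import scala.io.Source

    // Process each CSV file on a separate local core; the transform is a placeholder.
    val files = Files.list(Paths.get("data")).iterator().asScala.toVector
    files.par.foreach { path =>
      val src = Source.fromFile(path.toFile)
      try {
        val processed = src.getLines()
          .map(_.split(",").map(_.trim).mkString(",")) // stand-in column change
          .mkString("\n")
        Files.write(Paths.get(path.toString + ".out"), processed.getBytes)
      } finally src.close()
    }
    ```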