Number of partitions in RDD and performance in Spark
Solution 1
The primary performance effects come from specifying too few partitions or far too many partitions.
Too few partitions: you will not utilize all of the cores available in the cluster.
Too many partitions: there will be excessive overhead in managing many small tasks.
Between the two, the first is far more impactful on performance. Scheduling too many small tasks has a relatively small impact for partition counts below 1000. If you have on the order of tens of thousands of partitions, then Spark gets very slow.
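To make the trade-off concrete, here is a back-of-the-envelope sketch (plain Python, not a Spark API) of how partition count relates to scheduling "waves" when each core runs one task at a time; the core and partition counts are illustrative assumptions:

```python
import math

def scheduling_waves(num_partitions, num_cores):
    """Rounds of tasks needed if each core runs one task at a time."""
    return math.ceil(num_partitions / num_cores)

def idle_cores_in_last_wave(num_partitions, num_cores):
    """Cores left with nothing to do while the final wave finishes."""
    remainder = num_partitions % num_cores
    return 0 if remainder == 0 else num_cores - remainder

# Too few partitions: 4 partitions on a 16-core cluster leaves 12 cores idle.
print(idle_cores_in_last_wave(4, 16))   # 12
# Too many partitions: 10_000 partitions on 16 cores means paying
# per-task scheduling overhead across 625 waves.
print(scheduling_waves(10_000, 16))     # 625
```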
Solution 2
To add to javadba's excellent answer, I recall the docs recommend to have your number of partitions set to 3 or 4 times the number of CPU cores in your cluster so that the work gets distributed more evenly among the available CPU cores. Meaning, if you only have 1 partition per CPU core in the cluster you will have to wait for the one longest running task to complete but if you had broken that down further the workload would be more evenly balanced with fast and slow running tasks evening out.
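The rule of thumb above can be written down directly. A minimal plain-Python sketch, where the 3-4x multiplier is the heuristic from this answer (not a Spark setting) and the cluster shape is an assumed example:

```python
def recommended_partitions(total_executor_cores, factor=3):
    """Rule of thumb: 3-4 tasks per CPU core in the cluster."""
    return total_executor_cores * factor

# Assumed example cluster: 4 executors with 8 cores each.
cores = 4 * 8
print(recommended_partitions(cores))            # 96
print(recommended_partitions(cores, factor=4))  # 128
```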
Solution 3
The number of partitions has a high impact on Spark's performance.
Ideally, the partition count reflects how much data you want to shuffle. You should base it on your shuffle size (shuffle read/write), aiming for roughly 128 MB to 256 MB of data per partition to gain maximum performance.
You can set the partition count in your Spark SQL code via the property:
spark.sql.shuffle.partitions
or, when using any DataFrame, you can set it with:
df.repartition(numOfPartitions)
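Putting the 128-256 MB guideline into code, a minimal sketch: the arithmetic is plain Python, the 200 MB target is an assumed value inside the recommended range, and the Spark calls shown in comments are the standard `spark.conf.set` / `DataFrame.repartition` APIs mentioned above:

```python
import math

def shuffle_partitions(shuffle_bytes, target_mb=200):
    """Partitions needed so each holds roughly target_mb of shuffle data."""
    target_bytes = target_mb * 1024 * 1024
    return max(1, math.ceil(shuffle_bytes / target_bytes))

# For a stage that shuffles ~50 GB of data:
n = shuffle_partitions(50 * 1024**3)
print(n)  # 256

# In a live Spark session you would then apply it with:
#   spark.conf.set("spark.sql.shuffle.partitions", n)
# or, for a specific DataFrame:
#   df = df.repartition(n)
```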
mar tin
Updated on July 09, 2022

Comments
-
mar tin almost 2 years
In Pyspark, I can create a RDD from a list and decide how many partitions to have:
sc = SparkContext()
sc.parallelize(xrange(0, 10), 4)
How does the number of partitions I decide to partition my RDD into influence performance? And how does this depend on the number of cores my machine has?
-
Katya Willard, about 8 years: I highly recommend this link. Jump to the section "Tuning Parallelism". It gets pretty complicated, but it is certainly in depth.
-
Admin, over 7 years: Indeed. I have a particular job that used to run out of memory when it had 2 million large JSON objects split across 25K partitions. When I coalesce to 320, the job runs in less than a minute. This seems like one of the easiest ways to improve performance.
-
ruhong, almost 7 years: Cons of too many partitions: more fetches, more disk seeks. The driver needs to track state per task. Source: slide 53, slideshare.net/Hadoop_Summit/why-your-spark-job-is-failing
-
sversch, over 6 years: Can you provide a reference to where the recommended number of partitions is described in the docs, please?
-
shubham rajput, over 6 years: @sversch This Hackernoon blog post might help you in deciding the optimal number of partitions.
-
sversch, over 6 years: @ChrisTownsend The page you linked to says "In general, we recommend 2-3 tasks per CPU core in your cluster".
-
Chris Townsend, over 6 years: @sversch, right, and Spark can only run 1 concurrent task per partition, up to the number of cores in your cluster. If you have fewer partitions than CPU cores in your cluster, then you are not running with max parallelism. If you only have one partition per core, then you end up waiting for the longest running task. If you have too many partitions, then you end up generating more overhead and possibly making it slower.
-
sversch, over 6 years: @ChrisTownsend I don't disagree with you on the point about the need to properly configure for best performance. To be clear: in my first comment I intended to suggest that you provide a reference, e.g. putting a link in your answer to the doc page, since you're claiming the docs say something in particular. In my second comment, I'm pointing out that I couldn't find the place on the doc page you linked to that supports your claim. Also, your terminology switches between CPUs and cores. Which is it? I suggest you edit your answer to make these things clear.
-
Chris Townsend, over 6 years: @sversch, it is CPU cores. I've revised to match the documentation I linked to, and hopefully that provides clarity. The docs change, and my understanding is that answers are generally expected to be self-contained for that reason. Re-read the docs; given the guidelines for parallelism and how partitions work, what would you suggest as an alternative understanding of the guidelines on the recommended number of partitions?
-
jose rivera, about 6 years: In the book Spark in Action, Petar Zečević says on page 74: "We recommend using three to four times more partitions than there are cores in your cluster".
-
Kenny, about 5 years: By "number of CPU cores in the cluster", do you mean spark.driver.cores or spark.executor.cores?
-
Vivek Sethi, almost 5 years: @Kenny spark.executor.cores it is!
-
spacedustpi, about 4 years: @user2918461: How many total cores were in your cluster when you did this test?
-
ijoseph, over 3 years: After quite a bit of digging through the source code, I finally figured out how to get the default value for the number of partitions:
sc = pyspark.SparkContext()
sc.defaultParallelism