Spark 2.0 memory fraction


From the documentation:

Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:

  • spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%)
    is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually
    large records.
  • spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.
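To make the arithmetic behind those two settings concrete, here is a small standalone sketch (not Spark's internal code) that computes M and R from an executor heap size; the 4 GB heap is just an illustrative assumption:

    // Sketch of how the unified-memory regions follow from the two settings
    // quoted above. The heap size used in main() is purely illustrative.
    object MemoryFractionSketch {
      // Spark reserves a fixed 300MB of the heap before the fractions apply.
      val ReservedBytes: Long = 300L * 1024 * 1024

      def regions(heapBytes: Long,
                  memoryFraction: Double = 0.6,   // spark.memory.fraction default
                  storageFraction: Double = 0.5   // spark.memory.storageFraction default
                 ): (Long, Long) = {
        val usable = heapBytes - ReservedBytes
        val m = (usable * memoryFraction).toLong  // M: shared execution + storage region
        val r = (m * storageFraction).toLong      // R: storage portion protected from eviction
        (m, r)
      }

      def main(args: Array[String]): Unit = {
        val heap = 4L * 1024 * 1024 * 1024        // assume a 4GB executor heap
        val (m, r) = regions(heap)
        println(f"M = ${m / 1e9}%.2f GB, R = ${r / 1e9}%.2f GB")
      }
    }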

The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. Otherwise, when much of this space is used for caching and execution, the tenured generation will be full, which causes the JVM to significantly increase time spent in garbage collection.

In your case, I would modify spark.memory.storageFraction.
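For illustration, both unified-memory options can be set on the SparkConf before the session is created; the 0.8 and 0.3 values below are placeholders, not tuned recommendations for your workload:

    // Hedged example of setting the unified-memory options for a Spark 2.x
    // application; the 0.8 / 0.3 values are placeholders only.
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    object MemoryConfExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .set("spark.memory.fraction", "0.8")         // share of (heap - 300MB) for execution + storage
          .set("spark.memory.storageFraction", "0.3")  // part of that region protected for cached blocks

        val spark = SparkSession.builder()
          .appName("memory-fraction-example")
          .config(conf)
          .getOrCreate()

        // ... job logic (sort, write to HDFS) would go here ...
        spark.stop()
      }
    }

The same keys can equally be passed with --conf on spark-submit; either way they only take effect when the application starts, not while it is running.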


As a side note, are you sure that you understand how your job behaves?

It's typical to fine-tune a job by starting with memoryOverhead, the number of cores, and similar resource settings, and only then moving on to the attribute you modified.
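Continuing the hypothetical SparkConf from the example above, those coarser knobs would be adjusted first (all numbers are made-up starting points; the memoryOverhead key shown is the YARN-specific name used in Spark 2.0):

    // Coarse resource settings, usually tuned before the memory fractions.
    // Values are illustrative only; memoryOverhead is in MB and applies on YARN.
    conf
      .set("spark.executor.memory", "8g")                 // executor heap size
      .set("spark.executor.cores", "4")                   // cores per executor
      .set("spark.yarn.executor.memoryOverhead", "1024")  // off-heap overhead per executor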

Author: syl
Updated on July 31, 2022

Comments

  • syl (almost 2 years ago)

    I am working with Spark 2.0; the job starts by sorting the input data and storing its output on HDFS.

    I was getting out-of-memory errors; increasing the value of "spark.shuffle.memoryFraction" from 0.2 to 0.8 solved the problem. But in the documentation I found that this is a deprecated parameter.

    As I understand it, it was replaced by "spark.memory.fraction". How should I modify this parameter while taking the sort and the storage on HDFS into account?