Apache Spark spilling to disk
Solution 1
I found that one of my columns had nulls throughout, causing a skew which resulted in constant spills.
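To see why a mostly-null key column causes skew, here is a minimal pure-Python simulation (not Spark code). It assumes a simplified hash partitioner built on zlib.crc32, which is not Spark's actual partitioner, and the data, the partition_sizes and salt_nulls helpers, and the "__null__" marker are all hypothetical. It shows every null key landing in one partition, and how replacing nulls with salted keys spreads them out. In a real job, filtering out or filling the nulls before the shuffle may be the simpler fix.

```python
import random
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Simplified stand-in for a hash partitioner (NOT Spark's real one)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salt_nulls(keys, num_salts, rng):
    """Replace each None key with a ("__null__", salt) tuple so nulls spread out."""
    return [("__null__", rng.randrange(num_salts)) if k is None else k
            for k in keys]


rng = random.Random(0)
# Hypothetical data: 90% of rows have a null key, the rest have distinct keys.
keys = [None] * 9000 + list(range(1000))

skewed = partition_sizes(keys, 8)
salted = partition_sizes(salt_nulls(keys, 64, rng), 8)

print("rows per partition, null keys as-is:", skewed)
print("rows per partition, nulls salted:   ", salted)
```

In the first list one partition holds all 9000 null rows, so the task processing it is the one most likely to run out of shuffle memory and spill.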
Solution 2
There are different memory arenas in play. For caching, Spark uses spark.storage.memoryFraction (defaults to 60%) of the heap. This is what most of the "free memory" messages are about. It uses spark.shuffle.memoryFraction (defaults to 20%) of the heap for shuffle. I think this is what the spill messages are about. You can disable shuffle spill entirely by setting spark.shuffle.spill to false (defaults to true).
I don't know if this explains all of what you are seeing. See http://spark.apache.org/docs/latest/configuration.html for the description of all such parameters.
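As a rough sketch of the arithmetic, the snippet below splits a heap into the two arenas using the fractions quoted above. The extra safety factors (0.9 for storage, 0.8 for shuffle) are an assumption based on how Spark 1.x applied spark.storage.safetyFraction and spark.shuffle.safetyFraction, and the memory_arenas helper is hypothetical. The maxMem value is taken from the MemoryStore line in the question's log.

```python
MB = 1024 * 1024


def memory_arenas(heap_bytes,
                  storage_fraction=0.6,   # spark.storage.memoryFraction
                  shuffle_fraction=0.2,   # spark.shuffle.memoryFraction
                  storage_safety=0.9,     # assumed Spark 1.x safety factor
                  shuffle_safety=0.8):    # assumed Spark 1.x safety factor
    """Return (storage_bytes, shuffle_bytes) for a given JVM heap size."""
    storage = heap_bytes * storage_fraction * storage_safety
    shuffle = heap_bytes * shuffle_fraction * shuffle_safety
    return storage, shuffle


# maxMem reported by MemoryStore in the question's log output.
max_mem = 2061647216

# Work backwards to the heap size that these fractions would imply.
implied_heap = max_mem / (0.6 * 0.9)
storage, shuffle = memory_arenas(implied_heap)

print(f"implied heap:  {implied_heap / MB:.1f} MB")
print(f"storage arena: {storage / MB:.1f} MB")
print(f"shuffle arena: {shuffle / MB:.1f} MB (shared by all running tasks)")
```

Under these assumptions the shuffle arena is much smaller than the cache, and it is divided among the concurrently running tasks, so a task can spill while the storage arena still reports hundreds of MB free.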
monster
Updated on June 04, 2022
Comments
-
monster almost 2 years
When running my program locally on a 16 GB MBP I see the following log output:
15/04/10 20:07:50 INFO BlockManagerMaster: Updated info of block rdd_12_3
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/04/10 20:07:50 INFO BlockManagerInfo: Added rdd_12_6 in memory on 192.168.1.4:60005 (size: 854.0 KB, free: 682.9 MB)
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 8 non-empty blocks out of 8 blocks
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/04/10 20:07:50 INFO BlockManagerMaster: Updated info of block rdd_12_6
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 8 non-empty blocks out of 8 blocks
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 7.9 MB to disk (1 times so far)
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (1 times so far)
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 8.0 MB to disk (1 times so far)
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (2 timess so far)
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.8 MB to disk (1 times so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 5.2 MB to disk (2 timess so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.6 MB to disk (2 timess so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (3 timess so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.0 MB to disk (2 timess so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 61 spilling in-memory batch of 24.3 MB to disk (1 times so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 5.0 MB to disk (3 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.0 MB to disk (3 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (4 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.3 MB to disk (3 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.0 MB to disk (4 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.2 MB to disk (5 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 5.8 MB to disk (4 timess so far)
15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory batch of 35.6 MB to disk (1 times so far)
15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.0 MB to disk (4 timess so far)
15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.0 MB to disk (5 timess so far)
15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (6 timess so far)
15/04/10 20:07:53 INFO MemoryStore: ensureFreeSpace(872616) called with curMem=1345765155, maxMem=2061647216
15/04/10 20:07:53 INFO MemoryStore: Block rdd_12_2 stored as values in memory (estimated size 852.2 KB, free 681.9 MB)
15/04/10 20:07:53 INFO BlockManagerInfo: Added rdd_12_2 in memory on 192.168.1.4:60005 (size: 852.2 KB, free: 682.0 MB)
15/04/10 20:07:53 INFO BlockManagerMaster: Updated info of block rdd_12_2
My understanding is that it has free memory; in fact, most of the memory is free, as shown by:
15/04/10 20:07:50 INFO BlockManagerInfo: Added rdd_12_6 in memory on 192.168.1.4:60005 (size: 854.0 KB, free: 682.9 MB)
And yet it is spilling to disk? I'm using a ~265 MB dataset, so it really shouldn't need to be spilled to disk.
For what it's worth:
15/04/10 20:06:50 INFO MemoryStore: MemoryStore started with capacity 1966.1 MB
With all this spilling to disk it's taking ~5 minutes to run through my program once.
Why is this occurring?
-
monster about 9 years
I set spark.shuffle.spill to false but the spilling is still occurring. I am confused as to why a 265 MB dataset can't fit in memory.
-
Daniel Darabos about 9 years
If that's the size on disk, it can easily blow up 20-30 times when it's loaded. At least in my application it tends to :). Java is not very frugal with memory.
-
monster about 9 years
The UI is showing some strange behaviour, maybe this is what you mean! It says the input is 6.6 GB! Why does it seem to be replicating the data 20-30 times over? In one part where I read in the data, the UI says the input data is 934.9 MB for just reading in a 265 MB file!? It spills 56.7 MB to disk. I did not expect this.
-
Daniel Darabos about 9 years
cs.virginia.edu/kim/publicity/pldi09tutorials/… is a good presentation on some of the surprising ways memory can be wasted. It doesn't necessarily apply here, but it's a good read anyway.
-
Mark about 7 years
This was the case for me as well. I am using Spark's RowMatrix.columnSimilarities(threshold), and having null values killed performance.
-
BdEngineer over 5 years
@Daniel Darabos If spark.shuffle.spill is set to false, what happens when a shuffle requires memory?
-
Miroslaw almost 4 years
Same for me. I had: DF.repartition(100, columnA).sortWithinPartitions('columnB') where columnA had just one value: null