Apache Spark spilling to disk

10,629

Solution 1

I found that one of my columns had nulls throughout, causing a skew that resulted in constant spills.
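To see why an all-null key column causes skew: a hash partitioner sends every null key to the same partition, so one task receives nearly all of the rows and is forced to spill. Here is a minimal plain-Python sketch of the effect (no Spark required; assigning nulls to bucket 0 is an illustrative convention, mirroring how a hash partitioner gives identical keys identical buckets):

```python
# Illustrative sketch: all null keys hash to one shuffle partition,
# so that partition's task receives almost all the data and spills.
num_partitions = 8

def partition_for(key):
    # Mimics a hash partitioner that assigns null keys to bucket 0.
    return 0 if key is None else hash(key) % num_partitions

keys = [None] * 1000 + [str(i) for i in range(10)]
counts = {}
for key in keys:
    bucket = partition_for(key)
    counts[bucket] = counts.get(bucket, 0) + 1

# Bucket 0 holds at least the 1000 null-keyed rows,
# while the remaining buckets share only ~10 rows between them.
```

Dropping or filtering the null-keyed rows before the shuffle restores a roughly even distribution across partitions.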

Solution 2

There are different memory arenas in play. For caching, Spark uses spark.storage.memoryFraction (defaults to 60%) of the heap; this is what most of the "free memory" messages are about. For shuffle it uses spark.shuffle.memoryFraction (defaults to 20%) of the heap, and I think this is what the spill messages are about. You can disable shuffle spill entirely by setting spark.shuffle.spill to false (defaults to true).

I don't know if this explains all of what you are seeing. See http://spark.apache.org/docs/latest/configuration.html for descriptions of all such parameters.
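These settings can go in spark-defaults.conf (or be passed as --conf flags). A sketch of the two tuning directions described above; the 0.4/0.4 split is purely illustrative, not a recommendation:

```
# Shift heap share from caching toward shuffle (illustrative values)
spark.storage.memoryFraction   0.4
spark.shuffle.memoryFraction   0.4

# Or disable shuffle spill entirely
# (risks OutOfMemoryError if a shuffle does not fit in memory)
spark.shuffle.spill            false
```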

Author by monster

Updated on June 04, 2022

Comments

  • monster
    monster almost 2 years

    When running my program locally on a 16Gb MBP I get the following occurrences:

    15/04/10 20:07:50 INFO BlockManagerMaster: Updated info of block rdd_12_3
    15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
    15/04/10 20:07:50 INFO BlockManagerInfo: Added rdd_12_6 in memory on 192.168.1.4:60005 (size: 854.0 KB, free: 682.9 MB)
    15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 8 non-empty blocks out of 8 blocks
    15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
    15/04/10 20:07:50 INFO BlockManagerMaster: Updated info of block rdd_12_6
    15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
    15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 8 non-empty blocks out of 8 blocks
    15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
    15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 7.9 MB to disk (1 time so far)
    15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (1 time so far)
    15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 8.0 MB to disk (1 time so far)
    15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (2 times so far)
    15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.8 MB to disk (1 time so far)
    15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 5.2 MB to disk (2 times so far)
    15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.6 MB to disk (2 times so far)
    15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (3 times so far)
    15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.0 MB to disk (2 times so far)
    15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 61 spilling in-memory batch of 24.3 MB to disk (1 time so far)
    15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 5.0 MB to disk (3 times so far)
    15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.0 MB to disk (3 times so far)
    15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (4 times so far)
    15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.3 MB to disk (3 times so far)
    15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.0 MB to disk (4 times so far)
    15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.2 MB to disk (5 times so far)
    15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 5.8 MB to disk (4 times so far)
    15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory batch of 35.6 MB to disk (1 time so far)
    15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.0 MB to disk (4 times so far)
    15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.0 MB to disk (5 times so far)
    15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (6 times so far)
    15/04/10 20:07:53 INFO MemoryStore: ensureFreeSpace(872616) called with curMem=1345765155, maxMem=2061647216
    15/04/10 20:07:53 INFO MemoryStore: Block rdd_12_2 stored as values in memory (estimated size 852.2 KB, free 681.9 MB)
    15/04/10 20:07:53 INFO BlockManagerInfo: Added rdd_12_2 in memory on 192.168.1.4:60005 (size: 852.2 KB, free: 682.0 MB)
    15/04/10 20:07:53 INFO BlockManagerMaster: Updated info of block rdd_12_2
    

    My understanding is that it has free memory; most of the memory is in fact free, as shown by:

    15/04/10 20:07:50 INFO BlockManagerInfo: Added rdd_12_6 in memory on 192.168.1.4:60005 (size: 854.0 KB, free: 682.9 MB)
    

    And yet it is spilling to disk? I'm using a ~265 MB dataset, so it really shouldn't need to spill to disk.

    For what it's worth:

    15/04/10 20:06:50 INFO MemoryStore: MemoryStore started with capacity 1966.1 MB
    

    With all this spilling to disk it's taking ~5 minutes to run through my program once.

    Why is this occurring?

  • monster
    monster about 9 years
    I set spark.shuffle.spill to false but the spilling is still occurring. I am confused as to why a 265 MB dataset can't fit in memory.
  • Daniel Darabos
    Daniel Darabos about 9 years
    If that's the size on disk, it can easily blow up 20-30 times when it's loaded. At least in my application it tends to :). Java is not very frugal with memory.
  • monster
    monster about 9 years
    The UI is showing some strange behaviour; maybe this is what you mean! It says the input is 6.6 GB! Why does it seem to be replicating the data 20-30 times over? In one part where I read in the data, the UI says the input is 934.9 MB for just reading in a 265 MB file!? It spills 56.7 MB to disk. I did not expect this.
  • Daniel Darabos
    Daniel Darabos about 9 years
    cs.virginia.edu/kim/publicity/pldi09tutorials/… is a good presentation on some of the surprising ways memory can be wasted. It doesn't necessarily apply here, but it's a good read anyway.
  • Mark
    Mark about 7 years
    This was the case for me as well. I am using Spark's RowMatrix.columnSimilarity(threshold) and having null values killed performance.
  • BdEngineer
    BdEngineer over 5 years
    @Daniel Darabos If spark.shuffle.spill is set to false and a shuffle then requires more memory than is available, what will happen?
  • Miroslaw
    Miroslaw almost 4 years
    Same for me. I had: DF.repartition(100, columnA).sortWithinPartitions('columnB') where columnA had just one value -> null
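Daniel Darabos's 20-30x blow-up estimate squares with the numbers quoted in this thread; a quick sanity check (Python, with the figures taken from the logs and comments above):

```python
# Figures from the thread, in MB
on_disk = 265            # size of the dataset on disk
memory_store = 1966.1    # MemoryStore capacity from the asker's log

# Daniel's rough 20-30x in-memory blow-up for Java objects
low_estimate = on_disk * 20    # 5300 MB
high_estimate = on_disk * 30   # 7950 MB

# Even the low estimate is well beyond the MemoryStore capacity,
# so spilling is expected despite the "free memory" messages.
print(low_estimate > memory_store)    # True

# The 6.6 GB the UI reports is ~25x the 265 MB file,
# which falls inside the 20-30x range.
print(low_estimate <= 6600 <= high_estimate)    # True
```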