Manually calling spark's garbage collection from pyspark

11,888

Solution 1

You never have to call manually the GC. If you had OOMException it's because there is no more memory available. You should look for memory leak, aka references you keep in your code. If you releases this references the JVM will make free space when needed.

Solution 2

I believe this will trigger a GC (hint) in the JVM:

spark.sparkContext._jvm.System.gc()

See also: How to force garbage collection in Java?

and: Java: How do you really force a GC using JVMTI's ForceGargabeCollection?

Share:
11,888
architectonic
Author by

architectonic

Updated on June 05, 2022

Comments

  • architectonic
    architectonic almost 2 years

    I have been running a workflow on some 3 Million records x 15 columns all strings on my 4 cores 16GB machine using pyspark 1.5 in local mode. I have noticed that if I run the same workflow again without first restarting spark, memory runs out and I get Out of Memory Exceptions.

    Since all my caches sum up to about 1 GB I thought that the problem lies in the garbage collection. I was able to run the python garbage collector manually by calling:

    import gc
    collected = gc.collect()
    print "Garbage collector: collected %d objects." % collected
    

    This has helped a little.

    I have played with the settings of spark's GC according to this article, and have tried to compress the RDD and to change the serializer to Kyro. This had slowed down the processing and did not help much with the memory.

    Since I know exactly when I have spare cpu cycles to call the GC, it could help my situation to know how to call it manually in the JVM.

  • architectonic
    architectonic over 8 years
    Yes I have read this many times but I still think that my case is eligible for manual GC for several reasons: a. It was beneficial to call the Python GC since it considers the number of garbage objects rather than their size, b. The nature of my application involves stages where no computation takes place while waiting for a user decision, and c. What if I need to run some memory-intensive python functionality or a completely different application? I doubt that the JVM gc would account for that
  • crak
    crak over 8 years
    If you have to run memory-intensive functionality on the jvm ( if don't know for python ), the vm will use all the memory you all it to use and if it's need more crash ( because the jvm respect your wish ;). Call the gc when there is no computing can be seen as a good idea, but this gc will be a full gc and full gc are slow very slow. In all case if you encounter a Out of Memory Exceptions it's not GC problem! It's a code problem ! Either a need for more memory or a memory leak.
  • Ami
    Ami almost 5 years
    Our experience is that we are getting OOMException when we increase the size of the Java heap on the Driver/Master node. Our theory is that with the smaller size, GC is happening more frequently.
  • crak
    crak almost 5 years
    It's a strange behavior. But indeed if you have less memory, it's will be filled quicker, so the gc will have to clean memory more frequently. Are you speaking about JVM OOM ? The driver memory should be keep low, the computation is made in worker. If you have a worker in the same serv than the driver, it's possible increase the memory of the driver limit the accessible memory of the worker leading to a OOM