How to make sure my DataFrame frees its memory?

10,848

Solution 1

df.unpersist should be sufficient, but it won't necessarily free the memory right away. It merely marks the DataFrame for removal.

You can use df.unpersist(blocking = true), which will block until the DataFrame is removed before continuing.
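A minimal sketch of the cache-then-release pattern (the input path and DataFrame names are placeholders, and a running SparkSession is assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("cache-demo").getOrCreate()

// Cache a DataFrame that will be reused several times.
val df1 = spark.read.parquet("/path/to/data").cache()
df1.count() // an action materializes the cache

// ... use df1 to derive other DataFrames ...

// Synchronously drop the cached blocks: the call blocks until
// every executor has removed them, instead of merely marking
// the DataFrame for removal.
df1.unpersist(blocking = true)
```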

Solution 2

A Spark user has no way to manually trigger garbage collection.

Assigning df = null is not going to release much memory, because a DataFrame does not hold the data - it is just a description of a computation.
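To see why, note that building a DataFrame triggers no work by itself; a sketch (path and column name are placeholders):

```scala
import org.apache.spark.sql.functions.col

// No data is read or held here: `df` is only a logical plan.
val df = spark.read.parquet("/path/to/data").filter(col("amount") > 0)

// Work (and executor memory use) happens only when an action runs,
// e.g. df.count() or df.write. Setting the reference to null
// afterwards frees only the small plan object on the driver.
```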

If your application has memory issues, have a look at the Garbage Collection tuning guide. It suggests where to start and what can be changed to improve GC.
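For example, GC behaviour can be adjusted through executor JVM options at submit time; the flags and values below are illustrative starting points from that guide, not recommendations:

```shell
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails" \
  --conf spark.memory.fraction=0.6 \
  my-job.jar
```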

Author by belka

Engineer looking for challenges in data-driven or blockchain businesses.

Updated on June 23, 2022

Comments

  • belka almost 2 years

    I have a Spark/Scala job in which I do this:

    • 1: Compute a big DataFrame df1 + cache it into memory
    • 2: Use df1 to compute dfA
    • 3: Read raw data into df2 (again, it's big) + cache it

    When performing (3), I no longer need df1. I want to make sure its space gets freed. I cached it at (1) because this DataFrame gets used in (2), and it's the only way to make sure I compute it only once rather than each time.

    I need to free its space and make sure it gets freed. What are my options?

    I thought of these, but they don't seem to be sufficient:

    • df=null
    • df.unpersist()

    Can you document your answer with a proper Spark documentation link?

  • belka about 6 years
    Actually, you do have a way to make sure your object will be garbage collected: you set your variable (object reference) to null and, at the next GC cycle, IF your object no longer has any other pointer referencing it, it should be garbage collected.
  • belka about 6 years
    Thanks for the update. Assume I cached my DataFrame - then it no longer holds just a description of future computations but the actual data, doesn't it? Also, what does unpersist do in this case?
  • belka about 6 years
    Will this set the DataFrame aside, or block the execution timeline?
  • puhlen about 6 years
    @belka it blocks execution (as in, the thread will stop and wait for the DataFrame to be uncached)
  • belka almost 6 years
    As far as my experience goes, this blocks execution on all machines before resuming in this case.