Drop spark dataframe from cache
Solution 1
Call unpersist() on each DataFrame you no longer need:
df1.unpersist()
df2.unpersist()
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
Solution 2
If the dataframe is registered as a table for SQL operations, like
df.createGlobalTempView(tableName) // or some other way as per Spark version
then the cache can be dropped with the following commands. Of course, Spark also does this automatically.
Spark >= 2.x
Here, spark is a SparkSession object.
Drop a specific table/df from cache
spark.catalog.uncacheTable(tableName)
Drop all tables/dfs from cache
spark.catalog.clearCache()
Spark <= 1.6.x
Drop a specific table/df from cache
sqlContext.uncacheTable(tableName)
Drop all tables/dfs from cache
sqlContext.clearCache()
ankit patel
Updated on July 09, 2022
Comments
-
ankit patel almost 2 years
I am using Spark 1.3.0 with the Python API. While transforming huge dataframes, I cache many DFs for faster execution:
df1.cache()
df2.cache()
Once a dataframe is no longer needed, how can I drop it from memory (or un-cache it)?
For example, df1 is used throughout the code, while df2 is used for a few transformations and is never needed after that. I want to forcefully drop df2 to release more memory.
-
axlpado - Agile Lab over 8 years
And take care to unpersist the df only at the end of the lineage, that is, after the last action that involves the cached df.
-
spacedustpi almost 4 years
I tried this for one of my dataframes "df" and when I did df.show(), df was still displaying data. When does it actually unpersist?
-
spacedustpi almost 4 years
I tried these for my RDD 'df'. Why does df.show() still display data?
-
mrsrinivas almost 4 years
df.show() will display data irrespective of cache, as long as the input source for the data frame is available.
-
Itération 122442 about 2 years
@spacedustpi It removes the dataframe from the cache (in memory, or on disk if there was not enough space in memory). By calling show, you triggered an action, so the computation was rerun from the beginning to show you the data.