When to execute REFRESH TABLE my_table in Spark?


You can run spark.catalog.refreshTable(tableName) or spark.sql(s"REFRESH TABLE $tableName") just before the write operation. I had the same problem and this fixed it.

spark.catalog.refreshTable(tableName)
df.write.mode(SaveMode.Overwrite).insertInto(tableName)
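
For reference, a minimal end-to-end sketch of that pattern, assuming a Hive-enabled SparkSession and an existing Hive table (the table name and sample data here are placeholders, not from the original question):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("refresh-table-sketch")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val tableName = "my_table" // placeholder: an existing Hive table with a matching schema
val df = Seq((1, "a"), (2, "b")).toDF("id", "value") // stand-in data

// Invalidate Spark's cached file listing/metadata for the table before
// writing, so the plan is not built against files that no longer exist.
spark.catalog.refreshTable(tableName)
// SQL equivalent: spark.sql(s"REFRESH TABLE $tableName")

df.write.mode(SaveMode.Overwrite).insertInto(tableName)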
Author: Cherry (updated on July 09, 2022)

Comments

  • Cherry (almost 2 years ago)

    Consider the following code:

     import org.apache.spark.sql.hive.orc._
     import org.apache.spark.sql._
    
     val path = ...
     val dataFrame: DataFrame = ...
    
     val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
     dataFrame.createOrReplaceTempView("my_table")
     val results = hiveContext.sql(s"select * from my_table")
     results.write.mode(SaveMode.Append).partitionBy("my_column").format("orc").save(path)
     hiveContext.sql("REFRESH TABLE my_table")
    

    This code is executed twice with the same path but different dataFrames. The first run succeeds, but subsequent runs raise an error:

    Caused by: java.io.FileNotFoundException: File does not exist: hdfs://somepath/somefile.snappy.orc
    It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    

    I have tried clearing the cache and invoking hiveContext.dropTempTable("tableName"), but nothing has had any effect. When should REFRESH TABLE tableName be called (before the write, after it, or some other variant) to fix this error? (A sketch applying the fix follows the comments below.)

  • rahul (almost 2 years ago)
    The above solution only works if the table is not updated between the two statements.
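
Applying the answer above to the question's scenario would mean issuing the REFRESH before the query and the write, rather than after them, so that a stale file listing left over from the previous run is invalidated first. A sketch, using SparkSession in place of the deprecated HiveContext (the paths and input source are placeholders, not from the original question):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val path = "hdfs://example/output"                     // placeholder output path
val dataFrame = spark.read.orc("hdfs://example/input") // placeholder source

dataFrame.createOrReplaceTempView("my_table")

// Refresh BEFORE querying and writing, so cached file metadata from a
// prior run cannot leak into this run's plan.
spark.sql("REFRESH TABLE my_table")

val results = spark.sql("select * from my_table")
results.write.mode(SaveMode.Append).partitionBy("my_column").format("orc").save(path)

As rahul notes above, this only helps if nothing else updates the table between the refresh and the write.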