When to execute REFRESH TABLE my_table in Spark?


You can run spark.catalog.refreshTable(tableName) or spark.sql(s"REFRESH TABLE $tableName") just before the write operation. I had the same problem and this fixed it.

spark.catalog.refreshTable(tableName)
df.write.mode(SaveMode.Overwrite).insertInto(tableName)
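
For reference, a minimal end-to-end sketch of that pattern, assuming a Hive-enabled SparkSession and an existing Hive table (the table name and sample data here are placeholders, not from the original question):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("refresh-table-sketch")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val tableName = "my_table" // placeholder: an existing Hive table with a matching schema
val df = Seq((1, "a"), (2, "b")).toDF("id", "value") // stand-in data

// Invalidate Spark's cached file listing/metadata for the table before
// writing, so the plan is not built against files that no longer exist.
spark.catalog.refreshTable(tableName)
// SQL equivalent: spark.sql(s"REFRESH TABLE $tableName")

df.write.mode(SaveMode.Overwrite).insertInto(tableName)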
Author: Cherry (updated on July 09, 2022)

Comments

  • Cherry (almost 2 years ago)

    Consider the following code:

     import org.apache.spark.sql.hive.orc._
     import org.apache.spark.sql._
    
     val path = ...
     val dataFrame: DataFrame = ...
    
     val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
     dataFrame.createOrReplaceTempView("my_table")
     val results = hiveContext.sql(s"select * from my_table")
     results.write.mode(SaveMode.Append).partitionBy("my_column").format("orc").save(path)
     hiveContext.sql("REFRESH TABLE my_table")
    

    This code is executed twice with the same path but different dataFrames. The first run succeeds, but subsequent runs raise an error:

    Caused by: java.io.FileNotFoundException: File does not exist: hdfs://somepath/somefile.snappy.orc
    It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    

    I have tried clearing the cache and invoking hiveContext.dropTempTable("tableName"), but nothing has had any effect. When should REFRESH TABLE tableName be called (before the write, after it, or some other variant) to fix this error? (A sketch applying the fix follows the comments below.)

  • rahul (almost 2 years ago)
    The above solution only works if the table is not updated between the two statements.
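
Applying the answer above to the question's scenario would mean issuing the REFRESH before the query and the write, rather than after them, so that a stale file listing left over from the previous run is invalidated first. A sketch, using SparkSession in place of the deprecated HiveContext (the paths and input source are placeholders, not from the original question):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val path = "hdfs://example/output"                     // placeholder output path
val dataFrame = spark.read.orc("hdfs://example/input") // placeholder source

dataFrame.createOrReplaceTempView("my_table")

// Refresh BEFORE querying and writing, so cached file metadata from a
// prior run cannot leak into this run's plan.
spark.sql("REFRESH TABLE my_table")

val results = spark.sql("select * from my_table")
results.write.mode(SaveMode.Append).partitionBy("my_column").format("orc").save(path)

As rahul notes above, this only helps if nothing else updates the table between the refresh and the write.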