How does createOrReplaceTempView work in Spark?
Solution 1
createOrReplaceTempView creates (or replaces, if a view with that name already exists) a lazily evaluated "view" that you can then use like a Hive table in Spark SQL. It does not persist to memory unless you cache the dataset that underpins the view.
scala> val s = Seq(1,2,3).toDF("num")
s: org.apache.spark.sql.DataFrame = [num: int]
scala> s.createOrReplaceTempView("nums")
scala> spark.table("nums")
res22: org.apache.spark.sql.DataFrame = [num: int]
scala> spark.table("nums").cache
res23: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]
scala> spark.table("nums").count
res24: Long = 3
The data is cached fully only after the .count call, since .cache itself is lazy.
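The caching behavior can also be confirmed programmatically. A minimal sketch, assuming a local SparkSession (in spark-shell the session already exists as spark) and reusing the view name nums from the example above:

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("tempview-cache").getOrCreate()
import spark.implicits._

val s = Seq(1, 2, 3).toDF("num")
s.createOrReplaceTempView("nums")

val beforeCache = spark.catalog.isCached("nums")  // false: nothing cached yet
spark.table("nums").cache()                       // registers the table for caching (data still lazy)
val afterCache = spark.catalog.isCached("nums")   // true: marked as cached
val n = spark.table("nums").count()               // first action actually materializes the cache

spark.stop()
```

spark.catalog.isCached reports true as soon as cache() registers the table, even though the blocks themselves are only materialized by the first action.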
Related SO: spark createOrReplaceTempView vs createGlobalTempView
Relevant quote (comparing to persistent table): "Unlike the createOrReplaceTempView command, saveAsTable will materialize the contents of the DataFrame and create a pointer to the data in the Hive metastore." from https://spark.apache.org/docs/latest/sql-programming-guide.html#saving-to-persistent-tables
Note: createOrReplaceTempView was formerly registerTempTable.
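As the "OrReplace" part of the name suggests, registering a second view under the same name silently swaps the definition, and the view lives only as long as the session. A small sketch under those assumptions (local session, hypothetical view name nums):

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration.
val spark = SparkSession.builder().master("local[*]").appName("replace-demo").getOrCreate()
import spark.implicits._

Seq(1, 2, 3).toDF("num").createOrReplaceTempView("nums")
val first = spark.sql("SELECT COUNT(*) FROM nums").first().getLong(0)   // 3 rows

// Re-registering the same name replaces the old definition without error.
Seq(10, 20).toDF("num").createOrReplaceTempView("nums")
val second = spark.sql("SELECT COUNT(*) FROM nums").first().getLong(0)  // now 2 rows

// The view is session-scoped: dropping it (or stopping the session) removes it.
spark.catalog.dropTempView("nums")
val stillThere = spark.catalog.tableExists("nums")                      // false

spark.stop()
```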
Solution 2
createOrReplaceTempView creates a temporary view of the DataFrame. Nothing is persisted at this point, but you can run SQL queries on top of it. If you want to save the data, you can either persist the DataFrame or use saveAsTable to write it out as a table.
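The contrast can be sketched as follows (local session; the names kv_view and kv_table and the warehouse path are hypothetical, chosen for this example):

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; the warehouse directory is a hypothetical path.
val spark = SparkSession.builder().master("local[*]")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse-demo")
  .appName("persist-demo").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// Temp view: metadata only; it disappears when the session stops.
df.createOrReplaceTempView("kv_view")

// saveAsTable: materializes the data (Parquet by default) and registers it in the metastore.
df.write.mode("overwrite").saveAsTable("kv_table")
val persisted = spark.catalog.tableExists("kv_table")  // true: a real table now exists

spark.stop()
```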
First, we read the data in .csv format, convert it to a DataFrame, and create a temp view.
Reading data in .csv format
val data = spark.read.format("csv").option("header","true").option("inferSchema","true").load("FileStore/tables/pzufk5ib1500654887654/campaign.csv")
Printing the schema
data.printSchema
data.createOrReplaceTempView("Data")
Now we can run SQL queries on top of the table view we just created
%sql SELECT Week AS Date, `Campaign Type`, Engagements, Country FROM Data ORDER BY Date ASC
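Outside a %sql notebook cell, the same query can be issued through spark.sql; note that a column name containing a space must be backquoted in Spark SQL. A sketch using a tiny hypothetical stand-in DataFrame, since campaign.csv itself is not available here:

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration.
val spark = SparkSession.builder().master("local[*]").appName("sql-demo").getOrCreate()
import spark.implicits._

// Tiny stand-in for campaign.csv (rows and values are hypothetical).
val data = Seq(
  ("2017-01-02", "Email", 120, "US"),
  ("2017-01-01", "Social", 80, "DE")
).toDF("Week", "Campaign Type", "Engagements", "Country")

data.createOrReplaceTempView("Data")

// `Campaign Type` contains a space, so it must be backquoted.
val result = spark.sql(
  """SELECT Week AS Date, `Campaign Type`, Engagements, Country
    |FROM Data
    |ORDER BY Date ASC""".stripMargin)

val rows = result.collect()             // collect before stopping the session
val firstDate = rows.head.getString(0)  // earliest date sorts first

spark.stop()
```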
Solution 3
Spark SQL supports writing programs using the Dataset and DataFrame APIs, and it also needs to support plain SQL.
To run SQL against a DataFrame, Spark first needs a table definition with column names. If it created real tables for this, the Hive metastore would accumulate a lot of unnecessary tables, because Spark SQL integrates natively with Hive. So instead it creates a temporary view, which is available for the time being and can be used like any other Hive table; once the SparkContext stops, it is removed.
To create such a view, the developer uses the method createOrReplaceTempView.
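The session-scoped lifetime described above can be sketched as follows (local session; the view names are hypothetical), contrasting it with createGlobalTempView from the related question linked in Solution 1:

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration.
val spark = SparkSession.builder().master("local[*]").appName("scope-demo").getOrCreate()
import spark.implicits._

Seq(1, 2).toDF("n").createOrReplaceTempView("session_view")
Seq(3, 4).toDF("n").createGlobalTempView("global_view")

// A second session on the same SparkContext does not see the session-scoped view...
val other = spark.newSession()
val seesLocal = other.catalog.tableExists("session_view")  // false

// ...but the global view is visible to all sessions via the reserved global_temp database.
val globalCount = other.sql("SELECT * FROM global_temp.global_view").count()  // 2

spark.stop()
```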
Abir Chokraborty
Updated on June 30, 2021

Comments
-
Abir Chokraborty almost 3 years: I am new to Spark and Spark SQL. How does createOrReplaceTempView work in Spark? If we register an RDD of objects as a table, will Spark keep all the data in memory?
-
Abir Chokraborty almost 7 years: Does this view create a temporary table? And does the view have a specific name in Spark, like a Hive table?
-
Garren S almost 7 years: Yes, it is practically a table, except it has to be evaluated because it's not materialized to files. You can use the view just like you would a Hive table (such as in a raw SQL query).
-
Garren S almost 7 years: Unlike a traditional temp table, a temp view is NOT materialized at all, even to memory. It's useful for accessing data in SQL, but understand that its statements have to be evaluated EVERY time it's accessed.
-
Andrey over 5 years: @GarrenS how do I cache val df = sparkSession.sqlContext.sql("SELECT * FROM table"); df.createOrReplaceTempView("trans_union")? I cannot find out how to cache it.
-
Garren S over 5 years: First, use sparkSession.sql and skip sqlContext. Second, use df.cache() or CACHE TABLE in SQL.
-
MichaelChirico over 5 years: shouldn't that be .cache() and .count()?
-
Garren S over 5 years: @MichaelChirico in Python, yes, the parentheses are required. In Scala, they are optional. Or were you referring to another facet of those calls?
-
MichaelChirico over 5 years: Ah, in fact I somehow missed the big neon word "scala" 3 characters away...
-
linello over 2 years: It's kind of a way to "move" the DataFrame into the visibility scope of the current Spark Scala workspace, making it visible to the SQL syntax, then? And no data copy is made, right?
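The caching advice from the comments (df.cache() or CACHE TABLE in SQL) can be sketched as follows (local session; the view name trans_union echoes the comment above):

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration.
val spark = SparkSession.builder().master("local[*]").appName("cache-table-demo").getOrCreate()
import spark.implicits._

Seq(1, 2, 3).toDF("num").createOrReplaceTempView("trans_union")

// SQL equivalent of df.cache(); CACHE TABLE is eager by default
// (use CACHE LAZY TABLE to defer materialization to the first action).
spark.sql("CACHE TABLE trans_union")
val cached = spark.catalog.isCached("trans_union")    // true

spark.sql("UNCACHE TABLE trans_union")
val uncached = spark.catalog.isCached("trans_union")  // false

spark.stop()
```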