SparkSQL - Read parquet file directly
101,503
Solution 1
After creating a Dataframe from parquet file, you have to register it as a temp table to run sql queries
on it.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("src/main/resources/peopleTwo.parquet")
df.printSchema
// after registering as a table you will be able to run sql queries
df.registerTempTable("people")
sqlContext.sql("select * from people").collect.foreach(println)
Solution 2
With plain SQL
JSON, ORC, Parquet, and CSV files can be queried without creating the table on Spark DataFrame.
//This Spark 2.x code you can do the same on sqlContext as well
val spark: SparkSession = SparkSession.builder.master("set_the_master").getOrCreate
spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`")
.show()
Solution 3
Suppose that you have the parquet file ventas4 in HDFS:
hdfs://localhost:9000/sistgestion/sql/ventas4
In this case, the steps are:
Charge the SQL Context:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Read the parquet File:
val ventas=sqlContext.read.parquet("hdfs://localhost:9000/sistgestion/sql/ventas4")
Register a temporal table:
ventas.registerTempTable("ventas")
Execute the query (in this line you can use toJSON to pass a JSON format or you can use collect()):
sqlContext.sql("select * from ventas").toJSON.foreach(println(_)) sqlContext.sql("select * from ventas").collect().foreach(println(_))
Related videos on Youtube
Author by
Edamame
Updated on September 08, 2020Comments
-
Edamame over 3 years
I am migrating from Impala to SparkSQL, using the following code to read a table:
my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')
How do I invoke SparkSQL above, so it can return something like:
'select col_A, col_B from my_table'
-
Edamame over 7 yearsIs collect necessary (or a good idea)? Because if the data is big, we don't want to collect everything to the driver?
-
bob over 7 yearsits just an example how sql can be used. It depends on you how you wanna use it. you may change the query or do .take() also to get the required data on the driver
-
passionate over 6 yearsI do see this error "File not found. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved." How do I resolve this?
-
passionate over 6 yearsDoesn't help if I do spark.sqlContext().setConf("spark.sql.parquet.cacheMetadata", "false");
-
Cherry over 6 yearsWorks! Just replace
hdfs://my_hdfs_path/my_db.db/my_table
with you file path. :) -
MichaelChirico over 6 yearsThis is awesome, could you point to some further documentation on this type of behavior?
-
mrsrinivas over 5 yearsI found it from spark code base in github. Not sure about documentation on it.
-
Union find almost 5 yearsHow do you set the schema with this approach?
-
mrsrinivas almost 5 yearsNot possible to set schema explicitly. If schema not found then all columns are
String
type during dataframe creation. -
Benjamin Du almost 5 yearsCan we query a TSV file directly using Spark SQL?
-
aaronsteers over 4 yearsThanks so much for posting this example. I notice this works also with
csv
but I can't find documentation on the functionality anywhere. For options like csv, there are a couple parameters like headers, delimiters, etc, and I don't know how to configure them or change the defaults. -
THIS USER NEEDS HELP over 3 yearsbtw
Symbol SQLContext is deprecated. Use SparkSession.builder instead
-
sg1234 almost 3 yearsThis is great, thanks! Any idea how this can be adapted for parquet files split across sub-directories according to a partition (from Impala)? The field is not present in the actual parquet files since its value is in the format dir_name=value... Not sure how to include this into the dataframe or the sql query...
-
Ed Bighands almost 3 yearsThe question is to use SparkSQL. What this answer seems to be is structuredAPI and not inline with the ask. Also, I do not think that Intellij or any IDE is relevant here