SparkSQL - Read parquet file directly


Solution 1

After creating a DataFrame from the parquet file, you have to register it as a temp table to run SQL queries on it.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read.parquet("src/main/resources/peopleTwo.parquet")

df.printSchema

// After registering the DataFrame as a table, you can run SQL queries on it
df.registerTempTable("people")

sqlContext.sql("select * from people").collect.foreach(println)
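
If you are on Spark 2.x, registerTempTable is deprecated (see the comments below); here is a minimal sketch of the same approach with SparkSession and createOrReplaceTempView, assuming a local master purely for illustration:

import org.apache.spark.sql.SparkSession

// local[*] is only an assumption so the sketch can run standalone
val spark = SparkSession.builder.master("local[*]").getOrCreate()

val df = spark.read.parquet("src/main/resources/peopleTwo.parquet")
df.printSchema

// createOrReplaceTempView is the Spark 2.x replacement for registerTempTable
df.createOrReplaceTempView("people")
spark.sql("select * from people").show()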

Solution 2

With plain SQL

JSON, ORC, Parquet, and CSV files can be queried directly, without first creating a table or a DataFrame.

// Spark 2.x; the same works with sqlContext as well
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.master("set_the_master").getOrCreate

spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`")
   .show()
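
The same format.`path` pattern works for the other file types listed above. A hedged sketch with placeholder paths (without an explicit schema, CSV columns come back as strings, as noted in the comments below):

spark.sql("select * from csv.`hdfs://my_hdfs_path/my_file.csv`").show()
spark.sql("select * from json.`hdfs://my_hdfs_path/my_file.json`").show()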

Solution 3

Suppose that you have the parquet file ventas4 in HDFS:

hdfs://localhost:9000/sistgestion/sql/ventas4

In this case, the steps are:

  1. Create the SQLContext:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    
  2. Read the parquet File:

    val ventas=sqlContext.read.parquet("hdfs://localhost:9000/sistgestion/sql/ventas4")
    
  3. Register a temporary table:

    ventas.registerTempTable("ventas")
    
  4. Execute the query (here you can use toJSON to get JSON output, or collect() to bring the rows to the driver):

    sqlContext.sql("select * from ventas").toJSON.foreach(println(_))
    
    sqlContext.sql("select * from ventas").collect().foreach(println(_))
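
A side note on collect() (see the comment exchange below): it pulls the whole result to the driver, so for large results a preview with show() or take() is usually safer. A small sketch of both:

    sqlContext.sql("select * from ventas").show(20)

    sqlContext.sql("select * from ventas").take(10).foreach(println)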
    



Comments

  • Edamame
    Edamame over 3 years

    I am migrating from Impala to SparkSQL, using the following code to read a table:

    my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')
    

    How do I invoke SparkSQL above, so it can return something like:

    'select col_A, col_B from my_table'
    
  • Edamame
    Edamame over 7 years
    Is collect necessary (or a good idea)? If the data is big, we don't want to collect everything to the driver.
  • bob
    bob over 7 years
    It's just an example of how SQL can be used. It depends on how you want to use it; you can change the query or use .take() to get the required data on the driver.
  • passionate
    passionate over 6 years
    I do see this error "File not found. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved." How do I resolve this?
  • passionate
    passionate over 6 years
    Doesn't help if I do spark.sqlContext().setConf("spark.sql.parquet.cacheMetadata", "false");
  • Cherry
    Cherry over 6 years
    Works! Just replace hdfs://my_hdfs_path/my_db.db/my_table with your file path. :)
  • MichaelChirico
    MichaelChirico over 6 years
    This is awesome, could you point to some further documentation on this type of behavior?
  • mrsrinivas
    mrsrinivas over 5 years
    I found it in the Spark code base on GitHub. Not sure about documentation for it.
  • Union find
    Union find almost 5 years
    How do you set the schema with this approach?
  • mrsrinivas
    mrsrinivas almost 5 years
    It's not possible to set the schema explicitly. If no schema is found, all columns are of String type during DataFrame creation.
  • Benjamin Du
    Benjamin Du almost 5 years
    Can we query a TSV file directly using Spark SQL?
  • aaronsteers
    aaronsteers over 4 years
    Thanks so much for posting this example. I notice this also works with CSV, but I can't find documentation on the functionality anywhere. For formats like CSV there are a couple of parameters, like headers and delimiters, and I don't know how to configure them or change the defaults.
  • THIS USER NEEDS HELP
    THIS USER NEEDS HELP over 3 years
    btw Symbol SQLContext is deprecated. Use SparkSession.builder instead
  • sg1234
    sg1234 almost 3 years
    This is great, thanks! Any idea how this can be adapted for parquet files split across sub-directories according to a partition (from Impala)? The field is not present in the actual parquet files since its value is in the format dir_name=value... Not sure how to include this into the dataframe or the sql query...
  • Ed Bighands
    Ed Bighands almost 3 years
    The question is about using SparkSQL. This answer seems to use the structured API, which is not in line with the ask. Also, I do not think that IntelliJ or any IDE is relevant here.
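
Regarding the "File not found ... REFRESH TABLE" error mentioned in the comments: a minimal sketch of the cache invalidation that message suggests, assuming Spark 2.x and placeholder table/path names:

spark.sql("REFRESH TABLE my_table")

// Or via the catalog API; refreshByPath targets path-based sources such as
// the parquet.`...` queries above
spark.catalog.refreshTable("my_table")
spark.catalog.refreshByPath("hdfs://my_hdfs_path/my_db.db/my_table")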