SparkSQL - Read parquet file directly


Solution 1

After creating a DataFrame from the parquet file, you have to register it as a temp table to run SQL queries on it.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read.parquet("src/main/resources/peopleTwo.parquet")

df.printSchema

// After registering the DataFrame as a table, you can run SQL queries on it
df.registerTempTable("people")

sqlContext.sql("select * from people").collect.foreach(println)
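
If you are on Spark 2.x, registerTempTable is deprecated (see the comments below); here is a minimal sketch of the same approach with SparkSession and createOrReplaceTempView, assuming a local master purely for illustration:

import org.apache.spark.sql.SparkSession

// local[*] is only an assumption so the sketch can run standalone
val spark = SparkSession.builder.master("local[*]").getOrCreate()

val df = spark.read.parquet("src/main/resources/peopleTwo.parquet")
df.printSchema

// createOrReplaceTempView is the Spark 2.x replacement for registerTempTable
df.createOrReplaceTempView("people")
spark.sql("select * from people").show()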

Solution 2

With plain SQL

JSON, ORC, Parquet, and CSV files can be queried directly, without first creating a table or a DataFrame.

// Spark 2.x; the same works with sqlContext as well
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.master("set_the_master").getOrCreate

spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`")
   .show()
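
The same format.`path` pattern works for the other file types listed above. A hedged sketch with placeholder paths (without an explicit schema, CSV columns come back as strings, as noted in the comments below):

spark.sql("select * from csv.`hdfs://my_hdfs_path/my_file.csv`").show()
spark.sql("select * from json.`hdfs://my_hdfs_path/my_file.json`").show()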

Solution 3

Suppose that you have the parquet file ventas4 in HDFS:

hdfs://localhost:9000/sistgestion/sql/ventas4

In this case, the steps are:

  1. Create the SQLContext:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    
  2. Read the parquet File:

    val ventas=sqlContext.read.parquet("hdfs://localhost:9000/sistgestion/sql/ventas4")
    
  3. Register a temporary table:

    ventas.registerTempTable("ventas")
    
  4. Execute the query (here you can use toJSON to get JSON output, or collect() to bring the rows to the driver):

    sqlContext.sql("select * from ventas").toJSON.foreach(println(_))
    
    sqlContext.sql("select * from ventas").collect().foreach(println(_))
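
A side note on collect() (see the comment exchange below): it pulls the whole result to the driver, so for large results a preview with show() or take() is usually safer. A small sketch of both:

    sqlContext.sql("select * from ventas").show(20)

    sqlContext.sql("select * from ventas").take(10).foreach(println)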
    



Comments

  • Edamame
    Edamame over 3 years

    I am migrating from Impala to SparkSQL, using the following code to read a table:

    my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')
    

    How do I invoke SparkSQL above, so it can return something like:

    'select col_A, col_B from my_table'
    
  • Edamame
    Edamame over 7 years
    Is collect necessary (or a good idea)? If the data is big, we don't want to collect everything to the driver.
  • bob
    bob over 7 years
    It's just an example of how SQL can be used. It depends on how you want to use it; you can change the query or use .take() to get the required data on the driver.
  • passionate
    passionate over 6 years
    I do see this error "File not found. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved." How do I resolve this?
  • passionate
    passionate over 6 years
    Doesn't help if I do spark.sqlContext().setConf("spark.sql.parquet.cacheMetadata", "false");
  • Cherry
    Cherry over 6 years
    Works! Just replace hdfs://my_hdfs_path/my_db.db/my_table with your file path. :)
  • MichaelChirico
    MichaelChirico over 6 years
    This is awesome, could you point to some further documentation on this type of behavior?
  • mrsrinivas
    mrsrinivas over 5 years
    I found it in the Spark code base on GitHub. Not sure about documentation for it.
  • Union find
    Union find almost 5 years
    How do you set the schema with this approach?
  • mrsrinivas
    mrsrinivas almost 5 years
    It's not possible to set the schema explicitly. If no schema is found, all columns are of String type during DataFrame creation.
  • Benjamin Du
    Benjamin Du almost 5 years
    Can we query a TSV file directly using Spark SQL?
  • aaronsteers
    aaronsteers over 4 years
    Thanks so much for posting this example. I notice this also works with CSV, but I can't find documentation on the functionality anywhere. For formats like CSV there are a couple of parameters, like headers and delimiters, and I don't know how to configure them or change the defaults.
  • THIS USER NEEDS HELP
    THIS USER NEEDS HELP over 3 years
    btw Symbol SQLContext is deprecated. Use SparkSession.builder instead
  • sg1234
    sg1234 almost 3 years
    This is great, thanks! Any idea how this can be adapted for parquet files split across sub-directories according to a partition (from Impala)? The field is not present in the actual parquet files since its value is in the format dir_name=value... Not sure how to include this into the dataframe or the sql query...
  • Ed Bighands
    Ed Bighands almost 3 years
    The question is about using SparkSQL. This answer seems to use the structured API, which is not in line with the ask. Also, I do not think that IntelliJ or any IDE is relevant here.
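
Regarding the "File not found ... REFRESH TABLE" error mentioned in the comments: a minimal sketch of the cache invalidation that message suggests, assuming Spark 2.x and placeholder table/path names:

spark.sql("REFRESH TABLE my_table")

// Or via the catalog API; refreshByPath targets path-based sources such as
// the parquet.`...` queries above
spark.catalog.refreshTable("my_table")
spark.catalog.refreshByPath("hdfs://my_hdfs_path/my_db.db/my_table")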