How to create an EXTERNAL Spark table from data in HDFS


To create a Spark external table, you must specify the "path" option of the DataFrameWriter. Something like this:

df.write.
  option("path", "hdfs://user/zeppelin/my_table").
  saveAsTable("my_table")

The problem, though, is that it will empty your HDFS path hdfs://user/zeppelin/my_table, deleting your existing files, and the write will then fail with an org.apache.spark.SparkException: Job aborted. This looks like a bug in the Spark API...

Anyway, the workaround (tested in Spark 2.3) is to create the external table from a Spark DDL statement. If your table has many columns, writing the DDL by hand can be a hassle. Fortunately, starting with Spark 2.0, you can run SHOW CREATE TABLE to let Spark do the hard work. The catch is that SHOW CREATE TABLE only works on a table that is already persisted.

If the table is pretty big, I recommend taking a small sample of it, persisting the sample to another location, and then getting the DDL from that. Something like this:

// Create a sample of the table 
val df = spark.read.parquet("hdfs://user/zeppelin/my_table")
df.limit(1).write.
    option("path", "/user/zeppelin/my_table_tmp").
    saveAsTable("my_table_tmp")

// Now get the DDL, do not truncate output
spark.sql("SHOW CREATE TABLE my_table_tmp").show(1, false)

You are going to get a DDL like:

CREATE TABLE `my_table_tmp` (`ID` INT, `Descr` STRING)
USING parquet
OPTIONS (
  `serialization.format` '1',
  path 'hdfs:///user/zeppelin/my_table_tmp')

You would want to change this to use the original table name and the path to the original data. You can now run the following to create the Spark external table pointing to your existing HDFS data:

spark.sql("""
  CREATE TABLE `my_table` (`ID` INT, `Descr` STRING)
  USING parquet
  OPTIONS (
    `serialization.format` '1',
    path 'hdfs:///user/zeppelin/my_table')""")
Author: Gabriel Avellaneda

Updated on June 27, 2022

Comments

  • Gabriel Avellaneda (almost 2 years ago)

    I have loaded a parquet table from HDFS into a DataFrame:

    val df = spark.read.parquet("hdfs://user/zeppelin/my_table")

    I now want to expose this table to Spark SQL, but it must be a persistent table because I want to access it from a JDBC connection or from other Spark sessions.

    A quick way could be to call the df.write.saveAsTable method, but in this case it will materialize the contents of the DataFrame and create a pointer to the data in the Hive metastore, producing another copy of the data in HDFS.

    I don't want two copies of the same data, so I would like to create something like an external table pointing to the existing data.

  • y2k-shubham (over 5 years ago)
    When I tried specifying the "path" option (in Overwrite mode), I didn't get any SparkException, but the schema of the resulting Hive EXTERNAL table was incorrect: it collapsed all columns into a single col array<string>
  • Gabriel Avellaneda (over 5 years ago)
    @y2k-shubham Are you using Parquet or CSV format? If you are using CSV, it is probably a problem with the field separator
  • y2k-shubham (over 5 years ago)
    @Gabriel Avellaneda I was outputting it in ORC format. Looks like it wasn't just me. By the way, the straightforward solution was to run a Hive DROP TABLE.. command followed by CREATE EXTERNAL TABLE.. with the proper schema.
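
For reference, the Hive-DDL route mentioned in the last comment would look roughly like the sketch below when run through a Hive-enabled Spark session. The table name, columns, and location are illustrative assumptions; only the ORC format comes from the comment:

// Hypothetical table name, schema, and location, for illustration only
spark.sql("DROP TABLE IF EXISTS my_orc_table")
spark.sql("""
  CREATE EXTERNAL TABLE my_orc_table (`ID` INT, `Descr` STRING)
  STORED AS ORC
  LOCATION 'hdfs:///user/zeppelin/my_orc_table'""")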