Insert into Hive table from Spark SQL


It always just seems to show up in the order written...

You are right: here Spark behaves just like any SQL database would. The column names in the input dataset make no difference on their own.
And since you do not explicitly map the output columns to the input columns, Spark has to assume that the mapping is done by position.

Just meditate over the following test case...

hiveContext.sql("create temporary table TestTable (RunId string, Test1 string, Test2 string)")
hiveContext.sql("insert into table TestTable select 'A', 'x1', 'y1'")
hiveContext.sql("insert into table TestTable (RunId, Test1, Test2) select 'B', 'x2' as Blurb, 'y2' as Test1")
hiveContext.sql("insert into table TestTable (RunId, Test2, Test1) select 'C', 'x3' as Blurb, 'y3' as Test1")
data = hiveContext.sql("select 'xxx' as Test1, 'yyy' as Test2")
data.registerTempTable("Dummy")
hiveContext.sql("insert into table TestTable(Test1, RunId, Test2) select Test1, 'D', Test2 from Dummy")
hiveContext.sql("insert into table TestTable select Test1, 'E', Test2 from Dummy")
hiveContext.sql("select * from TestTable").show(20)

Disclaimer: I did not actually test these commands, so there may be a couple of typos and syntax issues in there (especially since you do not mention your Hive and Spark versions), but you should see the point.

Author: Luffen

Updated on June 13, 2022

Comments

  • Luffen, almost 2 years

    I am reading in some data from a json file and converting it to a string that I use to send my data to hive.

    The data is arriving fine in Hive, but it gets distributed into the wrong columns. I have made a small example.

    In Hive:

    Table name = TestTable, Column1 = test1, Column2 = test2
    

    My code:

    data = hiveContext.sql("select \"hej\" as test1, \"med\" as test2")
    data.write.mode("append").saveAsTable("TestTable")
    
    data = hiveContext.sql("select \"hej\" as test2, \"med\" as test1")
    data.write.mode("append").saveAsTable("TestTable")
    

    This results in "hej" showing up in test1 both times and "med" showing up in test2 both times, instead of one value landing in each column.

    The values always just seem to show up in the order they are written, rather than going into the columns I specify with the 'as' keyword.

    Any one have any ideas?

  • Luffen, over 7 years
    Hi and thanks for your answer. I am not sure I fully understand "since you do not explicitly map the output column to the input columns, Spark has to assume that the mapping is done by position." How would I explicitly do that mapping? I thought that was what I did by saying 'as test2'.
  • Samson Scharfrichter, over 7 years
    The Spark DataFrame has a specific "source" schema. The Hive table has a specific "target" schema. When using regular SQL with INSERT...SELECT the schema reconciliation is either explicit (c/o list of target columns in order, vs. source columns in order) or implicit (c/o positions of target and source columns). When using Spark API, well, Spark has to work exactly the same way, otherwise it would break the compatibility. But you cannot do the explicit mapping, so you are screwed. Bottom line, use SQL...
  • ROOT, about 6 years
    @SamsonScharfrichter, can you please update your answer with update and delete commands too?