How to join two data frames in Apache Spark and merge keys into one column?

21,242

Solution 1

You can use the equi-join syntax in Scala:

  val output = sales_df.join(target_df, Seq("user_id"), joinType = "outer")

The equivalent call in Python (PySpark) is:

  output = sales_df.join(target_df, ['user_id'], "outer")

Solution 2

You need to perform an outer equi-join:

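# Build the two example data frames (sqlContext is assumed to exist already).
data1 = [['a', 1100], ['b', 2100], ['c', 3300], ['d', 4400]]
sales = sqlContext.createDataFrame(data1, ['user_id', 'total_sale'])
data2 = [['b', 1000], ['c', 2000], ['d', 3000], ['e', 4000]]
target = sqlContext.createDataFrame(data2, ['user_id', 'personalized_target'])

# Joining on the column name keeps a single user_id column in the result.
# Row order in the output is not guaranteed.
sales.join(target, 'user_id', "outer").show()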
# +-------+----------+-------------------+
# |user_id|total_sale|personalized_target|
# +-------+----------+-------------------+
# |      e|      null|               4000|
# |      d|      4400|               3000|
# |      c|      3300|               2000|
# |      b|      2100|               1000|
# |      a|      1100|               null|
# +-------+----------+-------------------+
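
To see why the outer equi-join produces a single merged user_id column, here is a rough sketch in plain Python, without Spark. The helper name full_outer_join_on_key is made up for illustration, and the sketch assumes each user_id appears at most once per side; None stands in for Spark's null.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, Optional[int], Optional[int]]


def full_outer_join_on_key(left: Dict[str, int], right: Dict[str, int]) -> List[Row]:
    """Emulate a full outer equi-join of two keyed tables, emitting the key once.

    Hypothetical helper for illustration only; assumes unique keys on each side.
    """
    rows: List[Row] = []
    # "Outer" means every key from either side appears in the result.
    # Sorting only makes the output order stable for this sketch.
    for key in sorted(left.keys() | right.keys()):
        # dict.get returns None when the key is missing on that side.
        rows.append((key, left.get(key), right.get(key)))
    return rows


sales = {"a": 1100, "b": 2100, "c": 3300, "d": 4400}
target = {"b": 1000, "c": 2000, "d": 3000, "e": 4000}

joined = full_outer_join_on_key(sales, target)
for row in joined:
    print(row)
```

Keys found on only one side ('a' and 'e' here) get None in the other side's column, which matches the null cells in the Spark output above.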
Author: chessosapiens

Updated on January 10, 2020

Comments

  • chessosapiens (over 4 years)

    I have the following two Spark data frames:

    sale_df:
    
    +-------+----------+
    |user_id|total_sale|
    +-------+----------+
    |      a|      1100|
    |      b|      2100|
    |      c|      3300|
    |      d|      4400|
    +-------+----------+
    

    and target_df:

    +-------+-------------------+
    |user_id|personalized_target|
    +-------+-------------------+
    |      b|               1000|
    |      c|               2000|
    |      d|               3000|
    |      e|               4000|
    +-------+-------------------+
    

    How can I join them so that the output is:

    user_id   total_sale   personalized_target
     a           1100            NA
     b           2100            1000
     c           3300            2000
     d           4400            3000
     e           NA              4000
    

    I have tried almost all the join types, but it seems that a single join cannot produce the desired output.

    A solution using PySpark, SQL, or HiveContext would help.