How to join two data frames in Apache Spark and merge keys into one column?

apache-spark dataframe join pyspark apache-spark-sql

21,242

Solution 1

You can use the equi-join syntax in Scala:

  val output = sales_df.join(target_df,Seq("user_id"),joinType="outer")

The equivalent call in Python should be the following (check that it works on your Spark version):

   output = sales_df.join(target_df,['user_id'],"outer")

Solution 2

You need to perform an outer equi-join:

data1 = [['a', 1100], ['b', 2100], ['c', 3300], ['d', 4400]]
sales = sqlContext.createDataFrame(data1,['user_id','total_sale'])
data2 = [['b', 1000],['c',2000],['d',3000],['e',4000]]
target = sqlContext.createDataFrame(data2,['user_id','personalized_target'])

sales.join(target, 'user_id', "outer").show()
# +-------+----------+-------------------+
# |user_id|total_sale|personalized_target|
# +-------+----------+-------------------+
# |      e|      null|               4000|
# |      d|      4400|               3000|
# |      c|      3300|               2000|
# |      b|      2100|               1000|
# |      a|      1100|               null|
# +-------+----------+-------------------+


Author by

chessosapiens

Updated on January 10, 2020

Comments

chessosapiens over 4 years

I have the following two Spark data frames:

sale_df:

+-------+----------+
|user_id|total_sale|
+-------+----------+
|      a|      1100|
|      b|      2100|
|      c|      3300|
|      d|      4400|
+-------+----------+

and target_df:

+-------+-------------------+
|user_id|personalized_target|
+-------+-------------------+
|      b|               1000|
|      c|               2000|
|      d|               3000|
|      e|               4000|
+-------+-------------------+

How can I join them so that the output is:

user_id   total_sale   personalized_target
 a           1100            NA
 b           2100            1000
 c           3300            2000
 d           4400            3000
 e           NA              4000

I have tried almost all the join types, but it seems that a single join cannot produce the desired output.

Any PySpark, SQL, or HiveContext solution would help.