How to join two data frames in Apache Spark and merge keys into one column?
Solution 1
You can use the equi-join syntax in Scala, passing the join key as a sequence of column names so that it appears only once in the result:
val output = sales_df.join(target_df,Seq("user_id"),joinType="outer")
The same join works in Python, with the key passed as a list:
output = sales_df.join(target_df,['user_id'],"outer")
Solution 2
You need to perform an outer equi-join:
data1 = [['a', 1100], ['b', 2100], ['c', 3300], ['d', 4400]]
sales = sqlContext.createDataFrame(data1,['user_id','total_sale'])
data2 = [['b', 1000],['c',2000],['d',3000],['e',4000]]
target = sqlContext.createDataFrame(data2,['user_id','personalized_target'])
sales.join(target, 'user_id', "outer").show()
# +-------+----------+-------------------+
# |user_id|total_sale|personalized_target|
# +-------+----------+-------------------+
# |      e|      null|               4000|
# |      d|      4400|               3000|
# |      c|      3300|               2000|
# |      b|      2100|               1000|
# |      a|      1100|               null|
# +-------+----------+-------------------+
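To make the semantics of the outer equi-join concrete, here is a minimal plain-Python sketch that needs no Spark installation. It assumes each user_id occurs at most once per table, as in the example data; `outer_join` is a hypothetical helper written for illustration, not part of the Spark API. Keys are sorted only for readability, since Spark itself does not guarantee row order.

```python
# Plain-Python sketch of a full outer equi-join on a shared key (no Spark needed).
# Assumption: each user_id appears at most once per table, as in the example data.
sales = {"a": 1100, "b": 2100, "c": 3300, "d": 4400}    # user_id -> total_sale
target = {"b": 1000, "c": 2000, "d": 3000, "e": 4000}   # user_id -> personalized_target


def outer_join(left, right):
    """Hypothetical helper: one row per key found in either input.

    The key is emitted once (a single merged user_id column), and a value
    missing on one side becomes None, mirroring Spark's null.
    """
    keys = sorted(set(left) | set(right))  # sorted for readability only
    return [(k, left.get(k), right.get(k)) for k in keys]


rows = outer_join(sales, target)
for user_id, total_sale, personalized_target in rows:
    print(user_id, total_sale, personalized_target)
```

The union of the two key sets is what distinguishes the outer join from an inner join, which would keep only b, c, and d.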
Author by
chessosapiens
Updated on January 10, 2020
I have the following two Spark data frames:
sale_df:
+-------+----------+
|user_id|total_sale|
+-------+----------+
|      a|      1100|
|      b|      2100|
|      c|      3300|
|      d|      4400|
+-------+----------+
and target_df:
+-------+-------------------+
|user_id|personalized_target|
+-------+-------------------+
|      b|               1000|
|      c|               2000|
|      d|               3000|
|      e|               4000|
+-------+-------------------+
How can I join them so that the output is:
user_id  total_sale  personalized_target
a        1100        NA
b        2100        1000
c        3300        2000
d        4400        3000
e        NA          4000
I have tried almost all the join types, but it seems that a single join cannot produce the desired output.
Any solution using PySpark, SQL, or HiveContext would help.