Outer join Spark dataframe with non-identical join column and then merge join column


Solution 1

You can use the coalesce function, which returns its first non-null argument.

from pyspark.sql.functions import coalesce

df1 = df1.alias("df1")
df2 = df2.alias("df2")

(df1.join(df2, df1.name == df2.name, 'outer')
  .withColumn("name_", coalesce("df1.name", "df2.name"))  # first non-null of the two name columns
  .drop("name")                                           # drops both original name columns
  .withColumnRenamed("name_", "name"))
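
For reference, a minimal sketch of what this produces with the sample dataframes from the question (row order is not guaranteed):

(df1.join(df2, df1.name == df2.name, 'outer')
  .withColumn("name_", coalesce("df1.name", "df2.name"))
  .drop("name")
  .withColumnRenamed("name_", "name")
  .show())
# +----+------+-----+
# | age|weight| name|
# +----+------+-----+
# |  50|   150| john|
# |  25|  null|james|
# |null|   115| mike|
# +----+------+-----+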

Solution 2

This is a little late, but there is a simpler solution if someone needs it: just a small change to the original poster's join:

df1.join(df2, 'name', 'outer')
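
Passing the join key as a column name (rather than a column expression) makes Spark perform an equi-join that keeps a single name column in the output; for an outer join, the values from both sides are already coalesced, so no manual merging is needed.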

Solution 3

df3 = df1.join(df2, ['name'], 'outer')

Joining on a list of column names prevents duplication of the name column. https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
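
As a quick check, a sketch of the result with the question's sample data (row order not guaranteed):

df3 = df1.join(df2, ['name'], 'outer')
df3.show()
# +-----+----+------+
# | name| age|weight|
# +-----+----+------+
# | john|  50|   150|
# |james|  25|  null|
# | mike|null|   115|
# +-----+----+------+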


Comments

  • plam, almost 2 years ago

    Suppose I have the following dataframes in PySpark:

    from pyspark.sql import Row

    df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
    df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
    df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])
    

    Now suppose I want to create df3 from joining/merging df1 and df2.

    I tried doing

    df1.join(df2, df1.name == df2.name, 'outer')
    

    This doesn't quite work because it produces two name columns. I then need to somehow combine the two name columns so that a missing name in one column is filled in from the other.
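
    For illustration, the naive join gives something like this (a sketch; the exact display and column order depend on the Spark version):

    df1.join(df2, df1.name == df2.name, 'outer').show()
    # +----+-----+----+------+
    # | age| name|name|weight|
    # +----+-----+----+------+
    # |  25|james|null|  null|
    # |  50| john|john|   150|
    # |null| null|mike|   115|
    # +----+-----+----+------+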

    How would I do that? Or is there a better way to create df3 from df1 and df2?