Outer join Spark dataframe with non-identical join column and then merge join column


Solution 1

You can use the coalesce function, which returns its first non-null argument.

from pyspark.sql.functions import coalesce

df1 = df1.alias("df1")
df2 = df2.alias("df2")

(df1.join(df2, df1.name == df2.name, 'outer')
  .withColumn("name_", coalesce("df1.name", "df2.name"))  # first non-null of the two name columns
  .drop("name")                                           # drops both original name columns
  .withColumnRenamed("name_", "name"))
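
For reference, a minimal sketch of what this produces with the sample dataframes from the question (row order is not guaranteed):

(df1.join(df2, df1.name == df2.name, 'outer')
  .withColumn("name_", coalesce("df1.name", "df2.name"))
  .drop("name")
  .withColumnRenamed("name_", "name")
  .show())
# +----+------+-----+
# | age|weight| name|
# +----+------+-----+
# |  50|   150| john|
# |  25|  null|james|
# |null|   115| mike|
# +----+------+-----+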

Solution 2

This is a little late, but there is a simpler solution if someone needs it: just a small change to the original poster's join:

df1.join(df2, 'name', 'outer')
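
Passing the join key as a column name (rather than a column expression) makes Spark perform an equi-join that keeps a single name column in the output; for an outer join, the values from both sides are already coalesced, so no manual merging is needed.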

Solution 3

df3 = df1.join(df2, ['name'], 'outer')

Joining on a list of column names prevents duplication of the name column. https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
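
As a quick check, a sketch of the result with the question's sample data (row order not guaranteed):

df3 = df1.join(df2, ['name'], 'outer')
df3.show()
# +-----+----+------+
# | name| age|weight|
# +-----+----+------+
# | john|  50|   150|
# |james|  25|  null|
# | mike|null|   115|
# +-----+----+------+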


Comments

  • plam, almost 2 years ago

    Suppose I have the following dataframes in PySpark:

    from pyspark.sql import Row

    df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
    df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
    df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])
    

    Now suppose I want to create df3 from joining/merging df1 and df2.

    I tried doing

    df1.join(df2, df1.name == df2.name, 'outer')
    

    This doesn't quite work because it produces two name columns. I then need to somehow combine the two name columns so that a missing name in one column is filled in from the other.
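
    For illustration, the naive join gives something like this (a sketch; the exact display and column order depend on the Spark version):

    df1.join(df2, df1.name == df2.name, 'outer').show()
    # +----+-----+----+------+
    # | age| name|name|weight|
    # +----+-----+----+------+
    # |  25|james|null|  null|
    # |  50| john|john|   150|
    # |null| null|mike|   115|
    # +----+-----+----+------+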

    How would I do that? Or is there a better way to create df3 from df1 and df2?