Join two data frames, select all columns from one and some columns from the other

Solution 1

Not sure if the most efficient way, but this worked for me:

from pyspark.sql.functions import col

df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')) \
   .select([col('a.' + xx) for xx in df1.columns] + [col('b.other1'), col('b.other2')])

The trick is in:

[col('a.' + xx) for xx in df1.columns]: all columns of df1 (the alias 'a' is only a string, so the column list must still come from the df1 variable itself)

[col('b.other1'), col('b.other2')]: some columns of df2
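
Since the question mentions having to wrap this in a function, the same trick generalizes into a small helper. This is a minimal sketch; join_keep_left, on, and extra_cols are hypothetical names:

from pyspark.sql.functions import col

def join_keep_left(left, right, on, extra_cols):
    # keep every column of `left`, plus the listed columns of `right`
    joined = left.alias('a').join(right.alias('b'), col('a.' + on) == col('b.' + on))
    return joined.select([col('a.' + c) for c in left.columns] +
                         [col('b.' + c) for c in extra_cols])

result = join_keep_left(df1, df2, on='id', extra_cols=['other'])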

Solution 2

The asterisk (*) works with an alias. For example:

df1 = df1.alias('df1')
df2 = df2.alias('df2')

df1.join(df2, df1.id == df2.id).select('df1.*')
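
As a commenter below notes, the aliased asterisk combines with named columns, so the exact query from the question becomes:

df1.join(df2, df1.id == df2.id).select('df1.*', 'df2.other')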

Solution 3

Without using an alias:

df1.join(df2, df1.id == df2.id).select(df1["*"], df2["other"])
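
If you need several columns from df2, as asked in the comments below, this bracket syntax also takes a list unpacked into select(); the names wanted, other1, and other2 are placeholders here:

wanted = ['other1', 'other2']  # columns to pull from df2
df1.join(df2, df1.id == df2.id).select(df1["*"], *[df2[c] for c in wanted])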

Solution 4

Here is a solution that does not require a SQL context, but maintains the metadata of a DataFrame.

a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])

c = a.join(b, a.a_id == b.b_id)

Then, c.show() yields:

+----+-----+-----+----+
|a_id|extra|other|b_id|
+----+-----+-----+----+
|   a|  foo|   p1|   a|
|   b|  hem|   p2|   b|
|   c|  haw|   p3|   c|
+----+-----+-----+----+
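
As a commenter points out, the join alone keeps every column from both sides. To mirror the question, follow it with a select, e.g. using the bracket syntax from Solution 3:

c.select(a["*"], b["other"]).show()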

Solution 5

I believe that this would be the easiest and most intuitive way:

final = (df1.alias('df1').join(df2.alias('df2'),
                               on=df1['id'] == df2['id'],
                               how='inner')
                         .select('df1.*',
                                 'df2.other')
)

Comments

  • Admin, almost 2 years

    Let's say I have a Spark data frame df1, with several columns (among which the column id), and a data frame df2 with two columns, id and other.

    Is there a way to replicate the following command:

    sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")
    

    by using only pyspark functions such as join(), select() and the like?

    I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.

  • Admin, about 8 years
    My question is exactly how to select all columns from one data frame (without enumerating them one by one) and one column from the other.
  • void, over 6 years
    In spark2, I had to change this to col('b.id') == col('a.id') (with two equals signs). Otherwise, it gives me a 'SyntaxError: keyword can't be an expression' exception
  • Andre Odendaal, about 6 years
    Perfect, full solution: df1.join(df2, df1.id == df2.id).select('df1.*', 'df2.other')
  • Viv, about 5 years
    Well, the OP asked for selecting only a few columns, i.e. filtering, but this answer keeps all the columns after the join.
  • cozek, over 4 years
    Changing the name of the variable should be obvious.
  • lampShadesDrifter, over 4 years
    I notice that when joined dataframes have same-named column names, doing df1["*"] in the select method correctly gets the columns from that dataframe even if df2 had columns with some of the same names as df1. Would you mind explaining (or linking to docs on) how this works?
  • Manu Sharma, almost 4 years
    Hi, how can I pass multiple columns as a list, instead of individual cols like [col('b.other1'), col('b.other2')], for the df2 dataset?
  • Sheldore, about 3 years
    You wrote df1 = df1.alias('df1') and df2 = df2.alias('df2'). What is the purpose here? You are renaming df1 as df1. Isn't this useless?
  • hui chen, about 2 years
    Somehow this approach doesn't work on Spark 3 for me.