Join two data frames, select all columns from one and some columns from the other
Solution 1
Not sure if it's the most efficient way, but this worked for me:
from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')) \
   .select([col('a.' + xx) for xx in df1.columns] + [col('b.other1'), col('b.other2')])
The trick is in:
[col('a.' + xx) for xx in df1.columns] : all columns of df1 (aliased as a)
[col('b.other1'), col('b.other2')] : some columns of df2 (aliased as b)
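A minimal sketch of the same pattern wrapped in a reusable function, which matches the asker's requirement of not passing a sqlContext around (the name join_keep_left and the cols_from_right parameter are illustrative, not part of the original answer):
from pyspark.sql.functions import col

def join_keep_left(left, right, on='id', cols_from_right=()):
    # Alias both frames so their columns can be addressed unambiguously after the join.
    joined = left.alias('a').join(right.alias('b'), col('a.' + on) == col('b.' + on))
    # All columns of the left frame, plus the requested columns of the right frame.
    return joined.select([col('a.' + c) for c in left.columns] +
                         [col('b.' + c) for c in cols_from_right])

result = join_keep_left(df1, df2, on='id', cols_from_right=['other1', 'other2'])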
Solution 2
The asterisk (*) works with an alias. For example:
from pyspark.sql.functions import *
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df1.join(df2, df1.id == df2.id).select('df1.*')
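To match the question exactly (everything from df1 plus one column of df2), the same alias-based select can simply list both, for example:
df1.join(df2, df1.id == df2.id).select('df1.*', 'df2.other')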
Solution 3
Without using an alias:
df1.join(df2, df1.id == df2.id).select(df1["*"], df2["other"])
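Because df1["*"] and df2["other"] here are Column references bound to their parent DataFrames rather than plain strings, the select stays unambiguous even when both frames share column names. A small self-contained sketch, assuming a SparkSession named spark and two toy frames x and y that both have id and other columns:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
x = spark.createDataFrame([(1, 'x1'), (2, 'x2')], ['id', 'other'])  # both frames share 'id' and 'other'
y = spark.createDataFrame([(1, 'y1'), (2, 'y2')], ['id', 'other'])
# Keeps all of x's columns plus y's 'other', without an ambiguity error.
x.join(y, x.id == y.id).select(x["*"], y["other"]).show()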
Solution 4
Here is a solution that does not require a SQL context, but maintains the metadata of a DataFrame.
a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])
c = a.join(b, a.a_id == b.b_id)
Then c.show() yields:
+----+-----+-----+----+
|a_id|extra|other|b_id|
+----+-----+-----+----+
| a| foo| p1| a|
| b| hem| p2| b|
| c| haw| p3| c|
+----+-----+-----+----+
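Since the joined frame c keeps every column of both inputs, a follow-up select narrows it to what the question asks for (all of a plus other from b); this is just the pattern from the earlier answers applied to this example:
c.select(a["*"], b["other"]).show()
which leaves only a_id, extra and other.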
Solution 5
I believe that this would be the easiest and most intuitive way:
final = (df1.alias('df1')
         .join(df2.alias('df2'), on=df1['id'] == df2['id'], how='inner')
         .select('df1.*', 'df2.other'))
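If the columns wanted from df2 are only known as a Python list, the same select can build the qualified names and unpack them; the list below is illustrative:
wanted = ['other']                      # columns to take from df2
final = (df1.alias('df1')
         .join(df2.alias('df2'), on=df1['id'] == df2['id'], how='inner')
         .select('df1.*', *['df2.' + c for c in wanted]))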
Author: Admin
Updated on July 08, 2022

Comments
-
Admin almost 2 years
Let's say I have a Spark data frame df1, with several columns (among which the column id), and a data frame df2 with two columns, id and other.
Is there a way to replicate the following command:
sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")
by using only pyspark functions such as join(), select() and the like?
I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.
-
Admin about 8 years
My question is exactly how to select all columns from one data frame (without enumerating them one by one) and one column from the other.
-
void over 6 years
In Spark 2, I had to change this to col('b.id') == col('a.id') (with two equals signs). Otherwise, it gives me a "SyntaxError: keyword can't be an expression" exception.
-
Andre Odendaal about 6 years
Perfect -- full solution: df1.join(df2, df1.id == df2.id).select('df1.*', 'df2.other')
-
Viv about 5 years
Well, the OP has asked for selecting only a few columns, i.e. filtering, but this answer keeps all the columns after the join.
-
cozek over 4 years
Changing the name of the variable should be obvious.
-
lampShadesDrifter over 4 years
I notice that when the joined dataframes have same-named columns, doing df1["*"] in the select method correctly gets the columns from that dataframe, even if df2 had columns with some of the same names as df1. Would you mind explaining (or linking to docs on) how this works?
-
Manu Sharma almost 4 years
Hi, how can I pass multiple columns as a list, instead of individual cols like [col('b.other1'), col('b.other2')], for the df2 dataset?
-
Sheldore about 3 years
You wrote df1 = df1.alias('df1') and df2 = df2.alias('df2'). What is the purpose here? You are renaming df1 as df1. Isn't this useless?
-
stormfield almost 3 years
@Sheldore see stackoverflow.com/a/46358218/1552998
-
hui chen about 2 years
Somehow this approach doesn't work on Spark 3 for me.