Join two data frames, select all columns from one and some columns from the other

Solution 1

Not sure if the most efficient way, but this worked for me:

from pyspark.sql.functions import col

df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')) \
   .select([col('a.' + xx) for xx in df1.columns] + [col('b.other1'), col('b.other2')])

The trick is in:

[col('a.' + xx) for xx in df1.columns]: all columns of df1 (the alias 'a' is only a string, so the column list must still come from the df1 variable itself)

[col('b.other1'), col('b.other2')]: some columns of df2
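
Since the question mentions having to wrap this in a function, the same trick generalizes into a small helper. This is a minimal sketch; join_keep_left, on, and extra_cols are hypothetical names:

from pyspark.sql.functions import col

def join_keep_left(left, right, on, extra_cols):
    # keep every column of `left`, plus the listed columns of `right`
    joined = left.alias('a').join(right.alias('b'), col('a.' + on) == col('b.' + on))
    return joined.select([col('a.' + c) for c in left.columns] +
                         [col('b.' + c) for c in extra_cols])

result = join_keep_left(df1, df2, on='id', extra_cols=['other'])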

Solution 2

The asterisk (*) works with an alias. For example:

df1 = df1.alias('df1')
df2 = df2.alias('df2')

df1.join(df2, df1.id == df2.id).select('df1.*')
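
As a commenter below notes, the aliased asterisk combines with named columns, so the exact query from the question becomes:

df1.join(df2, df1.id == df2.id).select('df1.*', 'df2.other')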

Solution 3

Without using an alias:

df1.join(df2, df1.id == df2.id).select(df1["*"], df2["other"])
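
If you need several columns from df2, as asked in the comments below, this bracket syntax also takes a list unpacked into select(); the names wanted, other1, and other2 are placeholders here:

wanted = ['other1', 'other2']  # columns to pull from df2
df1.join(df2, df1.id == df2.id).select(df1["*"], *[df2[c] for c in wanted])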

Solution 4

Here is a solution that does not require a SQL context, but maintains the metadata of a DataFrame.

a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])

c = a.join(b, a.a_id == b.b_id)

Then, c.show() yields:

+----+-----+-----+----+
|a_id|extra|other|b_id|
+----+-----+-----+----+
|   a|  foo|   p1|   a|
|   b|  hem|   p2|   b|
|   c|  haw|   p3|   c|
+----+-----+-----+----+
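
As a commenter points out, the join alone keeps every column from both sides. To mirror the question, follow it with a select, e.g. using the bracket syntax from Solution 3:

c.select(a["*"], b["other"]).show()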

Solution 5

I believe that this would be the easiest and most intuitive way:

final = (df1.alias('df1').join(df2.alias('df2'),
                               on=df1['id'] == df2['id'],
                               how='inner')
                         .select('df1.*',
                                 'df2.other')
)

Comments

  • Admin, almost 2 years

    Let's say I have a Spark data frame df1, with several columns (among which the column id), and a data frame df2 with two columns, id and other.

    Is there a way to replicate the following command:

    sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")
    

    by using only pyspark functions such as join(), select() and the like?

    I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.

  • Admin, about 8 years
    My question is exactly how to select all columns from one data frame (without enumerating them one by one) and one column from the other.
  • void, over 6 years
    In spark2, I had to change this to col('b.id') == col('a.id') (with two equals signs). Otherwise, it gives me a 'SyntaxError: keyword can't be an expression' exception
  • Andre Odendaal, about 6 years
    Perfect, full solution: df1.join(df2, df1.id == df2.id).select('df1.*', 'df2.other')
  • Viv, about 5 years
    Well, the OP asked for selecting only a few columns, i.e. filtering, but this answer keeps all the columns after the join.
  • cozek, over 4 years
    Changing the name of the variable should be obvious.
  • lampShadesDrifter, over 4 years
    I notice that when joined dataframes have same-named column names, doing df1["*"] in the select method correctly gets the columns from that dataframe even if df2 had columns with some of the same names as df1. Would you mind explaining (or linking to docs on) how this works?
  • Manu Sharma, almost 4 years
    Hi, how can I pass multiple columns as a list, instead of individual cols like [col('b.other1'), col('b.other2')], for the df2 dataset?
  • Sheldore, about 3 years
    You wrote df1 = df1.alias('df1') and df2 = df2.alias('df2'). What is the purpose here? You are renaming df1 as df1. Isn't this useless?
  • hui chen, about 2 years
    Somehow this approach doesn't work on Spark 3 for me.