Spark Join of 2 dataframes which have 2 different column names in list
Solution 1
If you expect two lists of strings:
val leftOn = Seq("col_a", "col_b")
val rightOn = Seq("col_x", "col_y")
Just zip and reduce:
import org.apache.spark.sql.functions.col
val on = leftOn.zip(rightOn)
.map { case (x, y) => df1(x) <=> df2(y) }
.reduce(_ and _)
df1.join(df2, on, "left")
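Putting Solution 1 together, a minimal end-to-end sketch (the SparkSession setup and the sample column names are illustrative assumptions, not part of the original answer):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-on-two-key-lists")
  .getOrCreate()
import spark.implicits._

val df1 = Seq((1, "a"), (2, "b")).toDF("col_a", "col_b")
val df2 = Seq((1, "a"), (3, "c")).toDF("col_x", "col_y")

val leftOn  = Seq("col_a", "col_b")
val rightOn = Seq("col_x", "col_y")

// Pair the key lists, turn each pair into a comparison, and AND them together.
// <=> is null-safe equality; use === instead if nulls should never match.
val on = leftOn.zip(rightOn)
  .map { case (l, r) => df1(l) <=> df2(r) }
  .reduce(_ and _)

df1.join(df2, on, "left").show()
```

Note that with `<=>`, a row where both sides hold null in a key column still matches, which is usually what you want when mimicking pandas' merge semantics.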
Solution 2
You can easily define such a method yourself:
def merge(left: DataFrame, right: DataFrame, left_on: Seq[String], right_on: Seq[String], how: String) = {
import org.apache.spark.sql.functions.lit
val joinExpr = left_on.zip(right_on).foldLeft(lit(true)) { case (acc, (lkey, rkey)) => acc and (left(lkey) === right(rkey)) }
left.join(right, joinExpr, how)
}
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark in scope
val df1 = Seq((1, "a")).toDF("id1", "n1")
val df2 = Seq((1, "a")).toDF("id2", "n2")
val joindf = merge(df1, df2, left_on = Seq("id1", "n1"), right_on = Seq("id2", "n2"), how = "left")
Comments
Zzrot almost 2 years
Is there a way to join two Spark DataFrames with different column names via 2 lists? I know that if they had the same names in a list I could do the following:
val joindf = df1.join(df2, Seq("col_a", "col_b"), "left")
or if I knew the different column names I could do this:
df1.join(df2, df1("col_a") <=> df2("col_x") && df1("col_b") <=> df2("col_y"), "left")
Since my method is expecting inputs of 2 lists which specify which columns are to be used for the join for each DF, I was wondering if Scala Spark had a way of doing this? P.S. I'm looking for something like Python's pandas merge:
joindf = pd.merge(df1, df2, left_on = list1, right_on = list2, how = 'left')