How to join datasets with same columns and select one?


Solution 1

Which Spark version are you using? Can you post a sample of your dataframes? Try this:

val d2prim = d2.withColumnRenamed("columnName", "d2_columnName")
d1.join(d2prim, Seq("id"), "left_outer").select("columnName")
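Put together, a minimal sketch of this approach (assuming two dataframes that share `id` and `columnName`, as in the other answers, and a running `SparkSession` named `spark`):

```scala
import org.apache.spark.sql.functions.lit

// Two small dataframes with the same column names (hypothetical data).
val d1 = spark.range(3).withColumn("columnName", lit("d1"))
val d2 = spark.range(3).withColumn("columnName", lit("d2"))

// Rename the clashing column in d2 before joining,
// so "columnName" is unambiguous after the join.
val d2prim = d2.withColumnRenamed("columnName", "d2_columnName")

d1.join(d2prim, Seq("id"), "left_outer")
  .select("columnName")        // d1's column
  // .select("d2_columnName")  // or d2's column, under its new name
```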

Solution 2

I have two dataframes

val d1 = spark.range(3).withColumn("columnName", lit("d1"))
scala> d1.printSchema
root
 |-- id: long (nullable = false)
 |-- columnName: string (nullable = false)

val d2 = spark.range(3).withColumn("columnName", lit("d2"))
scala> d2.printSchema
root
 |-- id: long (nullable = false)
 |-- columnName: string (nullable = false)

which I am joining and selecting afterwards. I want to select a specific column of one of the Dataframes. But the same column name exists in the other one.

val q1 = d1.as("d1")
  .join(d2.as("d2"), Seq("id"), "left")
  .select("d1.columnName")
scala> q1.show
+----------+
|columnName|
+----------+
|        d1|
|        d1|
|        d1|
+----------+

As you can see it just works.


So, why did it not work for you? Let's analyze each query.

// you started very well
d1.as("d1")
  // but here you used $ to reference a column to join on
  // with column references by their aliases
  // that won't work
  .join(d2.as("d2"), $"d1.id" === $"d2.id", "left")
  // same here
  // $ + aliased columns won't work
  .select($"d1.columnName")

PROTIP: Use d1("columnName") to reference a specific column in a dataframe.
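A sketch of that tip, assuming the `d1` and `d2` dataframes defined above. Referencing columns through their parent dataframes keeps both `columnName` columns addressable at once:

```scala
// d1("columnName") and d2("columnName") are dataframe-scoped,
// so both can appear in one select without ambiguity.
d1.join(d2, d1("id") === d2("id"), "left")
  .select(d1("columnName").as("from_d1"), d2("columnName").as("from_d2"))
```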

The other query was very close to being fine, but...

d1.join(d2, d1("id") === d2("id"), "left") // <-- so far so good!
  .select($"d1.columnName") // <-- that's the issue, i.e. $ + aliased column
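A sketch of the corrected second query, with the `$`-reference swapped for a dataframe-scoped column (assuming the `d1` and `d2` dataframes from above):

```scala
d1.join(d2, d1("id") === d2("id"), "left")
  .select(d1("columnName")) // dataframe-scoped reference resolves the ambiguity
```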

Solution 3

This happens because when Spark combines the columns from the two DataFrames, it doesn't do any automatic renaming for you. You just need to rename one of the columns before joining; Spark provides withColumnRenamed for this. After the join you can drop the renamed column.

val df2join = df2.withColumnRenamed("id", "join_id")
val joined = df1.join(df2join, $"id" === $"join_id", "left").drop("join_id")
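The same pattern applies to the ambiguous `columnName` from the question itself (a sketch, assuming the `d1`/`d2` dataframes from the other answers):

```scala
// Rename d2's copy of the clashing column before joining.
val d2renamed = d2.withColumnRenamed("columnName", "columnName_d2")
val joined = d1.join(d2renamed, Seq("id"), "left")

joined.select("columnName")     // unambiguous: only d1's column keeps that name
// joined.drop("columnName_d2") // or drop d2's copy entirely
```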
Author: Kratos

Updated on December 30, 2020

Comments

  • Kratos
    Kratos over 3 years

    I have two Spark dataframes which I am joining and selecting afterwards. I want to select a specific column of one of the Dataframes. But the same column name exists in the other one. Therefore I am getting an Exception for ambiguous column.

    I have tried this:

    d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left").select($"d1.columnName")
    

    and this:

    d1.join(d2, d1("id") === d2("id"), "left").select($"d1.columnName")
    

    but it does not work.