Where is the union() method on the Spark DataFrame class?


Is this intentional?

I think it is safe to assume that it is intentional. Other union operators, such as RDD.union and Dataset.union, keep duplicates as well.

If you think about it, it makes sense. An operation equivalent to UNION ALL is purely logical and requires no data access or network traffic, whereas finding distinct elements requires a shuffle and can therefore be quite expensive.

Is there a way to union two DataFrames without duplicates?

df1.unionAll(df2).distinct()
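
To make the difference between the two semantics concrete without a Spark installation, here is a minimal plain-Java sketch that uses java.util streams as a stand-in for DataFrames. It is only an analogy, not Spark's API: the class name UnionDemo and the helper methods unionAll and unionDistinct are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class; lists stand in for DataFrames.
class UnionDemo {
    // UNION ALL semantics: append the second input to the first, keeping duplicates.
    static <T> List<T> unionAll(List<T> left, List<T> right) {
        return Stream.concat(left.stream(), right.stream())
                     .collect(Collectors.toList());
    }

    // UNION semantics: the same concatenation, followed by de-duplication.
    // Every element has to be compared against the ones already seen,
    // which is the step that corresponds to a shuffle in a distributed setting.
    static <T> List<T> unionDistinct(List<T> left, List<T> right) {
        return Stream.concat(left.stream(), right.stream())
                     .distinct()
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 2);
        List<Integer> b = Arrays.asList(2, 3);
        System.out.println(unionAll(a, b));      // prints [1, 2, 2, 3]
        System.out.println(unionDistinct(a, b)); // prints [1, 2, 3]
    }
}
```

The first method mirrors df1.unionAll(df2) and the second mirrors df1.unionAll(df2).distinct(): the de-duplication is a separate, explicit step layered on top of the cheap concatenation.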
Author by

Milen Kovachev

Updated on July 05, 2022

Comments

  • Milen Kovachev almost 2 years

    I am using the Java connector for Spark and would like to union two DataFrames, but bizarrely the DataFrame class has only unionAll. Is this intentional, and is there a way to union two DataFrames without duplicates?