Spark join raises "Detected cartesian product for INNER join"
Solution 1
As described in Why does spark think this is a cross/cartesian join, it may be caused by the following:
This happens because you join structures sharing the same lineage, and that leads to a trivially equal join condition.
As for how the cartesian product is generated, you can refer to Identifying and Eliminating the Dreaded Cartesian Product.
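For illustration, here is a minimal, self-contained reproduction of that shared-lineage situation, modeled on the code in the question below (a sketch: on Spark 2.x this join raises the AnalysisException, while Spark 3.0+ allows cartesian products by default and may not raise it):

from pyspark.sql import SparkSession
from pyspark.sql.functions import max as max_

spark = SparkSession.builder.getOrCreate()
df0 = spark.createDataFrame([("a", 1), ("a", 2), ("b", 5)], ["catalog", "row_num"])

# maxs is derived from df0, so both sides of the join below share df0's
# lineage; per the linked answer, the condition can resolve to a trivially
# equal comparison, which Spark then flags as a cartesian product.
maxs = (df0.groupBy("catalog")
           .agg(max_("row_num").alias("max_num"))
           .withColumnRenamed("catalog", "catalogid"))

df0.join(maxs, df0.catalog == maxs.catalogid)  # AnalysisException on Spark 2.x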
Solution 2
Try persisting the DataFrames before joining them. That worked for me.
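A sketch of what that looks like with the question's code (df0, catalog, and row_num are the question's names; that persisting helps here is this answer's experience, not guaranteed behavior):

from pyspark.sql.functions import max as max_

maxs = (df0.groupBy("catalog")
           .agg(max_("row_num").alias("max_num"))
           .withColumnRenamed("catalog", "catalogid"))

# Mark both sides for caching before the join, as this answer suggests.
df0.persist()
maxs.persist()

df0.join(maxs, df0.catalog == maxs.catalogid).take(4)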
Solution 3
I faced the same problem with a cartesian product in my join. To overcome it, I used aliases on the DataFrames. See the example:
from pyspark.sql.functions import col

# Give each DataFrame an alias and qualify the join columns by that alias.
df1.alias("buildings").join(df2.alias("managers"), col("managers.distinguishedName") == col("buildings.manager"))
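Applied to the code in the question below, the same trick would look roughly like this (a sketch using the question's df0, catalog, and row_num; the alias names src and agg are made up):

from pyspark.sql.functions import col, max as max_

maxs = (df0.groupBy("catalog")
           .agg(max_("row_num").alias("max_num"))
           .withColumnRenamed("catalog", "catalogid"))

# Aliasing both sides forces the join condition to resolve against the
# intended relation instead of collapsing into one shared attribute.
(df0.alias("src")
    .join(maxs.alias("agg"), col("src.catalog") == col("agg.catalogid"))
    .take(4))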
Alex Loo
Updated on July 28, 2022
Comments
-
Alex Loo over 1 year
I have a dataframe, and for each row I want to add
new_col = max(some_column0)
grouped by some other column1:
maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)
And on the second line I get an error:
AnalysisException: u'Detected cartesian product for INNER join between logical plans\nProject ... Use the CROSS JOIN syntax to allow cartesian products between these relations.;'
What I don't understand is: why does Spark find a cartesian product here?
A possible way to avoid this error: I save the DF to a Hive table, then initialize the DF again as a select from that table. Or replace these two lines with a Hive query; it doesn't matter which. But I don't want to save the DF.
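(For reference, the save-and-reload workaround described above might look like the following sketch; the table name tmp_df0 is made up:)

from pyspark.sql.functions import max as max_

# Writing df0 out and reading it back gives the plan fresh lineage,
# which, per the question, avoids the cartesian-product detection.
df0.write.mode("overwrite").saveAsTable("tmp_df0")
df0 = spark.table("tmp_df0")

maxs = (df0.groupBy("catalog")
           .agg(max_("row_num").alias("max_num"))
           .withColumnRenamed("catalog", "catalogid"))

df0.join(maxs, df0.catalog == maxs.catalogid).take(4)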
-
CertainPerformance over 5 years
Best to include all relevant information in your answer itself, not just in a link - links may rot, but answer text does not (hopefully)
-
Surender Raja almost 2 years
It seems to work in PySpark with alias, but in the Scala DataFrame API it does not work