pyspark left outer join with multiple columns
This did the trick; it seems you have to use an alias, similar to what has been posted before, though slightly simpler in PySpark 2.1.0.
from pyspark.sql.functions import col

cr_outs = crimes.alias('a')\
    .join(outcomes, crimes.CRIME_ID == outcomes.CRIME_ID, 'left_outer')\
    .select(*[col('a.' + c) for c in crimes.columns]
            + [outcomes.FINAL_OUTCOME])
cr_outs.show()
cr_outs.printSchema()
+--------------------+--------+--------------------+--------------------+---------+---------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+
|            CRIME_ID|YEAR_MTH|         REPORTED_BY|        FALLS_WITHIN|LONGITUDE| LATITUDE|            LOCATION|LSOA_CODE|          LSOA_NAME|          CRIME_TYPE|     CURRENT_OUTCOME|       FINAL_OUTCOME|
+--------------------+--------+--------------------+--------------------+---------+---------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+
|426085c2ed33af598...| 2017-01|City of London Po...|City of London Po...|-0.086051| 51.51357|On or near Finch ...|E01032739|City of London 001F|         Other theft|Investigation com...|Investigation com...|
|33a3ddb8160a854a4...| 2017-01|City of London Po...|City of London Po...|-0.077777|51.518047|On or near Sandy'...|E01032
..
root
|-- CRIME_ID: string (nullable = true)
|-- YEAR_MTH: string (nullable = true)
|-- REPORTED_BY: string (nullable = true)
|-- FALLS_WITHIN: string (nullable = true)
|-- LONGITUDE: float (nullable = true)
|-- LATITUDE: float (nullable = true)
|-- LOCATION: string (nullable = true)
|-- LSOA_CODE: string (nullable = true)
|-- LSOA_NAME: string (nullable = true)
|-- CRIME_TYPE: string (nullable = true)
|-- CURRENT_OUTCOME: string (nullable = true)
|-- FINAL_OUTCOME: string (nullable = true)
As you can see, there are many more columns than in my original post, but no duplicate columns and no renaming of columns either :-)
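For reference, the column selection above amounts to a left outer join that keeps every left-hand column plus FINAL_OUTCOME, producing one output row per matching outcome and a null where no outcome exists. A plain-Python sketch of those semantics (with hypothetical sample rows, not the real data) would be:

```python
# Hypothetical sample rows illustrating the join semantics;
# column names follow the question, values are made up.
crimes = [
    {'CRIME_ID': 'c1', 'CRIME_TYPE': 'Other theft'},
    {'CRIME_ID': 'c2', 'CRIME_TYPE': 'Burglary'},
]
outcomes = [
    {'CRIME_ID': 'c1', 'FINAL_OUTCOME': 'Investigation complete'},
    {'CRIME_ID': 'c1', 'FINAL_OUTCOME': 'Offender charged'},
    # no outcome for 'c2': the left outer join yields None there
]

joined = []
for c in crimes:
    matches = [o for o in outcomes if o['CRIME_ID'] == c['CRIME_ID']]
    if matches:
        # one output row per matching outcome (many outcomes per crime)
        for o in matches:
            joined.append({**c, 'FINAL_OUTCOME': o['FINAL_OUTCOME']})
    else:
        # unmatched left row is kept, with a null right-hand column
        joined.append({**c, 'FINAL_OUTCOME': None})
```

This mirrors why the result has one row per outcome for each crime, and exactly the left-hand columns plus FINAL_OUTCOME.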
Author by
alortimor
Updated on June 27, 2022
Comments
- alortimor, almost 2 years ago
I'm using Pyspark 2.1.0.
I'm attempting to perform a left outer join of two dataframes, whose schemas appear as follows:
crimes
 |-- CRIME_ID: string (nullable = true)
 |-- YEAR_MTH: string (nullable = true)
 |-- CRIME_TYPE: string (nullable = true)
 |-- CURRENT_OUTCOME: string (nullable = true)

outcomes
 |-- CRIME_ID: string (nullable = true)
 |-- YEAR_MTH: string (nullable = true)
 |-- FINAL_OUTCOME: string (nullable = true)
I need to join crimes to outcomes with a left outer join, since many outcomes exist for a single crime. I would like to exclude the columns that are common to both dataframes.
I have tried the following 2 ways, but each generate various errors:
cr_outs = crimes.join(outcomes, crimes.CRIME_ID == outcomes.CRIME_ID, 'left_outer')\
    .select(['crimes.'+c for c in crimes.columns] + ['outcomes.FINAL_OUTCOME'])

from pyspark.sql.functions as fn
cr_outs = crimes.alias('a').join(outcomes.alias('b'), fn.col('b.CRIME_ID') = fn.col('a.CRIME_ID') ,'left_outer')\
    .select([fn.col('a.'+ c) for c in a.columns] + b.FINAL_OUTCOME)
Could anybody suggest an alternative way? Thanks.