How to check for intersection of two DataFrame columns in Spark

The intersect function operates on two Spark DataFrames, not on individual columns. To intersect single columns, use select to pull the column of interest out of each DataFrame, then intersect the resulting single-column DataFrames.

In SparkR:

newSalesHire <- intersect(select(newHiresDF, 'name'), select(salesTeamDF,'name'))

In pyspark:

newSalesHire = newHiresDF.select('name').intersect(salesTeamDF.select('name')) 
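Note that intersect follows SQL INTERSECT semantics: it returns the distinct rows that appear in both inputs, so duplicates are dropped. A minimal plain-Python sketch of that set logic, using the sample names from the question below (no Spark required):

```python
# Sample data from the question: "George" appears twice in newHires,
# but intersect returns distinct matches only.
new_hires_names = ["Thomas", "George", "George", "John"]
sales_team_names = ["Lucas", "Bill", "George"]

# Distinct values common to both columns, analogous to
# newHiresDF.select('name').intersect(salesTeamDF.select('name'))
common = set(new_hires_names) & set(sales_team_names)
print(sorted(common))  # ['George']
```

If you need to keep duplicates, Spark also offers intersectAll (Spark 2.4+), which follows SQL INTERSECT ALL semantics.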
Gaurav Bansal

Updated on October 14, 2020

Comments

  • Gaurav Bansal over 3 years

    Using either pyspark or sparkr (preferably both), how can I get the intersection of two DataFrame columns? For example, in sparkr I have the following DataFrames:

    newHires <- data.frame(name = c("Thomas", "George", "George", "John"),
                           surname = c("Smith", "Williams", "Brown", "Taylor"))
    salesTeam <- data.frame(name = c("Lucas", "Bill", "George"),
                            surname = c("Martin", "Clark", "Williams"))
    newHiresDF <- createDataFrame(newHires)
    salesTeamDF <- createDataFrame(salesTeam)
    
    #Intersect works for the entire DataFrames
    newSalesHire <- intersect(newHiresDF, salesTeamDF)
    head(newSalesHire)
    
            name  surname
        1 George Williams
    
    #Intersect does not work for single columns
    newSalesHire <- intersect(newHiresDF$name, salesTeamDF$name)
    head(newSalesHire)
    

    Error in as.vector(y) : no method for coercing this S4 class to a vector

    How can I get intersect to work for single columns?