How to check for intersection of two DataFrame columns in Spark
You need two Spark DataFrames to use the intersect function; use select to project the column of interest from each DataFrame before intersecting.
In SparkR:
newSalesHire <- intersect(select(newHiresDF, 'name'), select(salesTeamDF,'name'))
In pyspark:
newSalesHire = newHiresDF.select('name').intersect(salesTeamDF.select('name'))
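Note that intersect follows SQL INTERSECT semantics: the result contains only the distinct rows that appear in both inputs, so the duplicate "George" in newHires collapses to a single row. A minimal pure-Python sketch of that behaviour, using plain lists as stand-ins for the single-column DataFrames (the list names are illustrative, not Spark API):

```python
# Plain lists standing in for the selected 'name' columns (illustrative only)
new_hires_names = ["Thomas", "George", "George", "John"]
sales_team_names = ["Lucas", "Bill", "George"]

# intersect() deduplicates, so a set intersection models it exactly
common = sorted(set(new_hires_names) & set(sales_team_names))
print(common)  # ['George']
```

If you need multiset behaviour (duplicates preserved), newer Spark versions also provide intersectAll on DataFrames.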
Author: Gaurav Bansal
Updated on October 14, 2020

Comments
Gaurav Bansal, over 3 years ago:

Using either pyspark or sparkr (preferably both), how can I get the intersection of two DataFrame columns? For example, in sparkr I have the following DataFrames:

newHires <- data.frame(name = c("Thomas", "George", "George", "John"),
                       surname = c("Smith", "Williams", "Brown", "Taylor"))
salesTeam <- data.frame(name = c("Lucas", "Bill", "George"),
                        surname = c("Martin", "Clark", "Williams"))
newHiresDF <- createDataFrame(newHires)
salesTeamDF <- createDataFrame(salesTeam)

#Intersect works for the entire DataFrames
newSalesHire <- intersect(newHiresDF, salesTeamDF)
head(newSalesHire)
    name  surname
1 George Williams

#Intersect does not work for single columns
newSalesHire <- intersect(newHiresDF$name, salesTeamDF$name)
head(newSalesHire)
Error in as.vector(y) : no method for coercing this S4 class to a vector

How can I get intersect to work for single columns?