How to check for intersection of two DataFrame columns in Spark

The intersect function operates on two Spark DataFrames, not on individual columns. To intersect single columns, use select to pull the column of interest out of each DataFrame, then intersect the resulting single-column DataFrames.

In SparkR:

newSalesHire <- intersect(select(newHiresDF, 'name'), select(salesTeamDF,'name'))

In pyspark:

newSalesHire = newHiresDF.select('name').intersect(salesTeamDF.select('name')) 
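Note that intersect follows SQL INTERSECT semantics: it returns the distinct rows that appear in both inputs, so duplicates are dropped. A minimal plain-Python sketch of that set logic, using the sample names from the question below (no Spark required):

```python
# Sample data from the question: "George" appears twice in newHires,
# but intersect returns distinct matches only.
new_hires_names = ["Thomas", "George", "George", "John"]
sales_team_names = ["Lucas", "Bill", "George"]

# Distinct values common to both columns, analogous to
# newHiresDF.select('name').intersect(salesTeamDF.select('name'))
common = set(new_hires_names) & set(sales_team_names)
print(sorted(common))  # ['George']
```

If you need to keep duplicates, Spark also offers intersectAll (Spark 2.4+), which follows SQL INTERSECT ALL semantics.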
Gaurav Bansal

Updated on October 14, 2020

Comments

  • Gaurav Bansal over 3 years

    Using either pyspark or sparkr (preferably both), how can I get the intersection of two DataFrame columns? For example, in sparkr I have the following DataFrames:

    newHires <- data.frame(name = c("Thomas", "George", "George", "John"),
                           surname = c("Smith", "Williams", "Brown", "Taylor"))
    salesTeam <- data.frame(name = c("Lucas", "Bill", "George"),
                            surname = c("Martin", "Clark", "Williams"))
    newHiresDF <- createDataFrame(newHires)
    salesTeamDF <- createDataFrame(salesTeam)
    
    #Intersect works for the entire DataFrames
    newSalesHire <- intersect(newHiresDF, salesTeamDF)
    head(newSalesHire)
    
            name  surname
        1 George Williams
    
    #Intersect does not work for single columns
    newSalesHire <- intersect(newHiresDF$name, salesTeamDF$name)
    head(newSalesHire)
    

    Error in as.vector(y) : no method for coercing this S4 class to a vector

    How can I get intersect to work for single columns?