PySpark: filter dataframe by columns of another dataframe


Left anti join is what you're looking for:

df1.join(df2, ["userid", "group"], "leftanti")
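
For context, here is a minimal, self-contained sketch that rebuilds the question's two dataframes and applies the anti join. It assumes a local SparkSession; the builder settings and app name are just illustrative:

from pyspark.sql import SparkSession

# Assumed setup: a local SparkSession; adjust to your own environment.
spark = SparkSession.builder.master("local[*]").appName("anti_join_demo").getOrCreate()

# Sample data copied from the question below.
df1 = spark.createDataFrame(
    [(348, 2, [225, 2235, 2225]),
     (567, 1, [1110, 1150]),
     (595, 1, [1150, 1150, 1150]),
     (580, 2, [2240, 2225]),
     (448, 1, [1130])],
    ["userid", "group", "all_picks"])

df2 = spark.createDataFrame(
    [(348, 2, 2270),
     (595, 1, 2125)],
    ["userid", "group", "pick"])

# Keep only df1 rows whose (userid, group) pair has no match in df2.
df1.join(df2, ["userid", "group"], "leftanti").show()
# Expected output: the rows for userids 567, 580 and 448.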

but the same thing can be done with a left outer join:

(df1
    .join(df2, ["userid", "group"], "leftouter")
    .where(df2["pick"].isNull())
    .drop(df2["pick"]))
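
The left outer variant works because df1 rows with no (userid, group) match in df2 come back with pick set to NULL, so keeping only the null rows and then dropping pick leaves exactly the rows you want.

Since the question also mentions SQL as an option, the same anti join can be written against temporary views. This is just a sketch reusing the spark session and dataframes from the example above; the view names are arbitrary:

# Spark SQL equivalent; `group` is backtick-quoted because it is a SQL keyword.
df1.createOrReplaceTempView("df1_view")
df2.createOrReplaceTempView("df2_view")

spark.sql("""
    SELECT d1.*
    FROM df1_view d1
    LEFT ANTI JOIN df2_view d2
      ON d1.userid = d2.userid AND d1.`group` = d2.`group`
""").show()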

Comments

  • drewyupdrew, over 2 years ago

    Not sure why I'm having such a difficult time with this; it seems simple enough, given that it's fairly easy to do in R or pandas. I want to avoid pandas, though, since I'm dealing with a lot of data, and I believe toPandas() loads all the data into the driver's memory in PySpark.

    I have 2 dataframes: df1 and df2. I want to filter df1 (removing all rows) where df1.userid = df2.userid AND df1.group = df2.group. I wasn't sure whether to use filter(), join(), or SQL. For example:

    df1:
    +------+----------+--------------------+
    |userid|   group  |      all_picks     |
    +------+----------+--------------------+
    |   348|         2|[225, 2235, 2225]   |
    |   567|         1|[1110, 1150]        |
    |   595|         1|[1150, 1150, 1150]  |
    |   580|         2|[2240, 2225]        |
    |   448|         1|[1130]              |
    +------+----------+--------------------+
    
    df2:
    +------+----------+---------+
    |userid|   group  |   pick  |
    +------+----------+---------+
    |   348|         2|     2270|
    |   595|         1|     2125|
    +------+----------+---------+
    
    Result I want:
    +------+----------+--------------------+
    |userid|   group  |      all_picks     |
    +------+----------+--------------------+
    |   567|         1|[1110, 1150]        |
    |   580|         2|[2240, 2225]        |
    |   448|         1|[1130]              |
    +------+----------+--------------------+
    

    EDIT: I've tried many join() and filter() combinations; I believe the closest I got was:

    cond = [df1.userid == df2.userid, df2.group == df2.group]
    df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks) # Result has 7 rows
    

    I tried a bunch of different join types, and I also tried different cond values:

        cond = ((df1.userid == df2.userid) & (df2.group == df2.group)) # result has 7 rows
        cond = ((df1.userid != df2.userid) & (df2.group != df2.group)) # result has 2 rows
    

    However, it seems like the joins are adding rows rather than removing them.

    I'm using Python 2.7 and Spark 2.1.0.