Pyspark filter dataframe by columns of another dataframe
Left anti join is what you're looking for:
df1.join(df2, ["userid", "group"], "leftanti")
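For context, here is a minimal self-contained sketch using the sample data from the question below; the SparkSession setup is an addition for illustration, not part of the original answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(348, 2, [225, 2235, 2225]),
     (567, 1, [1110, 1150]),
     (595, 1, [1150, 1150, 1150]),
     (580, 2, [2240, 2225]),
     (448, 1, [1130])],
    ["userid", "group", "all_picks"])

df2 = spark.createDataFrame(
    [(348, 2, 2270),
     (595, 1, 2125)],
    ["userid", "group", "pick"])

# The anti join keeps only the df1 rows whose (userid, group)
# pair has no match in df2 -- users 567, 580 and 448 here.
df1.join(df2, ["userid", "group"], "leftanti").show()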
but the same thing can be done with a left outer join:
(df1
    .join(df2, ["userid", "group"], "leftouter")
    .where(df2["pick"].isNull())
    .drop(df2["pick"]))
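Since the question also asks about sql, the same anti join can be expressed in Spark SQL. A sketch, assuming the two DataFrames are first registered as temporary views (the view names are illustrative, not from the original answer):

df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

# `group` is backquoted because it is a SQL keyword
result = spark.sql("""
    SELECT d1.*
    FROM df1 d1
    LEFT ANTI JOIN df2 d2
      ON d1.userid = d2.userid AND d1.`group` = d2.`group`
""")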
Comments
drewyupdrew over 2 years
Not sure why I'm having a difficult time with this, it seems so simple considering it's fairly easy to do in R or pandas. I wanted to avoid using pandas though since I'm dealing with a lot of data, and I believe toPandas() loads all the data into the driver's memory in pyspark.

I have 2 dataframes: df1 and df2. I want to filter df1 (remove all rows) where df1.userid = df2.userid AND df1.group = df2.group. I wasn't sure if I should use filter(), join(), or sql. For example:

df1:
+------+----------+--------------------+
|userid| group    | all_picks          |
+------+----------+--------------------+
|   348|         2|[225, 2235, 2225]   |
|   567|         1|[1110, 1150]        |
|   595|         1|[1150, 1150, 1150]  |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+

df2:
+------+----------+---------+
|userid| group    | pick    |
+------+----------+---------+
|   348|         2|     2270|
|   595|         1|     2125|
+------+----------+---------+

Result I want:

+------+----------+--------------------+
|userid| group    | all_picks          |
+------+----------+--------------------+
|   567|         1|[1110, 1150]        |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+
EDIT: I've tried many join() and filter() functions, and I believe the closest I got was:

cond = [df1.userid == df2.userid, df2.group == df2.group]
df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks)  # Result has 7 rows
I tried a bunch of different join types, and I also tried different cond values:

cond = ((df1.userid == df2.userid) & (df2.group == df2.group))  # result has 7 rows
cond = ((df1.userid != df2.userid) & (df2.group != df2.group))  # result has 2 rows
However, it seems like the joins are adding rows rather than removing them.
I'm using python 2.7 and spark 2.1.0.