PySpark: match the values of a DataFrame column against another DataFrame column
15,820
This kind of operation is called left semi join in spark:
df_B.join(df_A, ['col1'], 'leftsemi')
Related videos on Youtube
Comments
-
cwl about 1 year
In Pandas DataFrame, I can use
DataFrame.isin()
function to match the column values against another column.For example: suppose we have one DataFrame:
df_A = pd.DataFrame({'col1': ['A', 'B', 'C', 'B', 'C', 'D'], 'col2': [1, 2, 3, 4, 5, 6]}) df_A col1 col2 0 A 1 1 B 2 2 C 3 3 B 4 4 C 5 5 D 6
and another DataFrame:
df_B = pd.DataFrame({'col1': ['C', 'E', 'D', 'C', 'F', 'G', 'H'], 'col2': [10, 20, 30, 40, 50, 60, 70]}) df_B col1 col2 0 C 10 1 E 20 2 D 30 3 C 40 4 F 50 5 G 60 6 H 70
I can use
.isin()
function to match the column values ofdf_B
against the column values ofdf_A
E.g.:
df_B[df_B['col1'].isin(df_A['col1'])]
yields:
col1 col2 0 C 10 2 D 30 3 C 40
What's the equivalent operation in PySpark DataFrame?
df_A = pd.DataFrame({'col1': ['A', 'B', 'C', 'B', 'C', 'D'], 'col2': [1, 2, 3, 4, 5, 6]}) df_A = sqlContext.createDataFrame(df_A) df_B = pd.DataFrame({'col1': ['C', 'E', 'D', 'C', 'F', 'G', 'H'], 'col2': [10, 20, 30, 40, 50, 60, 70]}) df_B = sqlContext.createDataFrame(df_B) df_B[df_B['col1'].isin(df_A['col1'])]
The
.isin()
code above gives me an error messages:u'resolved attribute(s) col1#9007 missing from col1#9012,col2#9013L in operator !Filter col1#9012 IN (col1#9007);;\n!Filter col1#9012 IN (col1#9007)\n+- LogicalRDD [col1#9012, col2#9013L]\n'