PySpark: match the values of a DataFrame column against another DataFrame column

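This kind of operation is called a left semi join in Spark. It keeps only the rows of df_B whose col1 value also appears in df_A, and returns only df_B's columns: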

df_B.join(df_A, ['col1'], 'leftsemi')
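To make the semantics concrete, here is a minimal plain-Python sketch of what a left semi join computes, using the sample data from the question. It is not Spark code: the helper name `left_semi_join` and the tuple-based row model are assumptions made for illustration. It also shows why an inner join is not a drop-in replacement here, since df_A contains duplicate keys.

```python
# Plain-Python sketch of left semi join semantics (illustrative only, not Spark code).
# Rows are modeled as (col1, col2) tuples; `left_semi_join` is a hypothetical helper.

def left_semi_join(left_rows, right_rows, key_index=0):
    """Keep each left row whose key appears at least once among the right rows."""
    right_keys = {row[key_index] for row in right_rows}
    return [row for row in left_rows if row[key_index] in right_keys]

rows_A = [("A", 1), ("B", 2), ("C", 3), ("B", 4), ("C", 5), ("D", 6)]
rows_B = [("C", 10), ("E", 20), ("D", 30), ("C", 40), ("F", 50), ("G", 60), ("H", 70)]

# Semi join: each matching row of B is kept exactly once.
matched = left_semi_join(rows_B, rows_A)
print(matched)  # [('C', 10), ('D', 30), ('C', 40)]

# Inner join for comparison: a row of B is repeated once per matching row of A.
inner = [(b, a) for b in rows_B for a in rows_A if b[0] == a[0]]
print(len(inner))  # 5, because 'C' occurs twice in A
```

Unlike this list-based sketch, Spark does not guarantee the order of rows after a join.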

Author by

cwl

Updated on October 24, 2022

Comments

  • cwl, about 1 year

    In pandas, I can use the .isin() method to match the values of one column against another column.

    For example: suppose we have one DataFrame:

    df_A = pd.DataFrame({'col1': ['A', 'B', 'C', 'B', 'C', 'D'], 
                         'col2': [1, 2, 3, 4, 5, 6]})
    df_A
    
        col1  col2
    0    A     1
    1    B     2
    2    C     3
    3    B     4
    4    C     5
    5    D     6         
    

    and another DataFrame:

    df_B = pd.DataFrame({'col1': ['C', 'E', 'D', 'C', 'F', 'G', 'H'], 
                         'col2': [10, 20, 30, 40, 50, 60, 70]})
    df_B
    
        col1  col2
    0    C    10
    1    E    20
    2    D    30
    3    C    40
    4    F    50
    5    G    60
    6    H    70       
    

    I can use the .isin() method to match the column values of df_B against the column values of df_A.

    E.g.:

    df_B[df_B['col1'].isin(df_A['col1'])]
    

    yields:

        col1  col2
    0    C    10
    2    D    30
    3    C    40
    

    What's the equivalent operation for a PySpark DataFrame?

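    import pandas as pd
    
    # sqlContext is assumed to be an existing SQLContext (e.g. from the PySpark shell)
    df_A = pd.DataFrame({'col1': ['A', 'B', 'C', 'B', 'C', 'D'], 
                         'col2': [1, 2, 3, 4, 5, 6]})
    df_A = sqlContext.createDataFrame(df_A)
    
    df_B = pd.DataFrame({'col1': ['C', 'E', 'D', 'C', 'F', 'G', 'H'], 
                         'col2': [10, 20, 30, 40, 50, 60, 70]})
    df_B = sqlContext.createDataFrame(df_B)
    
    # Attempt to filter df_B by a column of another DataFrame (this raises the error below)
    df_B[df_B['col1'].isin(df_A['col1'])]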
    

    The .isin() code above gives me an error message:

    u'resolved attribute(s) col1#9007 missing from 
    col1#9012,col2#9013L in operator !Filter col1#9012 IN 
    (col1#9007);;\n!Filter col1#9012 IN (col1#9007)\n+- 
    LogicalRDD [col1#9012, col2#9013L]\n'