How to use join with many conditions in pyspark?

11,772

e.g.

df1.join(df2, on=[df1['age'] == df2['age'], df1['sex'] == df2['sex']], how='left_outer')

But in your case, (summary.bucket)==9 should not appear as join condition

UPDATE:

In join condition you can use a list of Column join expression or a list of Column / column_name

Share:
11,772

Related videos on Youtube

Viv
Author by

Viv

Updated on May 26, 2022

Comments

  • Viv
    Viv over 1 year

    I am able to use the dataframe join statement with single on condition ( in pyspark) But, if I try to add multiple conditions, then It is failing.

    Code :

       summary2 = summary.join(county_prop, ["category_id", "bucket"], how = "leftouter").
    

    The above code works. However If I add some other condition for list like, summary.bucket == 9 or something, it fails. Please help me fix this issue.

       The error for the statement 
       summary2 = summary.join(county_prop, ["category_id", (summary.bucket)==9], how = "leftouter")
    
       ERROR : TypeError: 'Column' object is not callable
    

    Edit :

    Adding full working example.

       schema = StructType([StructField("category", StringType()), StructField("category_id", StringType()), StructField("bucket", StringType()), StructField("prop_count", StringType()), StructField("event_count", StringType()), StructField("accum_prop_count",StringType())])
       bucket_summary = sqlContext.createDataFrame([],schema)
    
       temp_county_prop = sqlContext.createDataFrame([("nation","nation",1,222,444,555),("nation","state",2,222,444,555)],schema)
       bucket_summary = bucket_summary.unionAll(temp_county_prop)
       county_prop = sqlContext.createDataFrame([("nation","state",2,121,221,551)],schema)
    

    Want to do a join on :

    category_id and bucket columns, I want to replace the values of county_prop on bucket_summary.

       cond = [bucket_summary.bucket == county_prop.bucket, bucket_summary.bucket == 2]
    

    bucket_summary2 = bucket_summary.join(county_prop, cond, how = "leftouter")

       1. It works if I mention the whole statement with cols, but if I list conditions like ["category_id", "bucket"]  --- THis too works.
    
       2. But, if I use a combination of both like cond =["bucket", bucket_summary.category_id == "state"] 
    

    It is not working. What can go wrong with the 2 statement?

    • mtoto
      mtoto about 6 years
      please provide a full reproducible example
    • Shaido
      Shaido about 6 years
      In your example, can't you simply do a .filter($"bucket" === 9) before performing the join?
    • Viv
      Viv about 6 years
      @mtoto, I ve added the example and updated the question with more findings.
  • Viv
    Viv about 6 years
    It works for bucket == 9 as well, only failure is when in the condition line if I write combination like : cond =["bucket", bucket_summary.category_id == "state"]