PySpark: How to check if list of string values exists in dataframe and print values to a list


Looping through data in PySpark is generally inefficient; prefer native PySpark functions where possible. For your specific question, you can use the filter function to keep only the rows of your DataFrame whose name is in the students list:

from pyspark.sql.functions import col  # needed for col()
df_names.filter(col("name").isin(students)).select("name")

In your example, the result is a DataFrame containing a single row: John.


Author: Techno04335


Updated on June 04, 2022

Comments

  • Techno04335 over 1 year

    I have a DataFrame NAMES; display(NAMES) outputs:

    NAMES
    John
    Sarah
    Michael
    Sean

    I also have a list, students; print(students) gives:

    {John, Alan, Andy}

    Question:

    Based on this list (students), how can I loop through the "NAMES" column of the DataFrame and write to another list the names that appear both in the students list and in the DataFrame?

    Expected output of list: "John"

    I have tried

    list2 = []
    for i in NAMES:
         for g in students:
            if i == g:
              list2.append(i)
    

    but I end up with an error. How can I implement this in PySpark?

    Thanks.

    • Matt Messersmith over 5 years
      What does this have to do with PySpark?
    • Matt Messersmith over 5 years
      What error did you get?