PySpark dataframe filter on multiple columns


Solution 1

Doing the following should solve your issue:

from pyspark.sql.functions import col

# Keep rows where Name2 is not null AND contains no digit.
# Note: use ~ (not !) for negation, & (not |) to require both
# conditions, and call isNotNull() with parentheses.
df.filter(col("Name2").isNotNull() & ~col("Name2").rlike("[0-9]"))
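For reference, here is a minimal end-to-end sketch of the same filter; it assumes a local SparkSession (the variable name spark and the sample rows are taken from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Assumed setup: a local session and the question's sample data.
spark = SparkSession.builder.master("local[*]").appName("filter-demo").getOrCreate()

df = spark.createDataFrame(
    [("Naveen", "Srikanth"),
     ("Naveen", "Srikanth123"),
     ("Naveen", None),
     ("Srikanth", "Naveen")],
    ["Name1", "Name2"])

# Keep rows whose Name2 is non-null and contains no digit.
df.filter(col("Name2").isNotNull() & ~col("Name2").rlike("[0-9]")).show()
# +--------+--------+
# |   Name1|   Name2|
# +--------+--------+
# |  Naveen|Srikanth|
# |Srikanth|  Naveen|
# +--------+--------+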

Solution 2

It should be as simple as putting multiple conditions into the filter.

// The implicits import must be in scope before calling toDF.
import spark.sqlContext.implicits._

val df = List(
  ("Naveen", "Srikanth"),
  ("Naveen", "Srikanth123"),
  ("Naveen", null),
  ("Srikanth", "Naveen")).toDF("Name1", "Name2")

df.filter(!$"Name2".isNull && !$"Name2".rlike("[0-9]")).show

or, if you prefer not to use the spark-sql $ syntax:

df.filter(!df("Name2").isNull && !df("Name2").rlike("[0-9]")).show 

or in Python:

df.filter(df["Name2"].isNotNull() & ~df["Name2"].rlike("[0-9]")).show()

Comments

  • user3292373
    user3292373 almost 2 years

    Using Spark 2.1.1

    Below is my data frame

    id  Name1     Name2
    1   Naveen    Srikanth
    2   Naveen    Srikanth123
    3   Naveen
    4   Srikanth  Naveen

    Now I need to filter rows based on two conditions: rows 2 and 3 need to be filtered out, because Name2 contains the digits 123 in row 2 and a null value in row 3.

    I am using the code below, which filters only row id 2:

    df.select("*").filter(df["Name2"].rlike("[0-9]")).show()
    

    I got stuck trying to include the second condition.

  • user3292373
    user3292373 over 6 years
    Getting spark.sqlContext.implicits._ not found, Michel, and getting invalid operator errors; it is not letting me use && and !$.
  • Michel Lemay
    Michel Lemay over 6 years
    This import is for '$' and works like a charm in the Scala REPL, as I just tested. When you are in a full-featured Spark project, you can import it in a scope that has access to the spark session variable (org.apache.spark.sql.SparkSession).
  • user3292373
    user3292373 over 6 years
    Ya, it is not working because I am using pyspark but the syntax is in Scala. I am not using Scala; my environment is a Cloudera project environment.
  • user3292373
    user3292373 over 6 years
    Is the syntax the same for pyspark? I am using pyspark, not Scala.
  • Michel Lemay
    Michel Lemay over 6 years
    Maybe you shouldn't have tagged the post [scala] then... I assumed you were familiar with both environments.
  • user3292373
    user3292373 over 6 years
    I used it in pyspark as from pyspark.sql.functions import *, and the snippet Ramesh gave is not working. Also, the code you have given works only for 123, but I need it for any numeric digit, i.e. [0-9].
  • Michel Lemay
    Michel Lemay over 6 years
    It should have been an && as in my example.
  • user3292373
    user3292373 over 6 years
    Thank you, Ramesh. What you said was right, but I got the answer with code similar to yours: df.select("*").filter(~df["Name2"].rlike("[0-9]"))
  • EntryLevelR
    EntryLevelR over 6 years
    Helpful to see the import col statement... other answers did not include this!