Check that every column in a Spark DataFrame has a certain value

Solution 1

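You can do something like this to keep the rows where a column contains 'Y'. Note that because the per-column results are unioned, this keeps a row if at least one column equals 'Y', and a row appears once for each column that matched: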

import org.apache.spark.sql.DataFrame

//Get all columns
val columns: Array[String] = df.columns

//For each column, keep the rows with 'Y'
val seqDfs: Seq[DataFrame] = columns.map(name => df.filter(s"$name == 'Y'"))

//Union all the dataframes together into one final dataframe
val output: DataFrame = seqDfs.reduceRight(_ union _)
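
To require 'Y' in every column in a single pass and without duplicated rows, the per-column tests can be folded into one predicate instead of being unioned. The following is a minimal sketch, not the original answer's code: it assumes a local SparkSession, simple column names, and a small made-up DataFrame, and the names AllColumnsEqualSketch and allColumnsEqual are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object AllColumnsEqualSketch {
  // Hypothetical helper: keeps the rows in which every column equals `value`.
  // Assumes plain column names (no dots or special characters).
  def allColumnsEqual(df: DataFrame, value: String): DataFrame = {
    // One equality test per column, combined with logical AND.
    val predicate: Column = df.columns
      .map(name => col(name) === lit(value))
      .reduceOption(_ && _)
      .getOrElse(lit(true)) // a DataFrame without columns has nothing to check
    df.filter(predicate)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("all-columns-equal-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: only the first row has 'Y' in every column.
    val df = Seq(("Y", "Y", "Y"), ("Y", "N", "Y"), ("N", "N", "N")).toDF("a", "b", "c")

    allColumnsEqual(df, "Y").show()
    spark.stop()
  }
}
```

Rows with a null in any column are dropped as well, since the comparison against a null does not evaluate to true.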

Solution 2

You can use the DataFrame's columns method to get all the column names:

val columnNames: Array[String] = df.columns

and then add all the filters in a loop:

//Start from the original DataFrame and narrow it down one column at a time
var filteredDf: DataFrame = df

for (name <- columnNames) {
    filteredDf = filteredDf.filter(s"$name == 'Y'")
}

Alternatively, you can build a SQL query using the same approach.
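
As a rough illustration of the SQL variant, the WHERE clause can be assembled from the column names and run through spark.sql. This is a sketch under assumptions rather than the answerer's code: the view name flags, the sample data, and the names AllColumnsEqualSql and buildQuery are hypothetical, and the DataFrame is assumed to have at least one column.

```scala
import org.apache.spark.sql.SparkSession

object AllColumnsEqualSql {
  // Hypothetical helper: builds a query such as
  //   SELECT * FROM t WHERE `a` = 'Y' AND `b` = 'Y'
  // Assumes columnNames is non-empty and contains no backticks.
  def buildQuery(columnNames: Seq[String], viewName: String): String = {
    val conditions = columnNames.map(name => s"`$name` = 'Y'").mkString(" AND ")
    s"SELECT * FROM $viewName WHERE $conditions"
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("all-columns-equal-sql")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data registered under a hypothetical view name.
    val df = Seq(("Y", "Y"), ("Y", "N")).toDF("a", "b")
    df.createOrReplaceTempView("flags")

    spark.sql(buildQuery(df.columns.toSeq, "flags")).show()
    spark.stop()
  }
}
```

Joining the conditions with AND mirrors the chained filters in the loop: a row is returned only when every column equals 'Y'.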

Author: Bharath

Updated on June 05, 2022

Comments

  • Bharath
    Bharath almost 2 years

    Can we check whether every column in a Spark DataFrame contains a certain string (for example, "Y") using Spark SQL or Scala?

    I have tried the following but don't think it is working properly.

    df.select(df.col("*")).filter("'*' =='Y'")
    

    Thanks, Sai

    • Ram Ghadiyaram
      Ram Ghadiyaram over 6 years
      What have you tried so far, and what was the result? What was not working? The question seems abstract. Please provide a code snippet of what you have tried.
    • Bharath
      Bharath over 6 years
      Hello Ram, I added the code snippet to the question.
  • Bharath
    Bharath over 6 years
    Hello Andrei, thank you for the response. I really appreciate it. However, I would like to check whether the value in every column contains "Y", not the column name.
  • Andrei Stankevich
    Andrei Stankevich over 6 years
    The filter filter(s"$name =='Y'") checks whether the value of the column named $name equals 'Y'. If you want to check "contains", you can use filter(s"$name like '%Y%'"). The expression s"$name like '%Y%'" checks the value of that specific column, not the column name.
  • Bharath
    Bharath over 6 years
    Thank you, Andrei. I will try that and let you know. I appreciate the effort.
  • Bharath
    Bharath over 6 years
    Hello Andrei. I tried the solution, but it doesn't seem to be doing the right thing.
  • Bharath
    Bharath over 6 years
    Hello Sohum. Thank you. Although it works in part, when I try to write the result to a file, it repeats the same output many times.
  • Andrei Stankevich
    Andrei Stankevich over 6 years
    Can you post your code with an explanation of how you expect it to work?
  • Andrei Stankevich
    Andrei Stankevich over 6 years
    The issue with this solution is that if two columns of a row have the value 'Y', you are going to have two identical rows in the result DataFrame. If three columns of a row have the value 'Y', you are going to have three identical rows in the result DataFrame.
  • Michael Brenndoerfer
    Michael Brenndoerfer about 5 years
    The problem here is that multiple filters are chained together with AND, so you basically end up with rows where the value in EVERY COLUMN is equal to Y.