Check that every column in a Spark DataFrame has a certain value
Solution 1
You can do something like this to keep the rows where a column contains 'Y'. Note that union keeps duplicates, so a row appears once for each of its columns that matches:
//Get all columns
val columns: Array[String] = df.columns
//For each column, keep the rows with 'Y'
val seqDfs: Seq[DataFrame] = columns.map(name => df.filter(s"$name == 'Y'"))
//Union all the dataframes together into one final dataframe
val output: DataFrame = seqDfs.reduceRight(_ union _)
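To check what a union of per-column filters returns, here is a small plain-Scala model that needs no Spark session, with each row represented as a `Seq[String]`. The object name `UnionVsAll` and both helper names are invented for illustration. It is only a sketch of the semantics (DataFrame.union also keeps duplicates), not the DataFrame code itself.

```scala
object UnionVsAll {
  type Row = Seq[String]

  // Mirrors the approach above: filter once per column position,
  // then concatenate the results without removing duplicates.
  def unionOfPerColumnFilters(rows: Seq[Row], value: String): Seq[Row] = {
    val width = rows.headOption.map(_.length).getOrElse(0)
    (0 until width).flatMap(i => rows.filter(row => row(i) == value))
  }

  // Keeps a row only if every column equals the value.
  def allColumnsMatch(rows: Seq[Row], value: String): Seq[Row] =
    rows.filter(_.forall(_ == value))
}
```

In this model, a row with 'Y' in two columns comes back twice from `unionOfPerColumnFilters`, while `allColumnsMatch` returns it once and drops rows that match only partially.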
Solution 2
You can use the DataFrame method columns to get all the column names
val columnNames: Array[String] = df.columns
and then add all filters in a loop
//Start from the original dataframe and narrow it down one column at a time
var filteredDf: DataFrame = df
for (name <- columnNames) {
  filteredDf = filteredDf.filter(s"$name == 'Y'")
}
Or you can build a SQL query using the same approach.
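A minimal sketch of the SQL-string variant, assuming every column is a string column and the value contains no quote characters. The names `AllColumnsFilter` and `allColumnsEqual` are hypothetical helpers, not part of Spark. Only the predicate is built here, so the snippet runs without a Spark session.

```scala
// Hypothetical helper (not a Spark API): builds one predicate that
// requires every listed column to equal the given value.
object AllColumnsFilter {
  def allColumnsEqual(columns: Seq[String], value: String): String = {
    require(columns.nonEmpty, "need at least one column")
    // Assumption: the value is a simple literal; this sketch does no escaping.
    require(!value.contains("'"), "this sketch does not escape quotes")
    columns
      .map(name => s"`$name` = '$value'") // backticks quote the column name
      .mkString(" AND ")
  }
}
```

With a DataFrame df, the result could be passed to df.filter(AllColumnsFilter.allColumnsEqual(df.columns.toSeq, "Y")), or placed after WHERE in a spark.sql query against a registered temp view.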
Author: Bharath
Updated on June 05, 2022
Comments
-
Bharath almost 2 years
Can we check whether every column in a Spark DataFrame contains a certain string (for example "Y") using Spark SQL or Scala?
I have tried the following but don't think it is working properly.
df.select(df.col("*")).filter("'*' =='Y'")
Thanks, Sai
-
Ram Ghadiyaram over 6 years
What have you tried so far, and what was the result? What was not working? The question seems abstract. Please provide a code snippet of what you have tried.
-
Bharath over 6 years
Hello Ram, I added the code snippet to the question.
-
Bharath over 6 years
Hello Andrei, thank you for the response. I really appreciate it. However, I would like to check whether the value in every column contains "Y", not the column name.
-
Andrei Stankevich over 6 years
filter(s"$name == 'Y'") checks whether the value of the column named $name equals 'Y'. If you want to check 'contains', you can use filter(s"$name like '%Y%'"). The expression s"$name like '%Y%'" checks the value in that specific column, not the column name.
-
Bharath over 6 years
Thank you, Andrei. I will try that and let you know. I appreciate the effort.
-
Bharath over 6 years
Hello Andrei. I tried the solution, but it doesn't seem to be doing the right thing.
-
Bharath over 6 years
Hello Sohum. Thank you. Though it partly works, when I try to write the result to a file it repeats the same output many times.
-
Andrei Stankevich over 6 years
Can you post your code with an explanation of how you expect it to work?
-
Andrei Stankevich over 6 years
The issue with this solution is that if two columns of a row have the value 'Y', you get that row twice in the result dataframe. If three columns of a row have the value 'Y', you get it three times.
-
Michael Brenndoerfer about 5 years
The problem here is that multiple chained filters are combined with AND, so you basically end up with only the rows where the value in EVERY COLUMN is equal to Y.