Apache Spark case with multiple when clauses on different columns
You can chain `when` calls, similar to the example at https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html#when-org.apache.spark.sql.Column-java.lang.Object- (available since 1.4.0):
// Scala:
people.select(when(people("gender") === "male", 0)
.when(people("gender") === "female", 1)
.otherwise(2))
Applied to your example:
val df1 = df.withColumn("Success",
when($"color" <=> "white", "Diamond")
.when($"size" > 10 && $"shape" === "Rhombus", "Diamond")
.otherwise(0))
Bandi LokeshReddy
I have 5 years of experience in application development and design using Hadoop-ecosystem tools, Big Data, Core Java, and Scala. I am well-equipped to provide software solutions, deliver quality applications, and give technical advice, and I am a team player with strong technical and communication skills.
Updated on June 19, 2022

Comments
-
Bandi LokeshReddy almost 2 years
Given the below structure:

val df = Seq("Color", "Shape", "Range", "Size").map(Tuple1.apply).toDF("color")
val df1 = df.withColumn("Success", when($"color" <=> "white", "Diamond").otherwise(0))

I want to add one more WHEN condition to the above: when size > 10 and the Shape column value is "Rhombus", the value "Diamond" should be inserted into the column, else 0. I tried the below, but it fails:

val df1 = df.withColumn("Success", when($"color" <=> "white", "Diamond").otherwise(0)).when($"size" > 10)

Please suggest a DataFrame-only approach in Scala; Spark SQL with sqlContext is not a helpful option for me.
Thanks!
-
skdhfgeq2134 about 5 years
In the example, how can I set an alias for the generated column (from the when statement)?
-
Pablo about 3 years
-1 to this answer. AFAIK, you should always avoid using a UDF for any task that can be solved by chaining existing statements from the Structured API, no matter how long or complex your code looks. The reason is that Spark's Catalyst optimizer heavily optimizes code based on the Structured API, but it is blind to a UDF, which is a non-optimizable black box for Spark.