Apache Spark case with multiple when clauses on different columns


You can chain when calls, as shown in the example at https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html#when-org.apache.spark.sql.Column-java.lang.Object- (available since 1.4.0):

// Scala:
people.select(when(people("gender") === "male", 0)
  .when(people("gender") === "female", 1)
  .otherwise(2))

Your example:

val df1 = df.withColumn("Success",
  when($"color" <=> "white", "Diamond")
  .when($"size" > 10 && $"shape" === "Rhombus", "Diamond")
  .otherwise(0))
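
For completeness, here is a minimal, self-contained sketch of the same approach; the sample rows and the column names color, size and shape are assumptions chosen to match the conditions, not taken from your data. Since both branches return "Diamond", the two when calls could also be collapsed into a single when with the conditions joined by ||.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

val spark = SparkSession.builder().master("local[*]").appName("when-chain").getOrCreate()
import spark.implicits._

// Hypothetical sample data containing the three columns used in the conditions
val df = Seq(
  ("white", 5,  "Circle"),
  ("black", 12, "Rhombus"),
  ("black", 3,  "Square")
).toDF("color", "size", "shape")

// Chain when(...) clauses on different columns; everything else falls through to otherwise
val df1 = df.withColumn("Success",
  when($"color" <=> "white", "Diamond")
    .when($"size" > 10 && $"shape" === "Rhombus", "Diamond")
    .otherwise("0"))   // "0" as a string keeps the result column a single type

df1.show()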
Author: Bandi LokeshReddy

I have 5 years of experience in application development and design using Hadoop ecosystem tools, Big Data, Core Java and Scala. Well-equipped to provide software solutions, deliver quality applications and give technical advice, and an excellent team player with strong technical and communication skills.

Updated on June 19, 2022

Comments

  • Bandi LokeshReddy, almost 2 years ago

    Given the structure below:

    val df = Seq("Color", "Shape", "Range", "Size").map(Tuple1.apply).toDF("color")
    
    val df1 = df.withColumn("Success", when($"color"<=> "white", "Diamond").otherwise(0))
    

    I want to add one more WHEN condition to the above: where size > 10 and the Shape column value is Rhombus, the value "Diamond" should be inserted into the column, else 0. I tried the following, but it fails:

    val df1 = df.withColumn("Success", when($"color" <=> "white", "Diamond").otherwise(0)).when($"size">10)
    

    Please suggest a DataFrame-only approach in Scala; Spark SQL with sqlContext is not a helpful option for me.

    Thanks!

  • skdhfgeq2134, about 5 years ago
    In the example, how can I set an alias for the column generated by the when statement? (See the sketch after these comments.)
  • Pablo, about 3 years ago
    -1 to this answer. AFAIK, you should always avoid using a UDF for any task that can be solved by chaining existing statements from the Structured API, no matter how long or complex your code looks. The reason is that Spark's Catalyst optimizer will heavily optimize code that is based on the Structured API, but it is blind when it encounters a UDF, which is a non-optimizable black box for Spark.
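
Regarding the alias question above: a minimal sketch, assuming the people DataFrame from the documentation example. The column produced by a when chain can be named with alias (or as) before it is selected; gender_code is an illustrative name.

// Name the column produced by the when(...) chain via alias
val labelled = people.select(
  when(people("gender") === "male", 0)
    .when(people("gender") === "female", 1)
    .otherwise(2)
    .alias("gender_code"))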
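And to illustrate the UDF comment: a sketch of the same rule written both with the Structured API and as a UDF (the UDF version is shown only as the pattern to avoid; the function and column names are assumptions). Catalyst can inspect and optimize the first form, while the second is an opaque function call.

import org.apache.spark.sql.functions.{col, udf, when}

// Structured API: Catalyst sees the full expression tree and can optimize it
val structured = df.withColumn("Success",
  when(col("color") <=> "white", "Diamond")
    .when(col("size") > 10 && col("shape") === "Rhombus", "Diamond")
    .otherwise("0"))

// Same rule as a UDF: a black box Catalyst cannot look inside
val classify = udf((color: String, size: Int, shape: String) =>
  if (color == "white" || (size > 10 && shape == "Rhombus")) "Diamond" else "0")

val viaUdf = df.withColumn("Success", classify(col("color"), col("size"), col("shape")))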