Adding an ArrayList value to a new column in a Spark DataFrame using PySpark

12,735

Based on your comment

My array is variable and I have to add it to multiple places with different value. This approach is fine for adding either same value or for adding one or two arrays. It will not suit for adding huge data

I believe this is an XY problem. If you want a scalable solution (1000 rows is not huge, to be honest), then put the arrays in another dataframe and join. For example, if you want to connect by x1:

arrays = spark.createDataFrame([
    (1, [0.0, 0.0, 0.0]), (3, [0.0, 0.0, 0.0])
], ("x1", "x4"))


df.join(arrays, ["x1"])

Use a more complex join condition if your requirements call for one.
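The join approach can also be wrapped in a small helper when the arrays differ per row. This is only a sketch: rows_from_mapping, attach_arrays and arrays_by_key are names made up for illustration, spark is assumed to be an active SparkSession, and df is assumed to already contain the key column. It also assumes every array holds values of a single type, so that Spark can infer the schema.

```python
def rows_from_mapping(arrays_by_key):
    """Turn {key: [values, ...]} into (key, array) tuples, one per lookup row."""
    return [(key, list(values)) for key, values in arrays_by_key.items()]


def attach_arrays(spark, df, arrays_by_key, key_col="x1", array_col="x4"):
    """Join a per-key array column onto df.

    spark (an active SparkSession) and df (a DataFrame containing key_col)
    are supplied by the caller; the helper names here are hypothetical.
    """
    # Build the lookup dataframe the same way as the `arrays` example above.
    arrays = spark.createDataFrame(
        rows_from_mapping(arrays_by_key), (key_col, array_col)
    )
    # A left join keeps rows of df that have no matching key;
    # their array column is null.
    return df.join(arrays, [key_col], "left")
```

It could then be called as df = attach_arrays(spark, df, {1: [0, 1, 2, 3], 3: [4, 5, 0, 7]}). Switch "left" to "inner" if rows without an array should be dropped instead.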

To solve your immediate problem, see "How to add a constant column in a Spark DataFrame?": every element of the array has to be a column itself.

from pyspark.sql.functions import array, lit

array(lit(0.0), lit(0.0), lit(0.0))
#  Column<b'array(0.0, 0.0, 0.0)'>
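If the values start out as a Python list, as with array_list in the question, one way to get the same column is to wrap each element in lit and unpack the results into array. A minimal sketch; array_literal is a hypothetical helper name, not part of PySpark.

```python
def array_literal(values):
    """Build an array Column whose elements are the given Python values."""
    # Local import keeps the sketch self-contained; in real code this
    # would normally sit at module level.
    from pyspark.sql.functions import array, lit

    # Each element becomes its own literal column, then array() combines them.
    return array(*[lit(v) for v in values])
```

Usage would look like df = df.withColumn("x4", array_literal([0, 0, 0, 0])). This gives every row the same array, so it suits constants only; for per-row values the join is the better fit.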

Author: Abhijeet

Updated on June 04, 2022

Comments

  • Abhijeet
    Abhijeet almost 2 years

    I want to add a new column to my existing dataframe. Below is my dataframe:

    +---+---+-----+
    | x1| x2|   x3|
    +---+---+-----+
    |  1|  a| 23.0|
    |  3|  B|-23.0|
    +---+---+-----+
    

    I am able to add a constant column with df = df.withColumn("x4", lit(0)), like this:

    +---+---+-----+---+
    | x1| x2|   x3| x4|
    +---+---+-----+---+
    |  1|  a| 23.0|  0|
    |  3|  B|-23.0|  0|
    +---+---+-----+---+
    

    But I want to add an array (a list) to my df.

    Suppose [0,0,0,0] is the array to add; after adding it, my df should look like this:

    +---+---+-----+---------+
    | x1| x2|   x3|       x4|
    +---+---+-----+---------+
    |  1|  a| 23.0|[0,0,0,0]|
    |  3|  B|-23.0|[0,0,0,0]|
    +---+---+-----+---------+
    

    I tried this:

    array_list = [0,0,0,0]
    df = df.withColumn("x4", lit(array_list))
    

    But it gives this error:

    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
    : java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [0, 0, 0, 0, 0, 0]
    

    Does anybody know how to do this?

    • mkaran
      mkaran over 6 years
      Perhaps df.withColumn("some_array", array(lit(0), lit(0), lit(0), lit(0)))?
    • Abhijeet
      Abhijeet over 6 years
      But what if I have to add a different value to each row? This is not a permanent solution.
    • mkaran
      mkaran over 6 years
      If you need a different value for each row, then you possibly need to use a udf.
    • mkaran
      mkaran over 6 years
      Another thought is to use when: df.withColumn('some_array', when(df.some_column == 1, array(lit(0), lit(0), lit(0), lit(0))).otherwise(array(lit(1), lit(1), lit(1), lit(1))))
    • Steven
      Steven over 6 years
      What does your array depend on?
    • Abhijeet
      Abhijeet over 6 years
      My array is variable and I have to add it to multiple places with different value. This approach is fine for adding either same value or for adding one or two arrays. It will not suit for adding huge data like some 1000 rows.
  • Abhijeet
    Abhijeet over 6 years
    Fine, got the point. But one more question: what if I want to add different values to each row, like this?

    +---+---+-----+---------+
    | x1| x2|   x3|       x4|
    +---+---+-----+---------+
    |  1|  a| 23.0|[0,1,2,3]|
    |  3|  B|-23.0|[4,5,0,7]|
    |  4|  C|-23.0|[8,0,1,0]|
    +---+---+-----+---------+