Adding an ArrayList value to a new column in a Spark DataFrame using PySpark
My array is variable and I have to add it in multiple places with different values. This approach is fine for adding either the same value or one or two arrays. It will not suit adding huge data like some 1000 rows.
I believe this is an XY problem. If you want a scalable solution (1000 rows is not huge, to be honest), then use another DataFrame and join. For example, if you want to join on x1:
arrays = spark.createDataFrame([
(1, [0.0, 0.0, 0.0]), (3, [0.0, 0.0, 0.0])
], ("x1", "x4"))
df.join(arrays, ["x1"])
Add a more complex join condition depending on your requirements.
To solve your immediate problem, see "How to add a constant column in a Spark DataFrame?". All elements of the array should be columns:
from pyspark.sql.functions import array, lit
array(lit(0.0), lit(0.0), lit(0.0))
# Column<b'array(0.0, 0.0, 0.0)'>
Abhijeet
Updated on June 04, 2022
Comments
-
Abhijeet over 1 year
I want to add a new column to my existing DataFrame. Below is my DataFrame:
+---+---+-----+
| x1| x2|   x3|
+---+---+-----+
|  1|  a| 23.0|
|  3|  B|-23.0|
+---+---+-----+
I am able to add a constant column with
df = df.withColumn("x4", lit(0))
which gives:
+---+---+-----+---+
| x1| x2|   x3| x4|
+---+---+-----+---+
|  1|  a| 23.0|  0|
|  3|  B|-23.0|  0|
+---+---+-----+---+
But I want to add an array list to my df.
Suppose [0,0,0,0] is my array to add. After adding it, my df should look like this:
+---+---+-----+---------+
| x1| x2|   x3|       x4|
+---+---+-----+---------+
|  1|  a| 23.0|[0,0,0,0]|
|  3|  B|-23.0|[0,0,0,0]|
+---+---+-----+---------+
I tried this:
array_list = [0,0,0,0]
df = df.withColumn("x4", lit(array_list))
But it gives this error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [0, 0, 0, 0, 0, 0]
Does anybody know how to do this?
-
mkaran almost 6 years
Perhaps df.withColumn("some_array", array(lit(0), lit(0), lit(0), lit(0)))? src
-
Abhijeet almost 6 years
But what if I have to add a different value to each row? It is not a permanent solution.
-
mkaran almost 6 years
If you need a different value for each row, then you possibly need to use a udf.
-
mkaran almost 6 years
Another thought is to use when:
df.withColumn('some_array', when(df.some_column == 1, array(lit(0), lit(0), lit(0), lit(0))).otherwise(array(lit(1), lit(1), lit(1), lit(1))))
-
Steven almost 6 years
What does your array depend on?
-
Abhijeet almost 6 years
My array is variable and I have to add it in multiple places with different values. This approach is fine for adding either the same value or one or two arrays. It will not suit adding huge data like some 1000 rows.
-
Abhijeet almost 6 years
Fine, got the point. But one more question: what if I want to add different values to each row, like this?
+---+---+-----+---------+
| x1| x2|   x3|       x4|
+---+---+-----+---------+
|  1|  a| 23.0|[0,1,2,3]|
|  3|  B|-23.0|[4,5,0,7]|
|  4|  C|-23.0|[8,0,1,0]|
+---+---+-----+---------+