Partitioning by multiple columns in PySpark with columns in a list
Solution 1
Convert the column names to column expressions with a list comprehension, [col(x) for x in column_list]:
from pyspark.sql import Window
from pyspark.sql.functions import col
column_list = ["col1", "col2"]
win_spec = Window.partitionBy([col(x) for x in column_list])
Solution 2
Your first attempt should work.
Consider the following example:
import pyspark.sql.functions as f
from pyspark.sql import Window
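# sqlCtx is assumed to be an existing SQLContext; a SparkSession's createDataFrame works the same way
df = sqlCtx.createDataFrame(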
[
("a", "apple", 1),
("a", "orange", 2),
("a", "orange", 3),
("b", "orange", 3),
("b", "orange", 5)
],
["name", "fruit","value"]
)
df.show()
#+----+------+-----+
#|name| fruit|value|
#+----+------+-----+
#|   a| apple|    1|
#|   a|orange|    2|
#|   a|orange|    3|
#|   b|orange|    3|
#|   b|orange|    5|
#+----+------+-----+
Suppose you want to express each row's value as a fraction of the sum for its group, partitioning by the first two columns:
cols = ["name", "fruit"]
w = Window.partitionBy(cols)
df.select(cols + [(f.col('value') / f.sum('value').over(w)).alias('fraction')]).show()
#+----+------+--------+
#|name| fruit|fraction|
#+----+------+--------+
#|   a| apple|     1.0|
#|   b|orange|   0.375|
#|   b|orange|   0.625|
#|   a|orange|     0.6|
#|   a|orange|     0.4|
#+----+------+--------+
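To make the window computation concrete, here is a plain-Python sketch (no Spark required) that reproduces the same fractions by grouping on the (name, fruit) pair. The helper name fraction_of_group_sum is made up for illustration; it only mimics what f.sum('value').over(w) yields per partition and says nothing about how Spark executes it.

```python
from collections import defaultdict

# Same rows as the DataFrame in the example.
rows = [
    ("a", "apple", 1),
    ("a", "orange", 2),
    ("a", "orange", 3),
    ("b", "orange", 3),
    ("b", "orange", 5),
]


def fraction_of_group_sum(rows):
    """Hypothetical helper: value / (sum of value within the same (name, fruit) group)."""
    # First pass: total per partition key, the analogue of sum('value').over(w).
    totals = defaultdict(int)
    for name, fruit, value in rows:
        totals[(name, fruit)] += value
    # Second pass: every input row is kept, as with a window function.
    return [(name, fruit, value / totals[(name, fruit)]) for name, fruit, value in rows]


result = fraction_of_group_sum(rows)
for row in result:
    print(row)
```

Unlike this sketch, Spark does not guarantee row order after the shuffle, which is why the rows in the output above are not in input order.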
Solution 3
In PySpark >= 2.4, unpacking the list also works:
from pyspark.sql import Window
column_list = ["col1", "col2"]
win_spec = Window.partitionBy(*column_list)
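Both spellings can coexist because of ordinary Python argument handling: *column_list unpacks the list into separate positional arguments, and a variadic function may also choose to accept a single list. The sketch below uses a hypothetical stand-in, partition_by, to show that normalization in plain Python; it illustrates the general pattern under that assumption and is not PySpark's actual source.

```python
def partition_by(*cols):
    """Hypothetical stand-in: accept column names as varargs or as one list."""
    # If the caller passed a single list or tuple, treat its items as the columns.
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        cols = tuple(cols[0])
    return tuple(cols)


column_list = ["col1", "col2"]

unpacked = partition_by(*column_list)  # equivalent to partition_by("col1", "col2")
as_list = partition_by(column_list)    # one list argument, normalized inside

print(unpacked)
print(as_list)
```

With this kind of normalization, the list form and the unpacked form end up describing the same partitioning columns.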
Author
prk
Updated on May 15, 2021
Comments
-
prk almost 3 years
My question is similar to this thread: Partitioning by multiple columns in Spark SQL, but I'm working in PySpark rather than Scala and I want to pass in my columns as a list. I want to do something like this:
column_list = ["col1","col2"]
win_spec = Window.partitionBy(column_list)
I can get the following to work:
win_spec = Window.partitionBy(col("col1"))
This also works:
col_name = "col1"
win_spec = Window.partitionBy(col(col_name))
And this also works:
win_spec = Window.partitionBy([col("col1"), col("col2")])
-
EnterPassword about 2 years
Update for people coming to this answer: newer versions of PySpark allow you to pass in a list, as the answers show. See @Naguveeru's answer.