PySpark - Add a new column with a Rank by User


There is no really elegant solution for this at the moment. If you have to, you can try something like this:

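from pyspark.sql.functions import col

# Build a lookup table: one row per distinct user, with a 0-based index
lookup = (sparkdf.select("user")
    .distinct()
    .orderBy("user")
    .rdd
    .zipWithIndex()
    .map(lambda x: x[0] + (x[1], ))  # append the index to the Row's fields
    .toDF(["user", "rank"]))

# Join the ranks back onto the original rows and shift them to start at 1
sparkdf.join(lookup, ["user"]).withColumn("rank", col("rank") + 1)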
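
To make the intended result concrete, here is a minimal plain-Python sketch (no Spark involved) of what the lookup-and-join approach computes. The helper name add_user_rank and the sample rows are hypothetical and used only for illustration; the comments map each step back to the corresponding DataFrame/RDD call.

```python
# Plain-Python sketch (no Spark) of the lookup-and-join logic.
# `add_user_rank` and the sample rows are hypothetical, for illustration only.

def add_user_rank(rows):
    """Attach a 1-based dense rank of `user` to each (user, value) row."""
    # distinct() + orderBy("user"): unique users in sorted order
    users = sorted({user for user, _ in rows})
    # zipWithIndex() yields 0-based positions; the final "+ 1" makes them 1-based
    lookup = {user: index + 1 for index, user in enumerate(users)}
    # join on "user": every row picks up the rank of its user
    return [(user, value, lookup[user]) for user, value in rows]


rows = [("bob", 10), ("alice", 3), ("bob", 7), ("carol", 1)]
ranked = add_user_rank(rows)
for row in ranked:
    print(row)
```

Every row for the same user gets the same rank, and ranks have no gaps: here alice gets 1, both bob rows get 2, and carol gets 3.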

The window function alternative is much more concise:

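from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window

# Assumed definition of w (not given above): one global window ordered by user
w = Window.orderBy("user")
sparkdf.withColumn("rank", dense_rank().over(w))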

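but it is extremely inefficient and should be avoided in practice: to rank all users globally the window cannot be partitioned, so Spark moves every row into a single partition.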

Author: Kardu

Updated on June 09, 2022
