PySpark - Add a new column with a Rank by User
There is no really elegant solution for this at the moment. If you have to, you can try something like this:
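from pyspark.sql.functions import col

# Build a user -> 0-based index lookup from the sorted distinct users
lookup = (sparkdf.select("user")
    .distinct()
    .orderBy("user")
    .rdd
    .zipWithIndex()
    .map(lambda x: x[0] + (x[1], ))  # append the index to the Row's fields
    .toDF(["user", "rank"]))

# Join the lookup back and shift the rank so it starts at 1
sparkdf.join(lookup, ["user"]).withColumn("rank", col("rank") + 1)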
The window-function alternative is much more concise:
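from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank
w = Window.orderBy("user")  # no partitionBy: a single global ordering by user
sparkdf.withColumn("rank", dense_rank().over(w))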
but it is extremely inefficient and should be avoided in practice: a window with no partitionBy forces Spark to move all rows into a single partition.
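To make the intended result concrete, here is a minimal plain-Python sketch (no Spark involved) of what the lookup approach computes: take the distinct users, sort them, number them, and attach that number to every row. The rows, the user names, and the helper `add_user_rank` are hypothetical stand-ins for illustration; they are not part of the original data or of any PySpark API.

```python
# Plain-Python sketch of the lookup approach (no Spark required).
# The rows and user names below are hypothetical stand-ins for sparkdf.
rows = [
    ("alice@example.com", 2, 3),
    ("alice@example.com", 5, 5),
    ("bob@example.com", 8, 2),
    ("carol@example.com", 9, 3),
]


def add_user_rank(rows):
    """Append a 1-based dense rank of the user (first field) to each row."""
    # Mirrors distinct() + orderBy("user")
    users = sorted({row[0] for row in rows})
    # Mirrors zipWithIndex(); start=1 replaces the later "+ 1" step
    lookup = {user: rank for rank, user in enumerate(users, start=1)}
    # Mirrors join(lookup, ["user"])
    return [row + (lookup[row[0]],) for row in rows]


ranked = add_user_rank(rows)
for row in ranked:
    print(row)
```

Rows that share a user get the same rank, and ranks have no gaps, which is the same behavior `dense_rank` gives when ordering by user.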
Author by
Kardu
Updated on June 09, 2022
Kardu almost 2 years
I have this PySpark DataFrame
df = pd.DataFrame(
    np.array([
        ["[email protected]", 2, 3],
        ["[email protected]", 5, 5],
        ["[email protected]", 8, 2],
        ["[email protected]", 9, 3]
    ]),
    columns=['user', 'movie', 'rating'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

user              movie  rating
[email protected]   2      3
[email protected]   5      5
[email protected]   8      2
[email protected]   9      3
I need to add a new column with a rank by user.
I want to get this output:
user              movie  rating  Rank
[email protected]   2      3       1
[email protected]   5      5       1
[email protected]   8      2       2
[email protected]   9      3       3
How can I do that?