Column alias after groupBy in pyspark
Solution 1
You can use agg instead of calling the max method:
from pyspark.sql.functions import max
joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))
Similarly in Scala
import org.apache.spark.sql.functions.max
joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))
or
joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))
Solution 2
This is because you are aliasing the whole DataFrame object, not the Column. Here's an example of how to alias the Column only:
import pyspark.sql.functions as func
grpdf = joined_df \
.groupBy(temp1.datestamp) \
.max('diff') \
.select(func.col("max(diff)").alias("maxDiff"))
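A quick way to see the difference this answer describes, on hypothetical data (datestamp and diff are stand-in column names): aliasing the DataFrame leaves the generated column name untouched, while aliasing the Column renames it.

import pyspark.sql.functions as func
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2015-01-01", 10), ("2015-01-01", 25)], ["datestamp", "diff"])

# DataFrame.alias() only renames the DataFrame (useful for joins); the column is still "max(diff)"
df.groupBy("datestamp").max("diff").alias("maxDiff").printSchema()

# Column.alias() renames the column itself, so the output schema shows "maxDiff"
df.groupBy("datestamp").max("diff").select(func.col("max(diff)").alias("maxDiff")).printSchema()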
Solution 3
In addition to the answers already here, the following are also convenient ways if you know the name of the aggregated column, and they don't require importing from pyspark.sql.functions:
1
# back-ticks are needed so the generated column name is parsed as an
# identifier rather than as a new call to max()
grouped_df = joined_df.groupBy(temp1.datestamp) \
.max('diff') \
.selectExpr('`max(diff)` AS maxDiff')
See docs for info on .selectExpr()
2
grouped_df = joined_df.groupBy(temp1.datestamp) \
.max('diff') \
.withColumnRenamed('max(diff)', 'maxDiff')
See docs for info on .withColumnRenamed()
This answer here goes into more detail: https://stackoverflow.com/a/34077809
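An illustrative sketch of both variants on made-up data (the column names are assumptions); note that inside selectExpr the generated name max(diff) has to be back-quoted so it is read as a column reference rather than a fresh max() call:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
joined_df = spark.createDataFrame(
    [("2015-01-01", 10), ("2015-01-01", 25), ("2015-01-02", 7)],
    ["datestamp", "diff"],
)
grouped = joined_df.groupBy("datestamp").max("diff")  # columns: datestamp, max(diff)

# 1. rename via a SQL expression; back-ticks quote the generated column name
grouped.selectExpr("datestamp", "`max(diff)` AS maxDiff").show()

# 2. rename the generated column in place
grouped.withColumnRenamed("max(diff)", "maxDiff").show()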
Solution 4
You can use:
from pyspark.sql.functions import col
grpdf.select(col("max(diff)").alias("maxdiff"), col("sum(DIFF)").alias("sumdiff")).show()
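For completeness, a hypothetical sketch of how a grpdf with both aggregated columns could be produced and then renamed this way; the data, column names, and the sum aggregation are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as max_, sum as sum_

spark = SparkSession.builder.getOrCreate()
joined_df = spark.createDataFrame(
    [("2015-01-01", 10), ("2015-01-01", 25)], ["datestamp", "diff"]
)

# Without aliases, agg() generates the names "max(diff)" and "sum(diff)"
grpdf = joined_df.groupBy("datestamp").agg(max_("diff"), sum_("diff"))

grpdf.select(col("max(diff)").alias("maxdiff"), col("sum(diff)").alias("sumdiff")).show()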
Author by mhn
Updated on July 09, 2022

Comments
-
mhn almost 2 years
I need the resulting data frame in the line below to have an alias name "maxDiff" for the max('diff') column after groupBy. However, the line below does not make any change, nor throw an error.
grpdf = joined_df.groupBy(temp1.datestamp).max('diff').alias("maxDiff")
-
mhn over 8 years: yes, this works, except that the func. prefix is not required
-
Nhor over 8 years: it's not, if you have already imported pyspark.sql.functions to your script
-
abeboparebop about 7 years: I prefer this to the accepted answer because it doesn't require one to know in advance the name that Spark will give to the aggregated column.
-
serkan kucukbay over 6 years: Did you try the first code block? It seems wrong. Spark cannot resolve the column name that way. In your code block, Spark tries to find a diff column and run the max function on the given set, but grouped_df doesn't contain any diff column; it contains temp1.datestamp and max(diff).
-
user1978816 over 4 years: You do have to be careful, as the following syntax does not work: rmse = df.groupBy('name').agg({'sqerr': 'mean'}).alias('MSE'). The alias call just seems to be ignored. But this works: df.groupBy('name').agg(F.mean(df.sqerr).alias('MSE'))
-
user1978816 over 4 years: The problem is that you need to know what Spark will call the new column, which might not be obvious. @zero323's answer is better.