PySpark dataframe: Summing over a column while grouping over another


Solution 1

Why isn't it also showing the information from the first column?

Most likely because you're using the outdated Spark 1.3.x. If that's the case, you have to repeat the grouping columns inside agg, as follows:

from pyspark.sql import functions as func

(df
    .groupBy("order_item_order_id")
    .agg(func.col("order_item_order_id"), func.sum("order_item_subtotal"))
    .show())
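
For reference, on Spark 1.4 and later the grouping columns are retained in the result automatically, so repeating them inside agg is unnecessary; a minimal sketch of the same aggregation, reusing the func import from above:

(df
    .groupBy("order_item_order_id")
    .agg(func.sum("order_item_subtotal"))  # grouping column is kept automatically
    .show())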

Solution 2

A similar solution for your problem on a more recent PySpark (2.x) would look like this:

df = spark.createDataFrame(
    [(1, 299.98),
     (2, 199.99),
     (2, 250.0),
     (2, 129.99),
     (4, 49.98),
     (4, 299.95),
     (4, 150.0),
     (4, 199.92),
     (5, 299.98),
     (5, 299.95),
     (5, 99.96),
     (5, 299.98)],
    ['order_item_order_id', 'order_item_subtotal'])

df.groupBy('order_item_order_id').sum('order_item_subtotal').show()

Which results in the following output:

+-------------------+------------------------+
|order_item_order_id|sum(order_item_subtotal)|
+-------------------+------------------------+
|                  5|       999.8700000000001|
|                  1|                  299.98|
|                  2|                  579.98|
|                  4|                  699.85|
+-------------------+------------------------+
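
If you would rather have a friendlier column name and fewer floating-point digits, you can aggregate through agg with round and an alias; a minimal sketch against the df created above (the name total is just an illustrative choice):

from pyspark.sql import functions as func

(df
    .groupBy('order_item_order_id')
    # round the per-group sum to 2 decimals and rename the column
    .agg(func.round(func.sum('order_item_subtotal'), 2).alias('total'))
    .show())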

Solution 3

You can use a window function partitioned by the grouping column for that:

from pyspark.sql import functions as f
from pyspark.sql import Window

df.withColumn("value_field", f.sum("order_item_subtotal")
    .over(Window.partitionBy("order_item_order_id"))) \
    .show()
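
Note the difference from groupBy: the window version keeps every input row and attaches the per-group sum as a new column rather than collapsing to one row per group. If you want one row per group afterwards, you can deduplicate; a minimal sketch reusing value_field from above:

(df
    .withColumn("value_field",
                f.sum("order_item_subtotal").over(Window.partitionBy("order_item_order_id")))
    # keep only the group key and its total, one row per group
    .select("order_item_order_id", "value_field")
    .distinct()
    .show())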
Author by Admin

Updated on July 28, 2022

Comments

  • Admin, almost 2 years ago

    I have a dataframe such as the following

    In [94]: prova_df.show()
    
    
    order_item_order_id order_item_subtotal
    1                   299.98             
    2                   199.99             
    2                   250.0              
    2                   129.99             
    4                   49.98              
    4                   299.95             
    4                   150.0              
    4                   199.92             
    5                   299.98             
    5                   299.95             
    5                   99.96              
    5                   299.98             
    

    What I would like to do is to compute, for each different value of the first column, the sum over the corresponding values of the second column. I've tried doing this with the following code:

    from pyspark.sql import functions as func
    prova_df.groupBy("order_item_order_id").agg(func.sum("order_item_subtotal")).show()
    

    Which gives the following output:

    SUM('order_item_subtotal)
    129.99000549316406       
    579.9500122070312        
    199.9499969482422        
    634.819995880127         
    434.91000747680664 
    

    I'm not so sure it's doing the right thing. Why isn't it also showing the information from the first column? Thanks in advance for your answers.

  • karthik r, over 5 years ago
    What is value_field here?
  • information_interchange, over 4 years ago
    Just an arbitrary string that you want the column to be named.