Pyspark - Aggregation on multiple columns

  1. Follow the instructions from the README to include the spark-csv package
  2. Load data

    df = (sqlContext.read
        .format("com.databricks.spark.csv")
        .options(inferSchema="true", delimiter=";", header="true")
        .load("babynames.csv"))
    
  3. Import required functions

    from pyspark.sql.functions import count, avg
    
  4. Group by and aggregate (optionally use Column.alias to name the result columns, as shown after the list):

    df.groupBy("year", "sex").agg(avg("percent"), count("*"))
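
For reference, the same aggregation with the result columns named via Column.alias (the names avg_percent and count here are illustrative choices, not anything the data requires):

    from pyspark.sql.functions import avg, count

    (df.groupBy("year", "sex")
        .agg(avg("percent").alias("avg_percent"),
             count("*").alias("count")))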
    

Alternatively:

  • cast percent to numeric
  • reshape to a format ((year, sex), percent)
  • aggregateByKey using pyspark.statcounter.StatCounter (a sketch follows this list)
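
A minimal sketch of that RDD-based alternative, assuming babynames.csv is comma-delimited with a header row and the columns year, name, percent, sex:

    from pyspark.statcounter import StatCounter

    # Parse the file, drop the header row, cast percent to float,
    # and reshape each record to ((year, sex), percent)
    pairs = (sc.textFile("babynames.csv")
        .map(lambda line: line.split(","))
        .filter(lambda fields: fields[0] != "year")
        .map(lambda fields: ((fields[0], fields[3]), float(fields[2]))))

    # StatCounter accumulates count, mean, stdev, etc. for each key
    stats = pairs.aggregateByKey(
        StatCounter(),
        lambda acc, value: acc.merge(value),
        lambda acc1, acc2: acc1.mergeStats(acc2))

    # Extract (year, sex, avg(percent), count) tuples
    result = stats.map(
        lambda kv: (kv[0][0], kv[0][1], kv[1].mean(), kv[1].count()))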

Comments

  • Mohan, almost 2 years ago:

    I have data like below. Filename: babynames.csv.

    year    name    percent     sex
    1880    John    0.081541    boy
    1880    William 0.080511    boy
    1880    James   0.050057    boy
    

    I need to sort the input by year and sex, and I want the output aggregated as below (the result is to be assigned to a new RDD).

    year    sex   avg(percentage)   count(rows)
    1880    boy   0.070703         3
    

    I am not sure how to proceed after the following step in PySpark; I need help with this:

    testrdd = sc.textFile("babynames.csv")
    # Split on commas and drop the header row
    rows = testrdd.map(lambda y: y.split(',')).filter(lambda x: "year" not in x[0])
    aggregatedoutput = ????
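
    For completeness, a sketch of the missing step using the DataFrame approach from the answer above (assuming a comma delimiter, per the split(',') here) plus the requested sort:

        from pyspark.sql.functions import avg, count

        df = (sqlContext.read
            .format("com.databricks.spark.csv")
            .options(inferSchema="true", delimiter=",", header="true")
            .load("babynames.csv"))

        aggregatedoutput = (df
            .groupBy("year", "sex")
            .agg(avg("percent"), count("*"))
            .orderBy("year", "sex"))

    If an RDD is required, aggregatedoutput.rdd converts the result to an RDD of Rows.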
    
  • zero323, over 6 years ago: