Spark Scala GroupBy column and sum values

Solution 1

This should work: read the text file, split each line by the separator, map to key/value pairs with the appropriate fields, and use countByKey:

sc.textFile("path to the text file")
  .map(x => x.split(" ", -1))
  .map(x => (x(0), x(3))) // key = name; countByKey ignores the value
  .countByKey
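
Note that countByKey only counts rows per key, which happens to match the expected output in the question. If you literally need the sum of the views field (the third column), a minimal sketch along the same lines, assuming the same space-separated layout:

sc.textFile("path to the text file")
  .map(x => x.split(" ", -1))
  .map(x => (x(0), x(2).toInt)) // key = name, value = views
  .reduceByKey(_ + _)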

To complete the answer, you can also approach the problem using the DataFrame API (if this is possible for you, depending on your Spark version), for example:

val result = df.groupBy("column to group on").agg(count("column to count on"))

Another possibility is to use the SQL approach:

val df = spark.read.csv("csv path")
df.createOrReplaceTempView("temp_table")
val result = spark.sql("SELECT <col to group on>, count(<col to count on>) FROM temp_table GROUP BY <col to group on>")
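
For the columns in the question this would look roughly as follows, assuming the columns have been named name and title, e.g. via toDF or a CSV header (a plain spark.read.csv call names them _c0, _c1, ...):

val result = spark.sql("SELECT name, count(title) AS cnt FROM temp_table GROUP BY name")
result.show()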

Solution 2

I assume that you already have your RDD populated.

// For simplicity, build the data as a local collection
val data = Seq(("aa", "File:Sleeping_lion.jpg", 1, 8030),
    ("aa", "Main_Page", 1, 78261),
    ("aa", "Special:Statistics", 1, 20493),
    ("aa.b", "User:5.34.97.97", 1, 4749),
    ("aa.b", "User:80.63.79.2", 1, 4751),
    ("af", "Blowback", 2, 16896),
    ("af", "Bluff", 2, 21442),
    ("en", "Huntingtown,_Maryland", 1, 0))

// ... and parallelize it into an RDD for the RDD approaches below
val rdd = sc.parallelize(data)

Dataframe approach

val sql = new SQLContext(sc)
import sql.implicits._
import org.apache.spark.sql.functions._

val df = data.toDF("name", "title", "views", "size")
df.groupBy($"name").agg(count($"name") as "count").show()

Result

+----+-----+
|name|count|
+----+-----+
|  aa|    3|
|  af|    2|
|aa.b|    2|
|  en|    1|
+----+-----+
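
If you need the sum of the views column instead of the row count, the same DataFrame approach works with sum (already imported from org.apache.spark.sql.functions above); note this gives 4 for af with the sample data, whereas the expected output in the question matches the count:

df.groupBy($"name").agg(sum($"views") as "views").show()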

RDD Approach (countByKey(...))

rdd.keyBy(f => f._1).countByKey().foreach(println(_))

RDD Approach (reduceByKey(...))

rdd.map(f => (f._1, 1)).reduceByKey((accum, curr) => accum + curr).foreach(println(_))
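
Along the same lines, if the sum of views were needed instead of the count, emit the views field as the value (a sketch, using the same rdd as above):

rdd.map(f => (f._1, f._3)).reduceByKey((accum, curr) => accum + curr).foreach(println(_))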

If any of this does not solve your problem, please share where exactly you are stuck.


Comments

  • Jeet Banerjee almost 2 years

    I am a newbie in Apache-spark and recently started coding in Scala.

    I have an RDD with 4 columns that looks like this (columns: 1 - name, 2 - title, 3 - views, 4 - size):

    aa    File:Sleeping_lion.jpg 1 8030
    aa    Main_Page              1 78261
    aa    Special:Statistics     1 20493
    aa.b  User:5.34.97.97        1 4749
    aa.b  User:80.63.79.2        1 4751
    af    Blowback               2 16896
    af    Bluff                  2 21442
    en    Huntingtown,_Maryland  1 0
    

    I want to group on the name column and get the sum of the views column.

    It should be like this:

    aa   3
    aa.b 2
    af   2
    en   1
    

    I have tried to use groupByKey and reduceByKey but I am stuck and unable to proceed further.

    • Jacek Laskowski about 6 years
      Why have you bet on RDD API if "I am a newbie in Apache-spark and recently started coding in Scala."? Why not Spark SQL's Dataframe API?
    • eugenio calabrese about 6 years
      I have improved my answer below to include two alternative ways to achieve the result.
  • Jeet Banerjee about 6 years
    When I tried methods 2 and 3, I got this: 26: error: value _1 is not a member of Array[String]
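
    That error suggests the RDD elements are Array[String] (for example, the output of splitting each line), which has no tuple accessor _1; index into the array instead. A minimal sketch of the fix, assuming rdd holds the split lines:

    rdd.map(f => (f(0), 1)).reduceByKey(_ + _) // f is an Array[String]: f(0) instead of f._1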