Spark Scala GroupBy column and sum values
Solution 1
This should work: read the text file, split each line by the separator, map to key/value pairs with the appropriate fields, and use countByKey:
sc.textFile("path to the text file")
.map(x => x.split(" ",-1))
.map(x => (x(0),x(3)))
.countByKey
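As an illustration of what countByKey does, here is a minimal plain-Scala sketch of the same pipeline on a few hypothetical sample lines (shaped like the asker's file; not their real data), using plain collections instead of an RDD:

```scala
// Hypothetical sample lines in the asker's format: name title views size
val lines = Seq(
  "aa File:Sleeping_lion.jpg 1 8030",
  "aa Main_Page 1 78261",
  "af Blowback 2 16896"
)

// Same steps as the Spark snippet above: split each line, pair up the
// first and fourth fields, then count occurrences per key.
val counts: Map[String, Int] =
  lines
    .map(_.split(" ", -1))
    .map(x => (x(0), x(3)))
    .groupBy(_._1)
    .map { case (k, pairs) => (k, pairs.size) }
// counts == Map("aa" -> 2, "af" -> 1)
```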
To complete my answer, you can also approach the problem using the DataFrame API (if this is possible for you, depending on your Spark version). Example:
val result = df.groupBy("column to Group on").agg(count("column to count on"))
Another possibility is to use the SQL approach:
val df = spark.read.csv("csv path")
df.createOrReplaceTempView("temp_table")
val result = spark.sql("select <col to group on>, count(<col to count on>) from temp_table group by <col to group on>")
Solution 2
I assume that you already have your RDD populated.
//For simplicity, I build RDD this way
val data = Seq(("aa", "File:Sleeping_lion.jpg", 1, 8030),
("aa", "Main_Page", 1, 78261),
("aa", "Special:Statistics", 1, 20493),
("aa.b", "User:5.34.97.97", 1, 4749),
("aa.b", "User:80.63.79.2", 1, 4751),
("af", "Blowback", 2, 16896),
("af", "Bluff", 2, 21442),
("en", "Huntingtown,_Maryland", 1, 0))
Dataframe approach
val sql = new SQLContext(sc)
import sql.implicits._
import org.apache.spark.sql.functions._
val df = data.toDF("name", "title", "views", "size")
df.groupBy($"name").agg(count($"name") as "count").show
Result
+----+-----+
|name|count|
+----+-----+
| aa| 3|
| af| 2|
|aa.b| 2|
| en| 1|
+----+-----+
RDD Approach (countByKey(...))
rdd.keyBy(f => f._1).countByKey().foreach(println(_))
RDD Approach (reduceByKey(...))
rdd.map(f => (f._1, 1)).reduceByKey((accum, curr) => accum + curr).foreach(println(_))
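The reduceByKey pattern above generalizes beyond counting: the same accumulate step can sum any column, which matches the question's goal of summing views. A minimal plain-Scala sketch of both variants, on a hypothetical subset of the sample tuples (not run on an actual RDD):

```scala
// Sample tuples shaped like the data above: (name, title, views, size)
val rows = Seq(
  ("aa", "File:Sleeping_lion.jpg", 1, 8030),
  ("aa", "Main_Page", 1, 78261),
  ("af", "Blowback", 2, 16896),
  ("af", "Bluff", 2, 21442)
)

// Count per key, mirroring rdd.map(f => (f._1, 1)).reduceByKey(_ + _)
val countPerKey: Map[String, Int] =
  rows.map(f => (f._1, 1))
      .groupBy(_._1)
      .map { case (k, vs) => (k, vs.map(_._2).reduce(_ + _)) }
// countPerKey == Map("aa" -> 2, "af" -> 2)

// The same pattern with the views column instead of a constant 1
val viewsPerKey: Map[String, Int] =
  rows.map(f => (f._1, f._3))
      .groupBy(_._1)
      .map { case (k, vs) => (k, vs.map(_._2).sum) }
// viewsPerKey == Map("aa" -> 2, "af" -> 4)
```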
If any of this does not solve your problem, please share where exactly you are stuck.
Comments
-
Jeet Banerjee almost 2 years
I am a newbie in Apache-spark and recently started coding in Scala.
I have a RDD with 4 columns that looks like this: (Columns 1 - name, 2- title, 3- views, 4 - size)
aa File:Sleeping_lion.jpg 1 8030
aa Main_Page 1 78261
aa Special:Statistics 1 20493
aa.b User:5.34.97.97 1 4749
aa.b User:80.63.79.2 1 4751
af Blowback 2 16896
af Bluff 2 21442
en Huntingtown,_Maryland 1 0
I want to group based on Column Name and get the sum of Column views.
It should be like this:
aa 3
aa.b 2
af 2
en 1
I have tried to use groupByKey and reduceByKey, but I am stuck and unable to proceed further.
-
Jacek Laskowski about 6 years
Why have you bet on the RDD API if "I am a newbie in Apache-spark and recently started coding in Scala."? Why not Spark SQL's DataFrame API?
-
eugenio calabrese about 6 years
I have improved my answer below to include two alternative ways to achieve the result.
-
Jeet Banerjee about 6 years
When I tried methods 2 and 3, I got this: 26: error: value _1 is not a member of Array[String]
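That error suggests the RDD's elements are Array[String] (the result of split) rather than tuples; _1 is only defined on tuples. A hedged sketch of the likely fix, shown here on plain collections with hypothetical rows (field positions assumed from the 4-column layout): convert each array to a tuple, or index with x(0) instead of _1.

```scala
// After textFile(...).map(_.split(" ", -1)) each element is an Array[String],
// which has no _1 accessor.
val arrays: Seq[Array[String]] = Seq(
  Array("aa", "Main_Page", "1", "78261"),
  Array("af", "Blowback", "2", "16896")
)

// Build (name, title, views, size) tuples so that f._1 works downstream
val tuples = arrays.map(x => (x(0), x(1), x(2).toInt, x(3).toInt))

val keys = tuples.map(_._1)
// keys == Seq("aa", "af")
```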