I have RDD[(String, String, Int)] and want to use reduceByKey to get the result as shown below. I don't want it to convert to DF and then perform groupBy operation to get result. Int is constant with value as 1 always.
Is it possible to use reduceByKey here to get the result? Presenting it in Tabularformat for easy reading
Question
| String | String | Int |
|---|---|---|
| First | Apple | 1 |
| Second | Banana | 1 |
| First | Flower | 1 |
| Third | Tree | 1 |
Result
| String | String | Int |
|---|---|---|
| First | Apple,Flower | 2 |
| Second | Banana | 1 |
| Third | Tree | 1 |
CodePudding user response:
You can not use reduceByKey if you have a Tuple3, you could use reduceByKey though if you had RDD[(String, String)].
Also, once you groupBy, you can then apply reduceByKey, but since keys would be unique, it makes no sense to call reduceByKey, therefore we use map to map one on one values.
So, assume df is your main table, then this piece of code:
val rest = df.groupBy(x => x._1).map(x => {
val key = x._1 // ex: First
val groupedData = x._2 // ex: [(First, Apple, 1), (First, Flower, 1)]
// ex: [(First, Apple, 1), (First, Flower, 1)] => [Apple, Flower] => Apple, Flower
val concat = groupedData.map(d => d._2).mkString(",")
// ex: [(First, Apple, 1), (First, Flower, 1)] => [1, 1] => 2
val sum = groupedData.map(d => d._3).sum
(key, concat, sum) // return a tuple3 again, same format
})
Returns this result:
(Second,Banana,1)
(First,Apple,Flower,2)
(Third,Tree,1)
EDIT
Implementing reduceByKey without sum, if your dataset looks like:
val data = List(
("First", "Apple"),
("Second", "Banana"),
("First", "Flower"),
("Third", "Tree")
)
val df: RDD[(String, String)] = sparkSession.sparkContext.parallelize(data)
Then, this:
df.reduceByKey((acc, next) => acc "," next)
Gives this:
(First,Apple,Flower)
(Second,Banana)
(Third,Tree)
Good luck!
