I have org.apache.spark.sql.DataFrame = [_1: string, _2: string] with column containing the value in below format.
Input
| _1 | _2 |
|---|---|
| 1 | 1,2,3,4 |
| 11 | 6,7,11,13 |
| 15 | 22 |
| 20 | 5,10,21,22,28,32 |
I want to find the average from column _2 and also count the number of elements in it as shown below. What will be the most efficient way since this logic will be applied to millions of rows
Output
| _1 | _2 | Average_2 | Count_2 |
|---|---|---|---|
| 1 | 1,2,3,4 | 2.5 | 4 |
| 11 | 6,7,11,13 | 9.25 | 4 |
| 15 | 22 | 22 | 1 |
| 20 | 5,10,21,22,28,32 | 19.66 | 6 |
CodePudding user response:
Split the string column then use array functions size aggregate:
val df2 = df
.withColumn("arr_2", split($"_2", ","))
.withcolumn("Count_2", size($"arr_2"))
.withcolumn(
"Average_2",
aggregate($"arr_2", lit(0), (s, x) => s x, s => s / $"Count_2")
).drop("arr_2")
