Scala spark multiple values in a column average them-CodePudding

I have org.apache.spark.sql.DataFrame = [_1: string, _2: string] with column containing the value in below format.

Input

_1	_2
1	1,2,3,4
11	6,7,11,13
15	22
20	5,10,21,22,28,32

I want to find the average from column _2 and also count the number of elements in it as shown below. What will be the most efficient way since this logic will be applied to millions of rows

Output

_1	_2	Average_2	Count_2
1	1,2,3,4	2.5	4
11	6,7,11,13	9.25	4
15	22	22	1
20	5,10,21,22,28,32	19.66	6

CodePudding user response：

Split the string column then use array functions size aggregate:

val df2 = df
  .withColumn("arr_2", split($"_2", ","))
  .withcolumn("Count_2", size($"arr_2"))
  .withcolumn(
    "Average_2",
    aggregate($"arr_2", lit(0), (s, x) => s   x, s => s / $"Count_2")
  ).drop("arr_2")