Softmax function on a column grouped by another column in PySpark

I have a PySpark DataFrame as below:

Variant	Category	Score	Record
A	915	11	Record-1
A	907	10	Record-2
A	914	10	Record-3
B	914	9	Record-1
B	907	2	Record-1

I want to calculate the softmax of the Score column, grouped by the Variant column, so that the scores within each variant sum to 1 (100%), as shown below. A variant can appear in 1, 2, or 3 rows.

Variant	Category	Score	Record	Softmax_Score
A	915	11	Record-1	0.35
A	907	10	Record-2	0.32
A	914	10	Record-3	0.32
B	914	9	Record-1	0.82
B	907	2	Record-1	0.18

I know there is a softmax function in Python, but I am not sure how to achieve this in PySpark.

Softmax formula:

import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

Way to do it in Pandas:

test['Softmax_Score'] = test.groupby('Variant')['Score'].transform(softmax)

CodePudding user response:

You can calculate it the same way in PySpark, using the exp and sum functions over a window partitioned by Variant, like this:

from pyspark.sql import Window
from pyspark.sql import functions as F

result = df.withColumn(
    "Softmax_Score",
    F.exp("Score") / F.sum(F.exp("Score")).over(Window.partitionBy("Variant"))
)

result.show()
# +-------+--------+-----+--------+--------------------+
# |Variant|Category|Score|  Record|       Softmax_Score|
# +-------+--------+-----+--------+--------------------+
# |      A|     915|   11|Record-1|   0.576116884765829|
# |      A|     907|   10|Record-2| 0.21194155761708544|
# |      A|     914|   10|Record-3| 0.21194155761708544|
# |      B|     914|    9|Record-1|  0.9990889488055994|
# |      B|     907|    2|Record-1|9.110511944006454E-4|
# +-------+--------+-----+--------+--------------------+
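
Note that these values differ from the Softmax_Score column in the question. The numbers there (0.35, 0.32, 0.32, 0.82, 0.18) match Score / sum(Score) within each Variant, which is a plain proportional share rather than a softmax. The following is a minimal pure-Python sketch for cross-checking both calculations on the sample rows. The helper names grouped_softmax and grouped_share are made up for this illustration, and the subtraction of the per-group maximum is an addition not present in the answer above: it leaves the softmax result unchanged but avoids overflow in exp when scores are large.

```python
import math
from collections import defaultdict

# Sample rows from the question, reduced to (Variant, Score)
rows = [("A", 11), ("A", 10), ("A", 10), ("B", 9), ("B", 2)]


def grouped_softmax(rows):
    """Return one softmax value per row, normalised within each Variant."""
    groups = defaultdict(list)
    for variant, score in rows:
        groups[variant].append(score)

    result = []
    for variant, score in rows:
        scores = groups[variant]
        shift = max(scores)  # assumption: shift by the group max for numerical stability
        denom = sum(math.exp(s - shift) for s in scores)
        result.append(math.exp(score - shift) / denom)
    return result


def grouped_share(rows):
    """Return Score / sum(Score) within each Variant (proportional share, not softmax)."""
    totals = defaultdict(float)
    for variant, score in rows:
        totals[variant] += score
    return [score / totals[variant] for variant, score in rows]


softmax_scores = grouped_softmax(rows)  # comparable to the Softmax_Score column in the answer
share_scores = grouped_share(rows)      # comparable to the expected column in the question
```

If the proportional share is what is actually wanted, the same window approach should work in PySpark with F.col("Score") / F.sum("Score").over(Window.partitionBy("Variant")). For a stabilised softmax, one option is F.exp(F.col("Score") - F.max("Score").over(Window.partitionBy("Variant"))) as the numerator, with the sum of that same expression over the window as the denominator.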