Softmax function on a column groupby another column in Pyspark

Time:02-02

I have a PySpark DataFrame as below:

Variant Category Score Record
A 915 11 Record-1
A 907 10 Record-2
A 914 10 Record-3
B 914 9 Record-1
B 907 2 Record-1

I want to calculate the softmax of the Score column, grouped by the Variant column, so that the scores within each variant sum to 1 (that is, 100%), as shown below. A Variant can appear in 3, 2, or 1 rows.

Variant Category Score Record Softmax_Score
A 915 11 Record-1 0.35
A 907 10 Record-2 0.32
A 914 10 Record-3 0.32
B 914 9 Record-1 0.82
B 907 2 Record-1 0.18

I know there is a softmax function in Python, but I am not sure how to achieve this in PySpark.

Softmax formula:

import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

Way to do it in Pandas:

test['Softmax_Score'] = test.groupby('Variant')['Score'].transform(softmax)

CodePudding user response:

You can calculate it the same way in PySpark, using the exp and sum functions over a window partitioned by Variant:

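from pyspark.sql import Window
from pyspark.sql import functions as F

# Each row is normalized by the sum of exp(Score) over its own Variant.
w = Window.partitionBy("Variant")

result = df.withColumn(
    "Softmax_Score",
    F.exp("Score") / F.sum(F.exp("Score")).over(w)
)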

result.show()
# +-------+--------+-----+--------+--------------------+
# |Variant|Category|Score|  Record|       Softmax_Score|
# +-------+--------+-----+--------+--------------------+
# |      A|     915|   11|Record-1|   0.576116884765829|
# |      A|     907|   10|Record-2| 0.21194155761708544|
# |      A|     914|   10|Record-3| 0.21194155761708544|
# |      B|     914|    9|Record-1|  0.9990889488055994|
# |      B|     907|    2|Record-1|9.110511944006454E-4|
# +-------+--------+-----+--------+--------------------+
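
The softmax values above differ from the expected figures in the question (0.35, 0.32, ...). Those figures look like the plain proportion Score / sum(Score) within each variant rather than a softmax. As a rough check that needs no Spark session, here is a plain-Python sketch that computes both on the question's rows. The names grouped_softmax and grouped_share are made up for this illustration, and the rows are typed in by hand from the table. The softmax version also subtracts the per-group maximum before exponentiating, a common trick (not part of the answer above) that leaves the result unchanged but avoids overflow for large scores.

```python
import math
from collections import defaultdict

# (Variant, Score) pairs copied by hand from the question's table.
rows = [("A", 11), ("A", 10), ("A", 10), ("B", 9), ("B", 2)]


def grouped_softmax(rows):
    """Softmax of the score within each variant, in input row order."""
    # Largest score per group. Subtracting it keeps exp() from overflowing;
    # the common factor cancels between numerator and denominator.
    group_max = {}
    for variant, score in rows:
        group_max[variant] = max(score, group_max.get(variant, score))

    # Denominator: sum of the shifted exponentials per group.
    denom = defaultdict(float)
    for variant, score in rows:
        denom[variant] += math.exp(score - group_max[variant])

    return [
        math.exp(score - group_max[variant]) / denom[variant]
        for variant, score in rows
    ]


def grouped_share(rows):
    """Plain proportion Score / sum(Score) within each variant."""
    total = defaultdict(float)
    for variant, score in rows:
        total[variant] += score
    return [score / total[variant] for variant, score in rows]


softmax_scores = grouped_softmax(rows)
share_scores = grouped_share(rows)
```

Here softmax_scores agrees with the Softmax_Score column printed by result.show(), while share_scores rounds to 0.35, 0.32, 0.32, 0.82, 0.18, the figures in the question. If the proportion is what is actually wanted, it could presumably be written in PySpark as the Score column divided by F.sum("Score") over the same window. Likewise, the max-subtraction should carry over by subtracting F.max("Score") over that window before calling F.exp.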