I have a PySpark DataFrame as below:
| Variant | Category | Score | Record |
|---|---|---|---|
| A | 915 | 11 | Record-1 |
| A | 907 | 10 | Record-2 |
| A | 914 | 10 | Record-3 |
| B | 914 | 9 | Record-1 |
| B | 907 | 2 | Record-1 |
I want to calculate the softmax of the Score column, grouped by the Variant column, so that the scores within each Variant sum to 1 (that is, 100%), as shown below. A Variant can appear in 1, 2, or 3 rows.
| Variant | Category | Score | Record | Softmax_Score |
|---|---|---|---|---|
| A | 915 | 11 | Record-1 | 0.35 |
| A | 907 | 10 | Record-2 | 0.32 |
| A | 914 | 10 | Record-3 | 0.32 |
| B | 914 | 9 | Record-1 | 0.82 |
| B | 907 | 2 | Record-1 | 0.18 |
I know there is a softmax function in Python, but I'm not sure how to achieve this in PySpark.
Softmax formula:
import numpy as np
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)
Way to do it in Pandas:
test['Softmax_Score'] = test.groupby('Variant')['Score'].transform(softmax)
Answer:
You can calculate it the same way in PySpark: take exp of Score and divide it by the sum of exp of Score over a window partitioned by Variant, like this:
from pyspark.sql import Window, functions as F

# Exponentiate each Score, then divide by the sum of exponentials within its Variant.
result = df.withColumn(
    "Softmax_Score",
    F.exp("Score") / F.sum(F.exp("Score")).over(Window.partitionBy("Variant"))
)
result.show()
# +-------+--------+-----+--------+--------------------+
# |Variant|Category|Score|  Record|       Softmax_Score|
# +-------+--------+-----+--------+--------------------+
# |      A|     915|   11|Record-1|   0.576116884765829|
# |      A|     907|   10|Record-2| 0.21194155761708544|
# |      A|     914|   10|Record-3| 0.21194155761708544|
# |      B|     914|    9|Record-1|  0.9990889488055994|
# |      B|     907|    2|Record-1|9.110511944006454E-4|
# +-------+--------+-----+--------+--------------------+
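
As a cross-check on the numbers above, the same grouped softmax can be sketched in plain Python with no Spark session. The helper name `grouped_softmax` and the `rows` list are made up for this illustration and are not part of any library. The sketch also subtracts each group's maximum score before exponentiating, a common trick that leaves the result mathematically unchanged but avoids overflow when scores are large. The window expression above does not do this.

```python
import math
from collections import defaultdict

def grouped_softmax(rows):
    """Return one softmax value per (variant, score) row, normalised within each variant."""
    # Largest score per group; subtracting it keeps exp() from overflowing.
    group_max = {}
    for variant, score in rows:
        group_max[variant] = max(score, group_max.get(variant, score))

    # Shifted exponentials and their per-group sums.
    exps = [math.exp(score - group_max[variant]) for variant, score in rows]
    group_sum = defaultdict(float)
    for (variant, _), e in zip(rows, exps):
        group_sum[variant] += e

    # Divide each shifted exponential by its own group's total.
    return [e / group_sum[variant] for (variant, _), e in zip(rows, exps)]

# Sample data from the question, reduced to (Variant, Score) pairs.
rows = [("A", 11), ("A", 10), ("A", 10), ("B", 9), ("B", 2)]
scores = grouped_softmax(rows)
for (variant, score), s in zip(rows, scores):
    print(variant, score, round(s, 4))
```

Rounded to four decimals, the printed values agree with the Softmax_Score column produced by the window expression, and the values within each Variant sum to 1.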
