Get max value of multiple group by in PySpark-CodePudding

I have a pyspark dataframe with this cols and values:

|employee|department|salary|
 -------- ---------- ------ 
|   James|     Sales| 90000|
| Michael|     Sales| 86000|
|  Robert|     Sales| 81000|
|   Maria|   Finance| 90000|
|   Raman|   Finance| 99000|
|   Scott|   Finance| 83000|
|    Jeff| Marketing| 80000|
|   Kumar| Marketing| 91000|
|   Sarah| Marketing| 95000|

I want to have a dataframe with the employee with the highest salary for each department.

So my result will be :

|employee|department|highest salary|
 -------- ---------- --------------- 
|   James|     Sales|         90000|
|   Raman|   Finance|         99000|
|   Sarah| Marketing|         95000|

I already tried multiple things with groupby but I don't know what to do.

Thank you for your help.

CodePudding user response：

You can use group by and max on struct column to get the highest salary by department with the associated employee like this:

import pyspark.sql.functions as F

result = df.groupBy("department") \
    .agg(F.max(F.struct("salary", "employee")).alias("max")) \
    .selectExpr("max.employee", "department", "max.salary as highest_salary")

result.show()
# -------- ---------- -------------- 
#|employee|department|highest_salary|
# -------- ---------- -------------- 
#|   Raman|   Finance|         99000|
#|   Sarah| Marketing|         95000|
#|   James|     Sales|         90000|
# -------- ---------- --------------

CodePudding user response：

You can use the analytic functions rank or row_number.

df = df.withColumn('rank', F.expr('rank() over (partition by department order by salary desc)')) \
    .filter('rank=1').drop('rank')
df.show(truncate=False)