I have a pyspark dataframe with this cols and values:
|employee|department|salary|
-------- ---------- ------
| James| Sales| 90000|
| Michael| Sales| 86000|
| Robert| Sales| 81000|
| Maria| Finance| 90000|
| Raman| Finance| 99000|
| Scott| Finance| 83000|
| Jeff| Marketing| 80000|
| Kumar| Marketing| 91000|
| Sarah| Marketing| 95000|
I want to have a dataframe with the employee with the highest salary for each department.
So my result will be :
|employee|department|highest salary|
-------- ---------- ---------------
| James| Sales| 90000|
| Raman| Finance| 99000|
| Sarah| Marketing| 95000|
I already tried multiple things with groupby but I don't know what to do.
Thank you for your help.
CodePudding user response:
You can use group by and max on struct column to get the highest salary by department with the associated employee like this:
import pyspark.sql.functions as F
result = df.groupBy("department") \
.agg(F.max(F.struct("salary", "employee")).alias("max")) \
.selectExpr("max.employee", "department", "max.salary as highest_salary")
result.show()
# -------- ---------- --------------
#|employee|department|highest_salary|
# -------- ---------- --------------
#| Raman| Finance| 99000|
#| Sarah| Marketing| 95000|
#| James| Sales| 90000|
# -------- ---------- --------------
CodePudding user response:
You can use the analytic functions rank or row_number.
df = df.withColumn('rank', F.expr('rank() over (partition by department order by salary desc)')) \
.filter('rank=1').drop('rank')
df.show(truncate=False)
