I'm having a dataframe with the below details and I need to iterate over each row to create the below formatted nested json string.
| Employee name | Salary | sick_leave_day | paid_leave_day |
|---|---|---|---|
| Karthi | 20000 | 10-10-2021 | Y |
| Karthi | 20000 | 11-12-2021 | Y |
| Karthi | 20000 | 13-12-2021 | N |
| Rajesh | 25000 | 15-12-2021 | Y |
| Rajesh | 25000 | 17-11-2021 | N |
| Rajesh | 25000 | 10-10-2021 | Y |
The output json should look like below, Here, Name and salary is having unique data and sick_leave_day,paid_leave_day changes for each row.
{
"Name": "Karthi",
"Salary": "20000",
leave_details: [{
"sick_leave_day": "10-10-2021",
"paid_leave_day ": "Y"
},
{
"sick_leave_day": "11-12-2021",
"paid_leave_day ": "Y"
},
{
"sick_leave_day": "13-12-2021",
"paid_leave_day ": "N"
}
]
}
I have tried to iterate over dataframe using foreach method, but I'm unable to get the logic. If I group the name and salary, I can't get the value of name and salary only once. So, please help me to write the logic by iterating over each row to create this logic. Thanks in advance.
CodePudding user response:
You can use the collect_list aggregation function to group by Employee and get the list of leave days.
For instance you can manually create a JSON with the 2 "leave_day" columns then aggregate using collect_list:
var result = df.withColumn("new_col", concat(lit("{ sick_leave_day:"), df.col("sick_leave_day"), lit(", paid_leave_day:"), df.col("paid_leave_day"), lit("}")))
.groupBy("Employee name", "Salary")
.agg(collect_list("new_col"))
result.show(truncate=false)
------------- ------ ------------------------------------------------------------------------------------------------------------------------------------------------
|Employee name|Salary|collect_list(new_col) |
------------- ------ ------------------------------------------------------------------------------------------------------------------------------------------------
|Rajesh |25000 |[{ sick_leave_day:15-12-2021, paid_leave_day:Y}, { sick_leave_day:17-11-2021, paid_leave_day:N}, { sick_leave_day:10-10-2021, paid_leave_day:Y}]|
|Karthi |20000 |[{ sick_leave_day:10-10-2021, paid_leave_day:Y}, { sick_leave_day:11-12-2021, paid_leave_day:Y}, { sick_leave_day:13-12-2021, paid_leave_day:N}]|
------------- ------ ------------------------------------------------------------------------------------------------------------------------------------------------
CodePudding user response:
Group by Employee name and collect list of structs for leave_details array and first (or max) Salary then save a json:
val result = df.groupBy(col("Employee name").as("name")).agg(
first($"Salary").as("Salary"),
collect_list(struct($"sick_leave_day", $"paid_leave_day")).as("leave_details")
)
result.toJSON.collect.foreach(println(_))
//{"name":"Rajesh","Salary":"25000","leave_details":[{"sick_leave_day":"15-12-2021","paid_leave_day":"Y"},{"sick_leave_day":"17-11-2021","paid_leave_day":"N"},{"sick_leave_day":"10-10-2021","paid_leave_day":"Y"}]}
//{"name":"Karthi","Salary":"20000","leave_details":[{"sick_leave_day":"10-10-2021","paid_leave_day":"Y"},{"sick_leave_day":"11-12-2021","paid_leave_day":"Y"},{"sick_leave_day":"13-12-2021","paid_leave_day":"N"}]}
