How to compute the number of performance such that performance = P<=5 ; P>5 & P<=15 ; P>15
| address | performance = P |
|---|---|
| NACELLES | 589 |
| NACELLES | 0 |
| NACELLES | 48 |
| NACELLES | 318 |
| NACELLES | 378 |
| NACELLES | 52 |
| NACELLES | 45 |
| NACELLES | 201 |
| NACELLES | 416 |
| NACELLES | 29 |
| NACELLES | 183 |
| NACELLES | 53 |
| NACELLES | 7 |
| NACELLES | 127 |
| NACELLES | 157 |
| NACELLES | 248 |
| NACELLES | 10 |
| NACELLES | 317 |
| NACELLES | 2 |
| NACELLES | 4 |
We obtain this dataset
| address | P<=5 | P>5 & P<=15 | P> 15 |
|---|---|---|---|
| NACELLES | 15 % | 10 % | 75 % |
CodePudding user response:
using your dataframe as an example :
-------- -----------
| address|performance|
-------- -----------
|NACELLES| 589|
|NACELLES| 0|
|NACELLES| 48|
|NACELLES| 318|
You simply have to aggregate and sum using a when function :
df.groupBy("address").agg(
(F.sum(F.when(F.col("performance") <= 5, 1)) / F.count("*")).alias("P<=5"),
(
F.sum(F.when((F.col("performance") > 5) & (F.col("performance") <= 15), 1))
/ F.count("*")
).alias("P>5 & P<=15"),
(F.sum(F.when(F.col("performance") > 15, 1)) / F.count("*")).alias("P>15"),
).show()
-------- ---- ----------- ----
| address|P<=5|P>5 & P<=15|P>15|
-------- ---- ----------- ----
|NACELLES|0.15| 0.1|0.75|
-------- ---- ----------- ----
