I have an example of pyspark dataframe:
| X | Y | Z | DATE |
|---|---|---|---|
| 23 | 41 | 63 | 2016-01-01 |
| 23 | 41 | 5 | 2016-01-01 |
| 23 | 41 | 75 | 2016-01-01 |
| 23 | 41 | 46 | 2016-12-01 |
| 23 | 41 | 23 | 2016-12-01 |
| 27 | 41 | 5 | 2016-01-01 |
| 27 | 41 | 75 | 2016-01-01 |
| 27 | 41 | 85 | 2016-01-01 |
| 27 | 41 | 71 | 2016-01-01 |
What I want is to count rows with the same X, Y and DATE columns and store the value in a new column.
Final dataframe should looks like this:
| X | Y | Z | DATE | SUM |
|---|---|---|---|---|
| 23 | 41 | 63 | 2016-01-01 | 3 |
| 23 | 41 | 5 | 2016-01-01 | 3 |
| 23 | 41 | 75 | 2016-01-01 | 3 |
| 23 | 41 | 46 | 2016-12-01 | 2 |
| 23 | 41 | 23 | 2016-12-01 | 2 |
| 27 | 41 | 5 | 2016-01-01 | 4 |
| 27 | 41 | 75 | 2016-01-01 | 4 |
| 27 | 41 | 85 | 2016-01-01 | 4 |
| 27 | 41 | 71 | 2016-01-01 | 4 |
CodePudding user response:
This might help (assuming you mistyped and wanted Count instead of sum):
from pyspark.sql.functions import count
df = df.withColumn("Count", count("*").over(Window.partitionBy("X", "Y", "Date")))
