I have an initial dataframe
df1 =
--- --- ---
| A| B| C|
--- --- ---
| 1| 1| 10|
| 1| 2| 11|
| 1| 2| 12|
| 3| 1| 13|
| 2| 1| 14|
| 2| 1| 15|
| 2| 1| 16|
| 4| 1| 17|
| 4| 2| 18|
| 4| 3| 19|
| 4| 4| 19|
| 4| 5| 20|
| 4| 5| 20|
--- --- ---
Using pyspark I coded the dataframe with a window function using a sum taking into account the column 'A' and taking into account the column 'B' sorted.
spec = Window.partitionBy('A').orderBy('B')
df1 = df1.withColumn('D',sum('C').over(spec))
df1.show()
--- --- --- -----
| A| B| C| D|
--- --- --- -----
| 1| 1| 10| 10.0|
| 1| 2| 11| 33.0|
| 1| 2| 12| 33.0|
| 2| 1| 14| 45.0|
| 2| 1| 15| 45.0|
| 2| 1| 16| 45.0|
| 3| 1| 13| 13.0|
| 4| 1| 17| 17.0|
| 4| 2| 18| 35.0|
| 4| 3| 19| 54.0|
| 4| 4| 19| 73.0|
| 4| 5| 20|113.0|
| 4| 5| 20|113.0|
--- --- --- -----
Is it possible to do the same calculation using Pandas Dataframe?
I tried using
df['D'] = df.sort_values(['A','B']).groupby(['A', 'B'])['C'].transform('cumsum')
but it is not the same result
Thanks
CodePudding user response:
In pandas we can groupby sum on A and B. Then groupby cumsum just over A. To add results back to the DataFrame rename to the new column name then join the results back to the DataFrame on the initial group keys ['A', 'B']:
df1 = df1.join(
df1.groupby(by=['A', 'B'])['C'].sum()
.groupby(level='A').cumsum()
.rename('D'),
on=['A', 'B']
)
df1:
A B C D
0 1 1 10 10
1 1 2 11 33
2 1 2 12 33
3 3 1 13 13
4 2 1 14 45
5 2 1 15 45
6 2 1 16 45
7 4 1 17 17
8 4 2 18 35
9 4 3 19 54
10 4 4 19 73
11 4 5 20 113
12 4 5 20 113
Setup:
import pandas as pd
df1 = pd.DataFrame({
'A': [1, 1, 1, 3, 2, 2, 2, 4, 4, 4, 4, 4, 4],
'B': [1, 2, 2, 1, 1, 1, 1, 1, 2, 3, 4, 5, 5],
'C': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 19, 20, 20]
})
CodePudding user response:
You can sort, then cumsum within 'A', then groupby max within ['A', 'B'] groups using transform so you can assign the result back.
df1['D'] = (df1.sort_values(['A', 'B'])
.groupby('A')['C'].cumsum()
.groupby([df1['A'], df1['B']]).transform('max'))
A B C D
0 1 1 10 10
1 1 2 11 33
2 1 2 12 33
3 3 1 13 13
4 2 1 14 45
5 2 1 15 45
6 2 1 16 45
7 4 1 17 17
8 4 2 18 35
9 4 3 19 54
10 4 4 19 73
11 4 5 20 113
12 4 5 20 113
