Assume I have data in the form (As a Pandas' Data Frame):
| Index | ID | Value | Div Factor | Weighted Sum |
|---|---|---|---|---|
| 1 | 1 | 2 | 1 | |
| 2 | 1 | 3 | 2 | |
| 3 | 2 | 6 | 1 | |
| 4 | 1 | 1 | 3 | |
| 5 | 2 | 3 | 2 | |
| 6 | 2 | 9 | 3 | |
| 7 | 2 | 8 | 4 | |
| 8 | 3 | 5 | 1 | |
| 9 | 3 | 6 | 2 | |
| 10 | 1 | 8 | 4 | |
| 11 | 3 | 2 | 3 | |
| 12 | 3 | 7 | 4 |
I want to calculate the column Weighted Sum as following (For the $i$ -th row):
- Look at all values from row 1 to
i. - Sum values by groups according to the
IDvalue of each row. So we haveksum values wherekis the number of uniqueIDvalue from the row 1 toi. - Divide each sum (There are
ksum values) by the number of elements in the group. - Sum those
kvalues and divide byk(The average of the averages).
For example, let's do rows 1, 7 and 12:
Row 1
For i = 1 we have a single value hence the sum is 2 and the average of the single group is 2 and average over all groups is 2.
Row 7
For i = 7 we have only 2 unique values of ID above it: 1 and 2.
For the group of ID = 1 we have: (1 3 2) / 3 = 2.
For the group of ID = 2 we have: (8 9 3 6) / 4 = 6.5.
Then the average of averages is (2 6.5) / 2 = 4.25.
Row 12
For i = 12 we have 3 unique ID values on the rows 1:12.
For the group of ID = 1 we have: (8 1 3 2) / 4 = 3.5.
For the group of ID = 2 we have: (8 9 3 6) / 4 = 6.5.
For the group of ID = 3 we have: (7 2 6 5) / 4 = 5.
Then the average of averages is (3.5 6.5 5) / 3 = 5.
It is easy to do with a loop but for a large number of rows it is inefficient.
Is there an efficient way to do so? Maybe using apply() or transform()?
Remark: The method should be feasible for the case of ~1e7 rows and ~1e6 unique ID's.
CodePudding user response:
This should do what you're asking:
df1 = df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)
df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()
df['Weighted Sum'] = df2.mean(axis=1)
(Simplification of the last line based on comment by @Acccumulation.)
Output:
Index ID Value Div Factor Weighted Sum
0 1 1 2 1 2.000000
1 2 1 3 2 2.500000
2 3 2 6 1 4.250000
3 4 1 1 3 4.000000
4 5 2 3 2 3.250000
5 6 2 9 3 4.000000
6 7 2 8 4 4.250000
7 8 3 5 1 4.500000
8 9 3 6 2 4.666667
9 10 1 8 4 5.166667
10 11 3 2 3 4.777778
11 12 3 7 4 5.000000
Explanation:
- Use
unstack()to pivot, making uniqueIDcolumn values into a level of column labels with theValueentry of each row in the column matching theIDvalue of the original dataframe, and NaN in all other locations - Use
fillna(0)andcumsum()to replace each column with its cumulative sum - Use
notnull().astype(int)andcumsum()to replace each column with its cumulative count (where NaNs count as 0) - Take the ratio of the cumulative sum to the cumulative count; this is the cumulative average at each successive row of
Valuefor each uniqueID - For each row, take the mean of the non-NaN cumulative averages across all unique
IDvalues. This is the result described in the question.
Detailed intermediate results:
Index ID Value Div Factor
0 1 1 2 1
1 2 1 3 2
2 3 2 6 1
3 4 1 1 3
4 5 2 3 2
5 6 2 9 3
6 7 2 8 4
7 8 3 5 1
8 9 3 6 2
9 10 1 8 4
10 11 3 2 3
11 12 3 7 4
df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)
Value
ID 1 2 3
0 2.0 NaN NaN
1 3.0 NaN NaN
2 NaN 6.0 NaN
3 1.0 NaN NaN
4 NaN 3.0 NaN
5 NaN 9.0 NaN
6 NaN 8.0 NaN
7 NaN NaN 5.0
8 NaN NaN 6.0
9 8.0 NaN NaN
10 NaN NaN 2.0
11 NaN NaN 7.0
df1.fillna(0).cumsum()
Value
ID 1 2 3
0 2.0 0.0 0.0
1 5.0 0.0 0.0
2 5.0 6.0 0.0
3 6.0 6.0 0.0
4 6.0 9.0 0.0
5 6.0 18.0 0.0
6 6.0 26.0 0.0
7 6.0 26.0 5.0
8 6.0 26.0 11.0
9 14.0 26.0 11.0
10 14.0 26.0 13.0
11 14.0 26.0 20.0
df1.notnull().astype(int).cumsum()
Value
ID 1 2 3
0 1 0 0
1 2 0 0
2 2 1 0
3 3 1 0
4 3 2 0
5 3 3 0
6 3 4 0
7 3 4 1
8 3 4 2
9 4 4 2
10 4 4 3
11 4 4 4
df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()
Value
ID 1 2 3
0 2.0 NaN NaN
1 2.5 NaN NaN
2 2.5 6.0 NaN
3 2.0 6.0 NaN
4 2.0 4.5 NaN
5 2.0 6.0 NaN
6 2.0 6.5 NaN
7 2.0 6.5 5.000000
8 2.0 6.5 5.500000
9 3.5 6.5 5.500000
10 3.5 6.5 4.333333
11 3.5 6.5 5.000000
result with df['Weighted Sum'] = df2.mean(axis=1)
Index ID Value Div Factor Weighted Sum
0 1 1 2 1 2.000000
1 2 1 3 2 2.500000
2 3 2 6 1 4.250000
3 4 1 1 3 4.000000
4 5 2 3 2 3.250000
5 6 2 9 3 4.000000
6 7 2 8 4 4.250000
7 8 3 5 1 4.500000
8 9 3 6 2 4.666667
9 10 1 8 4 5.166667
10 11 3 2 3 4.777778
11 12 3 7 4 5.000000
