Apply a Function per Row of Sub Groups of the Data **Above** the Current Row-CodePudding

Assume I have data in the form (As a Pandas' Data Frame):

Index	ID	Value	Div Factor
1	1	2	1
2	1	3	2
3	2	6	1
4	1	1	3
5	2	3	2
6	2	9	3
7	2	8	4
8	3	5	1
9	3	6	2
10	1	8	4
11	3	2	3
12	3	7	4

I want to calculate the column Weighted Sum as following (For the $i$ -th row):

Look at all values from row 1 to i.
Sum values by groups according to the ID value of each row. So we have k sum values where k is the number of unique ID value from the row 1 to i.
Divide each sum (There are k sum values) by the number of elements in the group.
Sum those k values and divide by k (The average of the averages).

For example, let's do rows 1, 7 and 12:

Row 1

For i = 1 we have a single value hence the sum is 2 and the average of the single group is 2 and average over all groups is 2.

Row 7

For i = 7 we have only 2 unique values of ID above it: 1 and 2.
For the group of ID = 1 we have: (1 3 2) / 3 = 2.
For the group of ID = 2 we have: (8 9 3 6) / 4 = 6.5.
Then the average of averages is (2 6.5) / 2 = 4.25.

Row 12

For i = 12 we have 3 unique ID values on the rows 1:12.
For the group of ID = 1 we have: (8 1 3 2) / 4 = 3.5.
For the group of ID = 2 we have: (8 9 3 6) / 4 = 6.5.
For the group of ID = 3 we have: (7 2 6 5) / 4 = 5.
Then the average of averages is (3.5 6.5 5) / 3 = 5.

It is easy to do with a loop but for a large number of rows it is inefficient. Is there an efficient way to do so? Maybe using apply() or transform()?

Remark: The method should be feasible for the case of ~1e7 rows and ~1e6 unique ID's.

CodePudding user response：

This should do what you're asking:

df1 = df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)
df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()
df['Weighted Sum'] = df2.mean(axis=1)

(Simplification of the last line based on comment by @Acccumulation.)

Output:

    Index  ID  Value  Div Factor  Weighted Sum
0       1   1      2           1      2.000000
1       2   1      3           2      2.500000
2       3   2      6           1      4.250000
3       4   1      1           3      4.000000
4       5   2      3           2      3.250000
5       6   2      9           3      4.000000
6       7   2      8           4      4.250000
7       8   3      5           1      4.500000
8       9   3      6           2      4.666667
9      10   1      8           4      5.166667
10     11   3      2           3      4.777778
11     12   3      7           4      5.000000

Explanation:

Use unstack() to pivot, making unique ID column values into a level of column labels with the Value entry of each row in the column matching the ID value of the original dataframe, and NaN in all other locations
Use fillna(0) and cumsum() to replace each column with its cumulative sum
Use notnull().astype(int) and cumsum() to replace each column with its cumulative count (where NaNs count as 0)
Take the ratio of the cumulative sum to the cumulative count; this is the cumulative average at each successive row of Value for each unique ID
For each row, take the mean of the non-NaN cumulative averages across all unique ID values. This is the result described in the question.

Detailed intermediate results:

    Index  ID  Value  Div Factor
0       1   1      2           1
1       2   1      3           2
2       3   2      6           1
3       4   1      1           3
4       5   2      3           2
5       6   2      9           3
6       7   2      8           4
7       8   3      5           1
8       9   3      6           2
9      10   1      8           4
10     11   3      2           3
11     12   3      7           4
df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)
   Value
ID     1    2    3
0    2.0  NaN  NaN
1    3.0  NaN  NaN
2    NaN  6.0  NaN
3    1.0  NaN  NaN
4    NaN  3.0  NaN
5    NaN  9.0  NaN
6    NaN  8.0  NaN
7    NaN  NaN  5.0
8    NaN  NaN  6.0
9    8.0  NaN  NaN
10   NaN  NaN  2.0
11   NaN  NaN  7.0
df1.fillna(0).cumsum()
   Value
ID     1     2     3
0    2.0   0.0   0.0
1    5.0   0.0   0.0
2    5.0   6.0   0.0
3    6.0   6.0   0.0
4    6.0   9.0   0.0
5    6.0  18.0   0.0
6    6.0  26.0   0.0
7    6.0  26.0   5.0
8    6.0  26.0  11.0
9   14.0  26.0  11.0
10  14.0  26.0  13.0
11  14.0  26.0  20.0
df1.notnull().astype(int).cumsum()
   Value
ID     1  2  3
0      1  0  0
1      2  0  0
2      2  1  0
3      3  1  0
4      3  2  0
5      3  3  0
6      3  4  0
7      3  4  1
8      3  4  2
9      4  4  2
10     4  4  3
11     4  4  4
df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()
   Value
ID     1    2         3
0    2.0  NaN       NaN
1    2.5  NaN       NaN
2    2.5  6.0       NaN
3    2.0  6.0       NaN
4    2.0  4.5       NaN
5    2.0  6.0       NaN
6    2.0  6.5       NaN
7    2.0  6.5  5.000000
8    2.0  6.5  5.500000
9    3.5  6.5  5.500000
10   3.5  6.5  4.333333
11   3.5  6.5  5.000000
result with df['Weighted Sum'] = df2.mean(axis=1)
    Index  ID  Value  Div Factor  Weighted Sum
0       1   1      2           1      2.000000
1       2   1      3           2      2.500000
2       3   2      6           1      4.250000
3       4   1      1           3      4.000000
4       5   2      3           2      3.250000
5       6   2      9           3      4.000000
6       7   2      8           4      4.250000
7       8   3      5           1      4.500000
8       9   3      6           2      4.666667
9      10   1      8           4      5.166667
10     11   3      2           3      4.777778
11     12   3      7           4      5.000000