Pandas - Groupby and Standardize-CodePudding

I have tried to tackle this for quite some time, but haven't been able to get a pythonic way around it by using the built-in groupby and transform methods from pandas.

The goal is to group the data by columns ex_date and id, then within the groups identified, standardize the column called ref_value_1 against the value found in the row where df['calc_date'] == df['ex_date']

Here's a sample input:

df = pd.DataFrame({'calc_date': ['1/1/2021', '2/1/2021', '3/1/2021', '1/1/2021', '2/1/2021', '3/1/2021', '1/1/2021', '2/1/2021', '3/1/2021', '1/1/2021', '2/1/2021', '3/1/2021'], 'ex_date': ['2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021'], 'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], 'ref_value_1': [1.5, 3.0, 4.5, 5.0, 10.0, 15.0, 15.0, 40.0, 60.0, 75.0, 100.0, 120.0], 'bins': [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]})

which looks like:

calc_date	ex_date	id	ref_value_1	bins
1/1/2021	2/1/2021	1	1.5	1
2/1/2021	2/1/2021	1	3	1
3/1/2021	2/1/2021	1	4.5	1
1/1/2021	2/1/2021	2	5	1
2/1/2021	2/1/2021	2	10	1
3/1/2021	2/1/2021	2	15	1
1/1/2021	2/1/2021	3	15	2
2/1/2021	2/1/2021	3	40	2
3/1/2021	2/1/2021	3	60	2
1/1/2021	2/1/2021	4	75	3
2/1/2021	2/1/2021	4	100	3
3/1/2021	2/1/2021	4	120	3

And expected output:

calc_date	ex_date	id	ref_value_1	bins	standardized_val
1/1/2021	2/1/2021	1	1.5	1	0.5
2/1/2021	2/1/2021	1	3	1	1
3/1/2021	2/1/2021	1	4.5	1	1.5
1/1/2021	2/1/2021	2	5	1	0.5
2/1/2021	2/1/2021	2	10	1	1
3/1/2021	2/1/2021	2	15	1	1.5
1/1/2021	2/1/2021	3	15	2	0.375
2/1/2021	2/1/2021	3	40	2	1
3/1/2021	2/1/2021	3	60	2	1.5
1/1/2021	2/1/2021	4	75	3	0.75
2/1/2021	2/1/2021	4	100	3	1
3/1/2021	2/1/2021	4	120	3	1.2

CodePudding user response：

You can mask the non matching values and fill per group using groupby transform to get the reference. Then simply divide your data with the reference.

ref = df['ref_value_1'].where(df['calc_date'].eq(df['ex_date'])).groupby(df['id']).transform('first')

df['standardized_val'] = df['ref_value_1'].div(ref)

Output:

   calc_date   ex_date  id  ref_value_1  bins  standardized_val
0   1/1/2021  2/1/2021   1          1.5     1             0.500
1   2/1/2021  2/1/2021   1          3.0     1             1.000
2   3/1/2021  2/1/2021   1          4.5     1             1.500
3   1/1/2021  2/1/2021   2          5.0     1             0.500
4   2/1/2021  2/1/2021   2         10.0     1             1.000
5   3/1/2021  2/1/2021   2         15.0     1             1.500
6   1/1/2021  2/1/2021   3         15.0     2             0.375
7   2/1/2021  2/1/2021   3         40.0     2             1.000
8   3/1/2021  2/1/2021   3         60.0     2             1.500
9   1/1/2021  2/1/2021   4         75.0     3             0.750
10  2/1/2021  2/1/2021   4        100.0     3             1.000
11  3/1/2021  2/1/2021   4        120.0     3             1.200

CodePudding user response：

You can apply a lambda function the groupby object where each 'ref_value_1' in each group is divided by the 'ref_value_1' where the corresponding 'calc_date'=='ex_date':

df['standardized_val'] = df.groupby(['ex_date','id']).apply(lambda x: x['ref_value_1'] / x.loc[x['calc_date']==x['ex_date'], 'ref_value_1'].iloc[0]).droplevel([0,1])

Output:

   calc_date   ex_date  id  ref_value_1  bins  standardized_val
0   1/1/2021  2/1/2021   1          1.5     1             0.500
1   2/1/2021  2/1/2021   1          3.0     1             1.000
2   3/1/2021  2/1/2021   1          4.5     1             1.500
3   1/1/2021  2/1/2021   2          5.0     1             0.500
4   2/1/2021  2/1/2021   2         10.0     1             1.000
5   3/1/2021  2/1/2021   2         15.0     1             1.500
6   1/1/2021  2/1/2021   3         15.0     2             0.375
7   2/1/2021  2/1/2021   3         40.0     2             1.000
8   3/1/2021  2/1/2021   3         60.0     2             1.500
9   1/1/2021  2/1/2021   4         75.0     3             0.750
10  2/1/2021  2/1/2021   4        100.0     3             1.000
11  3/1/2021  2/1/2021   4        120.0     3             1.200