I have tried to tackle this for quite some time, but haven't been able to get a pythonic way around it by using the built-in groupby and transform methods from pandas.
The goal is to group the data by columns ex_date and id, then within the groups identified, standardize the column called ref_value_1 against the value found in the row where df['calc_date'] == df['ex_date']
Here's a sample input:
df = pd.DataFrame({'calc_date': ['1/1/2021', '2/1/2021', '3/1/2021', '1/1/2021', '2/1/2021', '3/1/2021', '1/1/2021', '2/1/2021', '3/1/2021', '1/1/2021', '2/1/2021', '3/1/2021'], 'ex_date': ['2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021'], 'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], 'ref_value_1': [1.5, 3.0, 4.5, 5.0, 10.0, 15.0, 15.0, 40.0, 60.0, 75.0, 100.0, 120.0], 'bins': [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]})
which looks like:
| calc_date | ex_date | id | ref_value_1 | bins |
|---|---|---|---|---|
| 1/1/2021 | 2/1/2021 | 1 | 1.5 | 1 |
| 2/1/2021 | 2/1/2021 | 1 | 3 | 1 |
| 3/1/2021 | 2/1/2021 | 1 | 4.5 | 1 |
| 1/1/2021 | 2/1/2021 | 2 | 5 | 1 |
| 2/1/2021 | 2/1/2021 | 2 | 10 | 1 |
| 3/1/2021 | 2/1/2021 | 2 | 15 | 1 |
| 1/1/2021 | 2/1/2021 | 3 | 15 | 2 |
| 2/1/2021 | 2/1/2021 | 3 | 40 | 2 |
| 3/1/2021 | 2/1/2021 | 3 | 60 | 2 |
| 1/1/2021 | 2/1/2021 | 4 | 75 | 3 |
| 2/1/2021 | 2/1/2021 | 4 | 100 | 3 |
| 3/1/2021 | 2/1/2021 | 4 | 120 | 3 |
And expected output:
| calc_date | ex_date | id | ref_value_1 | bins | standardized_val |
|---|---|---|---|---|---|
| 1/1/2021 | 2/1/2021 | 1 | 1.5 | 1 | 0.5 |
| 2/1/2021 | 2/1/2021 | 1 | 3 | 1 | 1 |
| 3/1/2021 | 2/1/2021 | 1 | 4.5 | 1 | 1.5 |
| 1/1/2021 | 2/1/2021 | 2 | 5 | 1 | 0.5 |
| 2/1/2021 | 2/1/2021 | 2 | 10 | 1 | 1 |
| 3/1/2021 | 2/1/2021 | 2 | 15 | 1 | 1.5 |
| 1/1/2021 | 2/1/2021 | 3 | 15 | 2 | 0.375 |
| 2/1/2021 | 2/1/2021 | 3 | 40 | 2 | 1 |
| 3/1/2021 | 2/1/2021 | 3 | 60 | 2 | 1.5 |
| 1/1/2021 | 2/1/2021 | 4 | 75 | 3 | 0.75 |
| 2/1/2021 | 2/1/2021 | 4 | 100 | 3 | 1 |
| 3/1/2021 | 2/1/2021 | 4 | 120 | 3 | 1.2 |
CodePudding user response:
You can mask the non matching values and fill per group using groupby transform to get the reference. Then simply divide your data with the reference.
ref = df['ref_value_1'].where(df['calc_date'].eq(df['ex_date'])).groupby(df['id']).transform('first')
df['standardized_val'] = df['ref_value_1'].div(ref)
Output:
calc_date ex_date id ref_value_1 bins standardized_val
0 1/1/2021 2/1/2021 1 1.5 1 0.500
1 2/1/2021 2/1/2021 1 3.0 1 1.000
2 3/1/2021 2/1/2021 1 4.5 1 1.500
3 1/1/2021 2/1/2021 2 5.0 1 0.500
4 2/1/2021 2/1/2021 2 10.0 1 1.000
5 3/1/2021 2/1/2021 2 15.0 1 1.500
6 1/1/2021 2/1/2021 3 15.0 2 0.375
7 2/1/2021 2/1/2021 3 40.0 2 1.000
8 3/1/2021 2/1/2021 3 60.0 2 1.500
9 1/1/2021 2/1/2021 4 75.0 3 0.750
10 2/1/2021 2/1/2021 4 100.0 3 1.000
11 3/1/2021 2/1/2021 4 120.0 3 1.200
CodePudding user response:
You can apply a lambda function the groupby object where each 'ref_value_1' in each group is divided by the 'ref_value_1' where the corresponding 'calc_date'=='ex_date':
df['standardized_val'] = df.groupby(['ex_date','id']).apply(lambda x: x['ref_value_1'] / x.loc[x['calc_date']==x['ex_date'], 'ref_value_1'].iloc[0]).droplevel([0,1])
Output:
calc_date ex_date id ref_value_1 bins standardized_val
0 1/1/2021 2/1/2021 1 1.5 1 0.500
1 2/1/2021 2/1/2021 1 3.0 1 1.000
2 3/1/2021 2/1/2021 1 4.5 1 1.500
3 1/1/2021 2/1/2021 2 5.0 1 0.500
4 2/1/2021 2/1/2021 2 10.0 1 1.000
5 3/1/2021 2/1/2021 2 15.0 1 1.500
6 1/1/2021 2/1/2021 3 15.0 2 0.375
7 2/1/2021 2/1/2021 3 40.0 2 1.000
8 3/1/2021 2/1/2021 3 60.0 2 1.500
9 1/1/2021 2/1/2021 4 75.0 3 0.750
10 2/1/2021 2/1/2021 4 100.0 3 1.000
11 3/1/2021 2/1/2021 4 120.0 3 1.200
