Home > Back-end >  Fill new column in df based on many conditions in df2 and df3
Fill new column in df based on many conditions in df2 and df3

Time:01-07

I have three dfs. df1 contains 46 columns. df2 and df3 contain 41 columns and contain threshold values for columns in df1 which indicate if it needs a repeat or repeat with addition. Below are simplified examples

df1:

 |Name | A   | B   | C  |.......
------------------------------
0|ID1  | 10  | 2   | 3  |
1|ID2  | 400 | 1   | 6  |
2|ID3  | 7   | 8   | 9  |
3|ID4  | 12  | 300 | 55 | 
4|ID5  | 0   | 1   | 2  | 

df2:

       | A  |  B  | C |.......
------------------------------
Repeat | 10 | 2  | 50 |

df3:

                 | A    |  B   | C   |.......
------------------------------
Repeat w Addition| 100  | 200  | 500 |

what I'd like to do is create a new column in df1 with the values "Repeat", "Repeat with Addition" or "No" based on the following conditions

for each row if any value in cols A-C is greater than repeat addition = Repeat Addition for each row if any value in cols A-C is greater than repeat but less than Repeat Addition= Repeat else No

desired output:

df1:

 |Name | A   | B   | C  |.......|Repeat Required?|
--------------------------------------------------
0|ID1  | 10  | 2   | 3  |.......|     Repeat
1|ID2  | 400 | 1   | 6  |.......|     Repeat with Addition
2|ID3  | 7   | 8   | 9  |.......|     Repeat
3|ID4  | 12  | 300 | 55 |.......|     Repeat with Addition
4|ID5  | 0   | 1   | 2  |.......|     No

what I have so far:

I tried using a function with np.select to fill the column but it produces a bunch of No values when that is not correct

def repeat_required(df):
    conds = [df >= df3.loc["Repeat w Addition"], df>= df2.loc["Repeat"]]
    labels = ['Repeat with Dilution', 'Repeat']
    return np.select(conds, labels, default='No')

df1["Repeat Required?"]=""
df1["Repeat Required?"]=repeat_required(df1.iloc[:,4:-1]) #the first 4 columns contain strings 

CodePudding user response:

You're right that you want to use np.select, but the conditions you need to provide are Boolean Series that are the same length as df1. To do this, you need to compare with the rows in df2 and df3 as Series (so that it aligns on the columns) and then check if any of the values in df1 satisfy your conditions.

You can manually specify the columns to compare, as I do in a list below, or you can not specify anything and leverage the fact that pandas will automatically align for the .ge comparison so it would only consider overlapping columns for an any check.

import numpy as np

cols = ['A', 'B', 'C']

conds = [df1[cols].ge(df3[cols].loc['Repeat w Addition']).any(1), 
         df1[cols].ge(df2[cols].loc['Repeat']).any(1)]
choices = ['Repeat w Addition', 'Repeat']

df1['Repeat Required'] = np.select(conds, choices, default='No')

print(df1)
  Name    A    B   C    Repeat Required
0  ID1   10    2   3             Repeat
1  ID2  400    1   6  Repeat w Addition
2  ID3    7    8   9             Repeat
3  ID4   12  300  55  Repeat w Addition
4  ID5    0    1   2                 No
  •  Tags:  
  • Related