I have three dfs. df1 contains 46 columns. df2 and df3 contain 41 columns and contain threshold values for columns in df1 which indicate if it needs a repeat or repeat with addition. Below are simplified examples
df1:
|Name | A | B | C |.......
------------------------------
0|ID1 | 10 | 2 | 3 |
1|ID2 | 400 | 1 | 6 |
2|ID3 | 7 | 8 | 9 |
3|ID4 | 12 | 300 | 55 |
4|ID5 | 0 | 1 | 2 |
df2:
| A | B | C |.......
------------------------------
Repeat | 10 | 2 | 50 |
df3:
| A | B | C |.......
------------------------------
Repeat w Addition| 100 | 200 | 500 |
what I'd like to do is create a new column in df1 with the values "Repeat", "Repeat with Addition" or "No" based on the following conditions
for each row if any value in cols A-C is greater than repeat addition = Repeat Addition for each row if any value in cols A-C is greater than repeat but less than Repeat Addition= Repeat else No
desired output:
df1:
|Name | A | B | C |.......|Repeat Required?|
--------------------------------------------------
0|ID1 | 10 | 2 | 3 |.......| Repeat
1|ID2 | 400 | 1 | 6 |.......| Repeat with Addition
2|ID3 | 7 | 8 | 9 |.......| Repeat
3|ID4 | 12 | 300 | 55 |.......| Repeat with Addition
4|ID5 | 0 | 1 | 2 |.......| No
what I have so far:
I tried using a function with np.select to fill the column but it produces a bunch of No values when that is not correct
def repeat_required(df):
conds = [df >= df3.loc["Repeat w Addition"], df>= df2.loc["Repeat"]]
labels = ['Repeat with Dilution', 'Repeat']
return np.select(conds, labels, default='No')
df1["Repeat Required?"]=""
df1["Repeat Required?"]=repeat_required(df1.iloc[:,4:-1]) #the first 4 columns contain strings
CodePudding user response:
You're right that you want to use np.select, but the conditions you need to provide are Boolean Series that are the same length as df1. To do this, you need to compare with the rows in df2 and df3 as Series (so that it aligns on the columns) and then check if any of the values in df1 satisfy your conditions.
You can manually specify the columns to compare, as I do in a list below, or you can not specify anything and leverage the fact that pandas will automatically align for the .ge comparison so it would only consider overlapping columns for an any check.
import numpy as np
cols = ['A', 'B', 'C']
conds = [df1[cols].ge(df3[cols].loc['Repeat w Addition']).any(1),
df1[cols].ge(df2[cols].loc['Repeat']).any(1)]
choices = ['Repeat w Addition', 'Repeat']
df1['Repeat Required'] = np.select(conds, choices, default='No')
print(df1)
Name A B C Repeat Required
0 ID1 10 2 3 Repeat
1 ID2 400 1 6 Repeat w Addition
2 ID3 7 8 9 Repeat
3 ID4 12 300 55 Repeat w Addition
4 ID5 0 1 2 No
