I have the following df:
id  header1      header2           diabetes  obesity  hypertension/high blood pressure  . . .
1   metabolism   diabetes          no        no       no
2   heart issue  heart disease     None      None     None
3   obesity      diabetes          yes       no       no
4   metabolism   had hypertension  no        no       yes
5   heart issue  heart disease     no        no       yes
6   obesity      diabetes          yes       yes      no
7   obesity      diabetes          no        no       yes
I want to write a function (ideally a lambda) that looks at header1 and header2 in each row and checks whether either cell matches part of one of the yes/no column names. Depending on whether those columns hold yes, no, or null, it should return a flag value in a new column.
For each row: if header1 or header2 has a substring match with a column name and that column holds a yes, set the flag to 2. If any of the category columns holds a yes but none of those columns matches header1 or header2, set the flag to 1. Otherwise, leave the flag blank.
Example)
attempt:
cols = [x for x in df.columns if x not in ['header1', 'header2']]
df['flag'] = df.apply(lambda x: 2 if df['header1'] or df['header2'] in cols and cols == yes, 1 elif df['header1'] not in df['header2'] in cols and cols == yes, None else
desired result:
id  header1      header2           diabetes  obesity  hypertension/high blood pressure  flag
1   metabolism   diabetes          no        no       no                                None
2   heart issue  heart disease     None      None     None                              None
3   obesity      diabetes          yes       no       no                                2
4   metabolism   had hypertension  no        no       yes                               2
5   heart issue  heart disease     no        no       yes                               1
6   obesity      diabetes          yes       yes      no                                2
7   obesity      diabetes          no        no       yes                               1
Constructor
Please note that my actual df has a dynamic amount of yes/no columns, but only two header columns.
import numpy as np
import pandas as pd

data = np.array([('metabolism', 'diabetes', 'no', 'no', 'no'),
                 ('heart issue', 'heart disease', None, None, None),
                 ('obesity', 'diabetes', 'yes', 'no', 'no'),
                 ('metabolism', 'had hypertension', 'no', 'no', 'yes'),
                 ('heart issue', 'heart disease', 'no', 'no', 'yes'),
                 ('obesity', 'diabetes', 'yes', 'yes', 'no'),
                 ('obesity', 'diabetes', 'no', 'no', 'yes')])
df = pd.DataFrame(data, columns=['header1', 'header2', 'diabetes', 'obesity',
                                 'hypertension/high blood pressure'])
cols = [x for x in df.columns if x not in ['header1', 'header2']]
CodePudding user response:
First build an index of the disease columns and a matching index of disease names. The names keep only the text before the "/", so that "hypertension" can be matched on its own.
Then apply a row-wise function that checks whether the row has any "yes" answers and, if so, searches both headers for the names of the diseases answered "yes":
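headers = ['header1', 'header2']
# Every column that is not a header is treated as a yes/no disease column.
disease_cols = df.columns.difference(headers)
# Keep only the text before "/" so "hypertension" matches on its own.
disease_names = disease_cols.str.split('/').str[0]

def get_flag(row):
    # Boolean mask over the disease columns; None compares as False.
    yes = row[disease_cols].eq('yes')
    if yes.any():
        # 2 if a "yes" disease name appears in either header, otherwise 1.
        return 2 if row[headers].str.contains('|'.join(disease_names[yes])).any() else 1
    else:
        return np.nan

df['flag'] = df.apply(get_flag, axis=1)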
Output:
    header1      header2        diabetes  obesity  hypertension/high blood pressure  flag
0   metabolism   diabetes       no        no       no                                NaN
1   heart issue  heart disease  no        no       no                                NaN
2   obesity      diabetes       yes       no       no                                2.0
3   metabolism   hypertension   no        no       yes                               2.0
4   heart issue  heart disease  no        no       yes                               1.0
5   obesity      diabetes       yes       yes      no                                2.0
6   obesity      diabetes       no        no       yes                               1.0
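If the frame is large, the same rule can probably be expressed without a row-wise apply. The following is only a sketch under a few assumptions that are not in the original answer: every part of a column name separated by "/" is treated as a keyword (so "high blood pressure" would match as well as "hypertension"), keywords are passed through re.escape in case a column name contains regex characters, and the flag is stored as a nullable Int64 so that blanks show as <NA> rather than turning the column into floats. The names yes, header_text, and keyword_hit are illustrative.

```python
import re

import pandas as pd

df = pd.DataFrame(
    [
        ("metabolism", "diabetes", "no", "no", "no"),
        ("heart issue", "heart disease", None, None, None),
        ("obesity", "diabetes", "yes", "no", "no"),
        ("metabolism", "had hypertension", "no", "no", "yes"),
        ("heart issue", "heart disease", "no", "no", "yes"),
        ("obesity", "diabetes", "yes", "yes", "no"),
        ("obesity", "diabetes", "no", "no", "yes"),
    ],
    columns=["header1", "header2", "diabetes", "obesity",
             "hypertension/high blood pressure"],
)

headers = ["header1", "header2"]
disease_cols = [c for c in df.columns if c not in headers]

# True where a disease column holds "yes" (None compares as False).
yes = df[disease_cols].eq("yes")

# Both headers joined into one string per row for the substring search.
header_text = df[headers].fillna("").agg(" ".join, axis=1)

# One boolean column per disease: True where any of its keywords
# (assumption: every "/"-separated part of the column name) occurs in the headers.
keyword_hit = pd.DataFrame(
    {
        col: header_text.str.contains(
            "|".join(re.escape(part.strip()) for part in col.split("/")),
            regex=True,
        )
        for col in disease_cols
    }
)

any_yes = yes.any(axis=1)                 # at least one "yes" in the row
matched = (yes & keyword_hit).any(axis=1)  # a "yes" column also named in a header

# Start blank, then fill 1 and overwrite with 2 where a keyword matched.
flag = pd.Series(pd.NA, index=df.index, dtype="Int64")
flag.loc[any_yes] = 1
flag.loc[matched] = 2
df["flag"] = flag
```

On the sample data this should flag the same rows as the apply version. The main difference is the dtype: the flag column holds integers and <NA> instead of floats and NaN.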
