Create lambda function to apply to select df columns

I have the following df:

id  header1      header2           diabetes  obesity  hypertension/high blood pressure  ...
 1  metabolism   diabetes          no        no       no
 2  heart issue  heart disease     None      None     None
 3  obesity      diabetes          yes       no       no
 4  metabolism   had hypertension  no        no       yes
 5  heart issue  heart disease     no        no       yes
 6  obesity      diabetes          yes       yes      no
 7  obesity      diabetes          no        no       yes

I want to create a lambda function that iterates through header1 and header2 and checks whether either cell matches a substring of one of the other column names. Depending on whether that column contains yes, no, or null, it should return a flag value in a new column.

For each row, if the text in header1 or header2 matches (as a substring) the name of a category column and that column contains yes, set the flag to 2. If any category column contains yes but none of those columns matches header1 or header2, set the flag to 1. Otherwise, leave the flag blank.

My attempt (not working):

cols = [x for x in df.columns if x not in ['header1', 'header2']]

df['flag'] = df.apply(lambda x: 2 if df['header1'] or df['header2'] in cols and cols == yes, 1 elif df['header1'] not in df['header2'] in cols and cols == yes, None else

desired result:

id  header1      header2           diabetes  obesity  hypertension/high blood pressure  flag
 1  metabolism   diabetes          no        no       no                                None
 2  heart issue  heart disease     None      None     None                              None
 3  obesity      diabetes          yes       no       no                                2
 4  metabolism   had hypertension  no        no       yes                               2
 5  heart issue  heart disease     no        no       yes                               1
 6  obesity      diabetes          yes       yes      no                                2
 7  obesity      diabetes          no        no       yes                               1

Constructor

Please note that my actual df has a variable number of yes/no columns, but only two header columns.

import numpy as np
import pandas as pd

data = np.array([('metabolism','diabetes','no','no', 'no'),
                 ('heart issue', 'heart disease', None,None,None),
                 ('obesity','diabetes','yes','no','no'),
                 ('metabolism',' had hypertension','no','no','yes'),
                 ('heart issue', 'heart disease','no','no','yes'),
                 ('obesity', 'diabetes','yes','yes', 'no'),
                 ('obesity', 'diabetes', 'no','no', 'yes')])


df = pd.DataFrame(data, columns=['header1', 'header2','diabetes','obesity','hypertension/high blood pressure'])

cols = [x for x in df.columns if x not in ['header1', 'header2']]

CodePudding user response:

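First, build an Index of the disease columns and a matching Index of disease names. The names keep only the text before the "/", so that "hypertension/high blood pressure" is matched as "hypertension".

Then apply a row-wise function that first checks whether any disease column is "yes" and, if so, searches header1 and header2 for the names of the diseases answered "yes".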
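As a rough illustration of those two steps on a single row, the sketch below uses the column labels from the question's constructor and its fourth row. The variable names `columns`, `row` and `pattern` are only for illustration and are not part of the answer's code.

```python
import pandas as pd

# Column labels taken from the question's constructor
columns = pd.Index(['header1', 'header2', 'diabetes', 'obesity',
                    'hypertension/high blood pressure'])
headers = ['header1', 'header2']

# difference() returns the remaining labels in sorted order
disease_cols = columns.difference(headers)
# Keep only the text before the first '/'
disease_names = disease_cols.str.split('/').str[0]

print(list(disease_cols))   # ['diabetes', 'hypertension/high blood pressure', 'obesity']
print(list(disease_names))  # ['diabetes', 'hypertension', 'obesity']

# Fourth row of the question: only the hypertension column is 'yes'
row = pd.Series(['metabolism', ' had hypertension', 'no', 'no', 'yes'],
                index=columns)
yes = row[disease_cols].eq('yes').to_numpy()     # boolean mask, one entry per disease column
pattern = '|'.join(disease_names[yes])           # regex alternation of the 'yes' disease names
print(pattern)                                   # hypertension
print(row[headers].str.contains(pattern).any())  # True
```

Because the mask and the names are built from the same sorted Index, position i of `yes` always refers to position i of `disease_names`.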

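headers = ['header1', 'header2']
# Every non-header column is treated as a yes/no disease column (difference() sorts the labels)
disease_cols = df.columns.difference(headers)
# Keep the text before '/', e.g. 'hypertension/high blood pressure' -> 'hypertension'
disease_names = disease_cols.str.split('/').str[0]

def get_flag(row):
    # Boolean mask of the disease columns answered 'yes' in this row
    yes = row[disease_cols].eq('yes').to_numpy()
    if not yes.any():
        return np.nan
    # 2 if a 'yes' disease name appears in either header, otherwise 1
    pattern = '|'.join(disease_names[yes])
    return 2 if row[headers].str.contains(pattern).any() else 1


df['flag'] = df.apply(get_flag, axis=1)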

Output:

       header1        header2 diabetes obesity hypertension/high blood pressure   flag
0   metabolism       diabetes       no      no                       no           NaN
1  heart issue  heart disease       no      no                       no           NaN
2      obesity       diabetes      yes      no                       no           2.0
3   metabolism   hypertension       no      no                      yes           2.0
4  heart issue  heart disease       no      no                      yes           1.0
5      obesity       diabetes      yes     yes                       no           2.0
6      obesity       diabetes       no      no                      yes           1.0
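
For anyone who wants to check the result end to end, here is a self-contained sketch that rebuilds the question's data and compares the flags with the desired result. Two details are my own assumptions rather than part of the answer above: the data is passed as a plain list of tuples instead of `np.array`, and the disease names are passed through `re.escape` in case a column name contains regex metacharacters such as parentheses.

```python
import re

import numpy as np
import pandas as pd

# Data from the question's constructor
data = [('metabolism', 'diabetes', 'no', 'no', 'no'),
        ('heart issue', 'heart disease', None, None, None),
        ('obesity', 'diabetes', 'yes', 'no', 'no'),
        ('metabolism', ' had hypertension', 'no', 'no', 'yes'),
        ('heart issue', 'heart disease', 'no', 'no', 'yes'),
        ('obesity', 'diabetes', 'yes', 'yes', 'no'),
        ('obesity', 'diabetes', 'no', 'no', 'yes')]
df = pd.DataFrame(data, columns=['header1', 'header2', 'diabetes', 'obesity',
                                 'hypertension/high blood pressure'])

headers = ['header1', 'header2']
disease_cols = df.columns.difference(headers)
disease_names = disease_cols.str.split('/').str[0]

def get_flag(row):
    # Mask of disease columns answered 'yes' (None compares as False)
    yes = row[disease_cols].eq('yes').to_numpy()
    if not yes.any():
        return np.nan
    # Assumption: escape the names so they are matched literally, not as regex
    pattern = '|'.join(re.escape(name) for name in disease_names[yes])
    return 2 if row[headers].str.contains(pattern).any() else 1

df['flag'] = df.apply(get_flag, axis=1)

# Compare with the desired result from the question (0 stands in for blank)
flags = df['flag'].fillna(0).tolist()
assert flags == [0, 0, 2, 2, 1, 2, 1]
print(df['flag'].tolist())
```

The function never refers to a disease column by name, so it should work unchanged with a variable number of yes/no columns, as long as every non-header column is one of them.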