I'm trying to subset (retrieve a set of rows) a python pandas data frame by using pd.filter with a regex string to identify the columns of interest before performing a subset based on the values in those columns.
For example, this is my mock data frame:
id status status_drug_use drugA drugA_use drugB drugB_use
0 1 analgesic 0 None 1 hypertensive
1 0 analgesic 1 analgesic 1 hypertensive
2 0 analgesic 1 hypertensive 0 None
3 1 analgesic 0 None 1 analgesic
I would like all rows that contain the values in columns drugA_use or drugB_use which match the value in status_drug_use. As per the example, this would return the two rows:
id status status_drug_use drugA drugA_use drugB drugB_use
1 0 analgesic 1 analgesic 1 hypertensive
3 1 analgesic 0 None 1 analgesic
There are a few column name conventions to stick with:
status_drug_useis always there.- The matching columns (
drugA_useanddrugB_use) always follow the template<ANYTHING>_use.
Alteration
There is a second scenario, one in which I would like to perform a comparison between a user defined string eg analgesic and the two columns drugA_use and drugB_use. This is different from using the content of status_drug_use.
CodePudding user response:
Here's a way to do what you've asked:
df2 = df.assign(all_use=df.apply(
lambda x: list(x[[col for col in df.columns if col.endswith('_use') and col != 'status_drug_use']]),
axis=1)).explode(
'all_use').query('status_drug_use == all_use').drop_duplicates().drop(columns='all_use')
Input:
id status status_drug_use drugA drugA_use drugB drugB_use
0 0 1 analgesic 0 None 1 hypertensive
1 1 0 analgesic 1 analgesic 1 hypertensive
2 2 0 analgesic 1 hypertensive 0 None
3 3 1 analgesic 0 None 1 analgesic
Output:
id status status_drug_use drugA drugA_use drugB drugB_use
1 1 0 analgesic 1 analgesic 1 hypertensive
3 3 1 analgesic 0 None 1 analgesic
Explanation:
- find the subset of all columns ending in
_use(excludingstatus_drug_use) - add a column named
all_usewhose value for a given row is a list of the values in the columns ending in_use - use
explode()to add rows such that for each original row, there are now multiple rows, one for each of the values inall_usefor the original row - use
query()to select only rows wherestatus_drug_usematches the value inall_use - use
drop_duplicatesto eliminate rows in case there were multiple matches for any rows in the original dataframe (for example, if bothdrugA_useanddrugB_usecontained "analgesic" and so didstatus_drug_use) - drop the column
all_useas we no longer need it.
UPDATE: Addressing OP's question in a comment: 'Rather than using the values in column status_drug_use, how do I achieve the same output but by using a single user defined string e.g., "analgesic"?'
You can do this by having the user defined query string (call it user_defined_str) as a variable and changing the contents of query() by replacing the column name status_drug_use with the variable name with @ prepended: @user_defined_str (see the query() docs here for more detail).
user_defined_str = 'analgesic'
df3 = df.assign(all_use=df.apply(
lambda x: list(x[[col for col in df.columns if col.endswith('_use') and col != 'status_drug_use']]),
axis=1)).explode(
'all_use').query('@user_defined_str == all_use').drop_duplicates().drop(columns='all_use')
