I have a pandas dataframe with 100 columns.
df.columns = ['casette', 'a true', 'b true', 'blah1', 'blah2', ..... 'a card', 'b card']
is there a better way to filter columns based on a list substrings
eg. something along the lines of:
df = df[[x for x in df.columns if any(['true', 'cassette', 'card'] not in x)]]
instead of:
df = df[[x for x in df.columns if 'true' not in x and 'cassette' not in x and 'card' not in x]]
to get:
df.columns = ['blah1', 'blah2', .....]
any tips would be appreciated cheers
CodePudding user response:
You should use filter. Something like:
target_list = ['true', 'cassete', 'card']
df.filter(regex="|".join(target_list), axis=1)
CodePudding user response:
You can write it using any()
>>> [ x for x in df.columns if not any(s in x for s in ['true', 'cassette', 'card']) ]
['blah1', 'blah2']
or all()
>>> [ x for x in df.columns if all(s not in x for s in ['true', 'cassette', 'card']) ]
['blah1', 'blah2']
However - you don't save much over your original effort.
You can .filter() the columns by regex
>>> df.filter(regex='^(?:(?!cassette|true|card).) $').columns
Index(['blah1', 'blah2'], dtype='object')
CodePudding user response:
You can also use str.contains to filter out column names:
unwanted_words = ['true', 'cassette', 'card']
out = df.columns[~df.columns.str.contains('|'.join(unwanted_words))]
Output:
Index(['blah1', 'blah2'], dtype='object')
