I have a pandas DataFrame structured as follows:
ID Class
0 2431214 16,41,9
1 2497796 14,16,18,20,24,35,41,42
2 1407550 20,21,24,25,26,27,28,3,34,35,4,8
3 472723 1,17,22
4 423046 28,30,32,34,39,4,42,8
Class is a string attribute. I want to filter rows where Class contains at least one of the elements in a given list, for example, if my filter is l = [16, 27], I should get the following:
ID Class
0 2431214 16,41,9
1 2497796 14,16,18,20,24,35,41,42
2 1407550 20,21,24,25,26,27,28,3,34,35,4,8
This behaviour I'm looking for is similar to SQL's IN statement, which works fine. However, pandas' isin is not working for me (it's returning entries with only exactly one of the filter classes):
l = [16, 27]
filtered = df.loc[df['Class'].isin(l)]
ID Class
210 30359 16
945 46307 27
3641 81222 27
3817 77365 27
4185 89828 16
How can I filter my DataFrame so that entries with at least one value satisfying the filter condition are returned?
CodePudding user response:
As you have strings, the most optimal is likely to use a regex with word boundaries:
pattern = '|'.join(map(str, l))
out = df[df['Class'].str.contains(fr'\b(?:{pattern})\b')]
Output:
ID Class
0 2431214 16,41,9
1 2497796 14,16,18,20,24,35,41,42
2 1407550 20,21,24,25,26,27,28,3,34,35,4,8
If you want to know which values were detected:
# first match
df['Class'].str.extract(fr'\b({pattern})\b', expand=False)
# all matches
df['Class'].str.extractall(fr'\b({pattern})\b')[0].groupby(level=0).agg(','.join)
