Assume that I have this DataFrame (Animals column is of type pandas.Series):
| ID | Animals |
|---|---|
| 1 | [cat, dog, chicken] |
| 2 | [penguin] |
And these lists (they can be NumPy arrays or pandas Series if that is better for performance):
mammals = ['cat', 'dog', 'cow', 'sheep']
birds = ['chicken', 'duck', 'penguin']
I am trying to add two columns to my DataFrame, ContainsBirds and ContainsMammals, based on the contents of the Animals column.
Here is the final expected output:
| ID | Animals | ContainsBirds | ContainsMammals |
|---|---|---|---|
| 1 | [cat, dog, chicken] | 1.0 | 1.0 |
| 2 | [penguin] | 1.0 | 0.0 |
CodePudding user response:
You can put the lists in a dictionary and, for each one, test whether a row matches at least one value by converting the list to a set and using isdisjoint. Cast the resulting booleans to floats to get 0.0 and 1.0; for 0 and 1, use .astype(int) instead:
d = {'Birds': birds, 'Mammals': mammals}
for k, v in d.items():
    df[f'Contains{k}'] = (~df['Animals'].map(set(v).isdisjoint)).astype(float)
print(df)
ID Animals ContainsBirds ContainsMammals
0 1 [cat, dog, chicken] 1.0 1.0
1 2 [penguin] 1.0 0.0
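For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The DataFrame construction is an assumption reconstructed from the table in the question (each cell is taken to be a plain Python list), and the names `groups`, `member_set`, and `overlaps` are only illustrative. `set.isdisjoint` returns True when the two collections share no element, so inverting it flags the rows that contain at least one member.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed: each cell holds a Python list).
df = pd.DataFrame({
    "ID": [1, 2],
    "Animals": [["cat", "dog", "chicken"], ["penguin"]],
})
mammals = ["cat", "dog", "cow", "sheep"]
birds = ["chicken", "duck", "penguin"]

groups = {"Birds": birds, "Mammals": mammals}
for name, members in groups.items():
    member_set = set(members)  # build the set once per group, not once per row
    # isdisjoint(row) is True when the row shares no element with the group,
    # so invert it to mark rows that contain at least one member.
    overlaps = ~df["Animals"].map(member_set.isdisjoint)
    df[f"Contains{name}"] = overlaps.astype(float)

print(df)
```

Swapping `.astype(float)` for `.astype(int)` should give 0/1 instead of 0.0/1.0, and dropping the cast keeps plain booleans.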
CodePudding user response:
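Using a list comprehension that checks, for each row, whether its animals intersect each list (the new columns are named Birds and Mammals here):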
lists = [birds, mammals]
names = ['Birds', 'Mammals']
df[names] = [[int(bool(set(l).intersection(x))) for l in lists]
             for x in df['Animals']]
output:
ID Animals Birds Mammals
0 1 [cat, dog, chicken] 1 1
1 2 [penguin] 1 0
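A possible variant of the list-comprehension approach, sketched under the same assumption that each Animals cell is a Python list. It converts each lookup list to a set once, outside the loop, and uses the ContainsBirds and ContainsMammals names requested in the question. The names `group_sets`, `flags`, and `result` are illustrative, and joining a helper DataFrame is just one way to attach the columns.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed: each cell holds a Python list).
df = pd.DataFrame({
    "ID": [1, 2],
    "Animals": [["cat", "dog", "chicken"], ["penguin"]],
})
mammals = ["cat", "dog", "cow", "sheep"]
birds = ["chicken", "duck", "penguin"]

# Convert each lookup list to a set once; keys double as the new column names.
group_sets = {"ContainsBirds": set(birds), "ContainsMammals": set(mammals)}

# One list of 0/1 flags per row: 1 if the row shares any animal with the group.
flags = [
    [int(not members.isdisjoint(animals)) for members in group_sets.values()]
    for animals in df["Animals"]
]

# Attach the flags as new columns, aligned on the original index.
result = df.join(pd.DataFrame(flags, columns=list(group_sets), index=df.index))
print(result)
```

Which version is faster likely depends on the data, so it is worth timing both on a realistic sample before choosing.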
