I have a data frame grouped by multiple columns but in this example it would be grouped only by Year.
Year Animal1 Animal2
0 2002 Dog Mouse,Lion
1 2002 Mouse
2 2002 Lion
3 2002 Duck
4 2010 Dog Cat
5 2010 Cat
6 2010 Lion
7 2010 Mouse
I would like for each group, from the rows where Animal2 is empty to filter out the rows where Animal2 does not appear in the column Animal1.
The expected output would be:
Year Animal1 Animal2
0 2002 Dog Mouse,Lion
1 2002 Mouse
2 2002 Lion
3 2010 Dog Cat
4 2010 Cat
Rows 0 & 3 stayed since Animal2 is not empty.
Rows 1 & 2 stayed since Mouse & Lion are in Animal2 for the first group.
Row 4 stayed since cat appear in Animal2 for the second group
CodePudding user response:
You can use masks and regexes:
# non empty Animal2
m1 = df['Animal2'].notna()
# make patterns with those Animals2 per Year
patterns = df[m1].groupby('Year')['Animal2'].agg('|'.join).str.replace(',', '|')
# for each Year select with the matching regex
m2 = (df.groupby('Year', group_keys=False)['Animal1']
.apply(lambda g: g.str.fullmatch(patterns[g.name]))
)
out = df.loc[m1|m2]
Or sets:
m1 = df['Animal2'].notna()
sets = (df.loc[m1, 'Animal2'].str.split(',')
.groupby(df['Year'])
.agg(lambda x: set().union(*x))
)
m2 = (df.groupby('Year', group_keys=False)['Animal1']
.apply(lambda g: g.isin(sets[g.name]))
)
out = df.loc[m1|m2]
Output:
Year Animal1 Animal2
0 2002 Dog Mouse,Lion
1 2002 Mouse None
2 2002 Lion None
4 2010 Dog Cat
5 2010 Cat None
CodePudding user response:
Here is a solution using list comprehension
(df.loc[
[a1 in a2 for a1,a2 in zip(df['Animal1'],df['Year'].map(df['Animal2'].str.split(',').groupby(df['Year']).sum()))] |
df['Animal2'].notna()]
)
or
d = df['Animal2'].str.split(',').groupby(df['Year']).sum()
(df.loc[df.groupby('Year')['Animal1'].transform(lambda x: x.isin(d.loc[x.name])) |
df['Animal2'].notna()]
)
Output:
Year Animal1 Animal2
0 2002 Dog Mouse,Lion
1 2002 Mouse None
2 2002 Lion None
4 2010 Dog Cat
5 2010 Cat None
