I have the following dataframe
id date other variables..
A 2019Q4
A 2020Q4
A 2021Q4
B 2018Q4
B 2019Q4
B 2020Q4
B 2021Q4
C 2020Q4
C 2021Q4
D 2021Q4
E 2018Q4
E 2019Q4
E 2020Q4
E 2021Q4
. .
I want to group by id and keep those ids if it contains all of the designated values (i.e. 2019Q4, 2020Q4, 2021Q4) then extract rows that correspond to those values. isin() won't work because it won't drop C and D.
desired output
A 2019Q4
A 2020Q4
A 2021Q4
B 2019Q4
B 2020Q4
B 2021Q4
E 2019Q4
E 2020Q4
E 2021Q4
. .
CodePudding user response:
You can use set operations to filter the id and isin for the date:
target = {'2019Q4', '2020Q4', '2021Q4'}
id_ok = df.groupby('id')['date'].agg(lambda x: target.issubset(x))
df2 = df[df['date'].isin(target) & df['id'].map(id_ok)]
or, using transform:
target = {'2019Q4', '2020Q4', '2021Q4'}
mask = df.groupby('id')['date'].transform(lambda x: target.issubset(x))
df2 = df[df['date'].isin(target) & mask]
output:
id date other
0 A 2019Q4 NaN
1 A 2020Q4 NaN
2 A 2021Q4 NaN
4 B 2019Q4 NaN
5 B 2020Q4 NaN
6 B 2021Q4 NaN
11 E 2019Q4 NaN
12 E 2020Q4 NaN
13 E 2021Q4 NaN
id_ok:
id
A True
B True
C False
D False
E True
Name: date, dtype: bool
CodePudding user response:
Aggregate the dates for each ID. Then keep only the IDs which have all your dates
ids = df.groupby("id")["date"].agg(set).apply(lambda x: x.issuperset({"2019Q4", "2020Q4", "2021Q4"}))
>>> df[df["id"].isin(ids.index.where(ids).dropna())&df["date"].isin(["2019Q4", "2020Q4", "2021Q4"])]
id date
0 A 2019Q4
1 A 2020Q4
2 A 2021Q4
4 B 2019Q4
5 B 2020Q4
6 B 2021Q4
11 E 2019Q4
12 E 2020Q4
13 E 2021Q4
