I have two DataFrames. The first one is a list of ids, and for each id there is a list of days from 01-01-2013 until 12-31-2013. I have 30 unique ids. For example, df1:
id date
1 01-01-2013
1 01-02-2013
1 01-03-2013
....
1 12-31-2013
2 01-01-2013
2 01-02-2013
....
2 12-31-2013
My second DataFrame has id, sub_id (each sub_id belongs to some id), date of diagnosis, and a disease label, 0 or 1. For example:
id sub_id date disease
1 345 01-01-2013 0
1 967 01-01-2013 1
2 843 06-03-2013 0
and so on. I want to merge the DataFrames based on id and date, so that I get every consecutive day of the year for each id, with the disease label. I don't care about the sub_id. HOWEVER, if the same date appears twice for an id and at least one sub_id is sick, I want to return a label of 1 for that day. If there is no value, fill the NaN with 0. For example, the expected output:
id date disease
1 01-01-2013 1
1 01-02-2013 0
1 01-03-2013 0
.....
I have tried this code:
df2.sort_values('disease').drop_duplicates(subset='date', keep='last').merge(df1,on=['id','date'],how='right')
but for some ids that should have label 1 on a given day, the result is label 0. Thanks.
CodePudding user response:
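I'd use groupby to take the maximum disease label per id and date, so a day with at least one sick sub_id gets 1, then merge onto df1 and fill the days without a diagnosis with 0. Note that drop_duplicates(subset='date') in your attempt ignores id, so it keeps only one row per date across all ids, which is likely why some labels are lost:
df2.groupby(['id','date'], as_index=False)['disease'].max().merge(df1, on=['id','date'], how='right').fillna({'disease': 0})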
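The groupby-then-merge idea can be checked on a small example. The sketch below is only an illustration: the sample frames are hypothetical stand-ins for the question's data (three days instead of a full year, and made-up diagnosis rows), and it assumes the date column holds strings in the same format in both frames.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's frames
# (three days per id instead of the whole year).
days = pd.date_range("2013-01-01", "2013-01-03").strftime("%m-%d-%Y")
df1 = pd.DataFrame(
    [(i, d) for i in (1, 2) for d in days], columns=["id", "date"]
)
df2 = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "sub_id": [345, 967, 843],
        "date": ["01-01-2013", "01-01-2013", "01-03-2013"],
        "disease": [0, 1, 0],
    }
)

# One row per (id, date): max() is 1 if any sub_id was sick that day.
# Selecting only 'disease' leaves sub_id out of the result.
daily = df2.groupby(["id", "date"], as_index=False)["disease"].max()

# Merge onto the full calendar so every day of every id is kept,
# then label days without any diagnosis as 0.
result = df1.merge(daily, on=["id", "date"], how="left")
result["disease"] = result["disease"].fillna(0).astype(int)

print(result)
```

Here the merge starts from df1 with how='left', which should give the same rows as merging df1 in with how='right'; it is just a matter of which frame is written first. If the date columns have different dtypes in the two frames (for example strings in one and datetimes in the other), the keys will not match, so converting both with pd.to_datetime beforehand may be needed.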
