I have a dataset where groups undergo treatments at different times, and I need to log the year in which the groups first become treated, else giving the value of 0 for all non-treated groups.
df = pd.DataFrame([['CA',2014,0],['CA',2015,0],['CA',2016,1],['CA',2017,1],
['WA',2011,0],['WA',2012,1],['WA',2013,1],['TX',2010,0]],
columns=['Group_ID','Year','Treated'])
The dataframe should look like this once complete:
| Group_ID | Year | Treated | First_Treated |
|---|---|---|---|
| CA | 2014 | 0 | 0 |
| CA | 2015 | 0 | 0 |
| CA | 2016 | 1 | 2016 |
| CA | 2017 | 1 | 2016 |
| WA | 2011 | 0 | 0 |
| WA | 2012 | 1 | 2012 |
| WA | 2013 | 1 | 2012 |
| TX | 2010 | 0 | 0 |
The Python code below returns every subsequent year value rather than the first year of treatment. I have tried agg() and min() functions but neither work properly.
df['first_treated'] = np.where(df['treated']==1, df['year'], 0)
I have tried agg() and min() functions but neither work properly.
df['first_treated'] = np.where(df['treated']==1, df['year'].min, 0)
I have also used the R code in Create a group variable first.treat indicating the first year when each unit becomes treated, but using an empty first_treated column, no data is inserted into the column with the mutate() function. I receive no errors using that R script on the similar pandas dataframe.
CodePudding user response:
We can create the group key with cumsum , then transform the first value assign it back
s = df['Treated'].eq(0)
df['new'] = df[~s].groupby(df['Treated'].eq(0).cumsum())['Year'].transform('first')
df.new.fillna(0,inplace=True)
#df.new = df.new.astype(int)
df
Group_ID Year Treated new
0 CA 2014 0 0.0
1 CA 2015 0 0.0
2 CA 2016 1 2016.0
3 CA 2017 1 2016.0
4 WA 2011 0 0.0
5 WA 2012 1 2012.0
6 WA 2013 1 2012.0
7 TX 2010 0 0.0
CodePudding user response:
For an R version, we can use the tidyverse package to group_by(group_id) and then use mutate() to create the first_treated column, similar to how you've attempted the problem. We can use the lubridate package to easily handle dates.
df = data.frame(
group_id = c(rep("CA", 4), rep("WA", 3), "TX"),
year = c("2014", "2015", "2016", "2017", "2011", "2012", "2013", "2010"),
treated = c(0,0,1,1,0,1,1,0))
df %>%
group_by(group_id) %>%
mutate(year = lubridate::as_date(year, format = '%Y'),
first_treated = ifelse(treated == 1, min(lubridate::year(year)), 0))
