Apologies if I am not explaining this clearly. But I have the following dataset:
mydata = data.frame (Id =c (1,1,1,1,1,1,1,1,2,2,2,2),
Date = c("2001-01-31", "2001-02-13","2001-05-31",
"2001-06-02","2018-01-31","2018-03-31","2018-07-31",
"2019-04-04","2014-01-31","2014-02-02","2014-04-31",
"2014-05-18"),Outcome = c("CR","CR","Relapse","Relapse",
"CR","CR","CR","Relapse","CR", "CR","Relapse","CR"))
Which outputs the below. As you can see each patient is in certain phases at different times and I would like to capture the earliest dates of when each new phase starts per patient. I would like then to rename these phases to CR1, Relapse1, CR2, Relapse2 and so forth.
Id Date Outcome
1 2001-01-31 CR
1 2001-02-13 CR
1 2001-05-31 Relapse
1 2001-06-02 Relapse
1 2018-01-31 CR
1 2018-03-31 CR
1 2018-07-31 CR
1 2019-04-04 Relapse
2 2014-01-31 CR
2 2014-02-02 CR
2 2014-04-31 Relapse
2 2014-05-18 CR
This is the output I am trying to achieve:
Id CR1 Relapse1 CR2 Relapse2
1 2001-01-31 2001-05-31 2018-01-31 2019-04-04
2 2014-01-31 2014-04-31 2014-05-18 NA
I am not quite sure where to begin with this question and would appreciate any help! Thank you guys!
CodePudding user response:
With tidyverse you can try the following.
(If needed, first arrange by Date so it is in chronological order.)
Create a grouping value depending on change in the Outcome column (from remission to relapse and vice versa). I used rleid from data.table for convenience and created a temporary column, Grp. For example:
Id Date Outcome Grp
<dbl> <chr> <chr> <int>
1 1 2001-01-31 CR 1
2 1 2001-02-13 CR 1
3 1 2001-05-31 Relapse 2
4 1 2001-06-02 Relapse 2
5 1 2018-01-31 CR 3
6 1 2018-03-31 CR 3
7 1 2018-07-31 CR 3
8 1 2019-04-04 Relapse 4
9 2 2014-01-31 CR 1
10 2 2014-02-02 CR 1
11 2 2014-04-31 Relapse 2
12 2 2014-05-18 CR 3
You can see, within each Id, that when the Outcome changes, the Grp increases. That way, subsequent dates with the same Outcome will be included in the same Grp.
The .add argument allows us to add Grp to the prior grouping, which was just Id. So, grouping now by both Grp and Id, you can then slice the first row. slice(1) or slice(n = 1) will keep 1 row within the group. In this case, we have grouped by both Id and Grp, so only 1 row will be kept for a given Grp and Id combination.
Finally, you can add a row number that will allow for the wide output described (a consecutive sequence of numbers for CR and Relapse: 1, 2, 3, 4...). In this case, we group_by both Id and Outcome, and consecutively number the dates for this combination. pivot_wider will put the data into wide format if desired.
library(data.table)
library(tidyverse)
mydata %>%
group_by(Id) %>%
mutate(Grp = rleid(Outcome)) %>%
group_by(Grp, .add = T) %>%
slice(1) %>%
group_by(Id, Outcome) %>%
mutate(n = row_number()) %>%
pivot_wider(id_cols = Id, names_from = c(Outcome, n), values_from = Date)
Output
Id CR_1 Relapse_1 CR_2 Relapse_2
<dbl> <chr> <chr> <chr> <chr>
1 1 2001-01-31 2001-05-31 2018-01-31 2019-04-04
2 2 2014-01-31 2014-04-31 2014-05-18 NA
