R: select the IDs with 2 records (rows) in a certain domain-CodePudding

ID <- c(1,1,1,2,2,2,2,3,4,4,4,5,5,5,5,5)

date <- c('2018-06-29','2018-08-29','2018-08-29','2020-12-11','2020-12-12','2021-01-21','2021-12-21','2020-11-29','2019-07-12','2018-05-13','2021-06-08','2017-11-14','2018-07-15','2019-01-15','2020-02-19','2021-11-09')

domain <- c('condition','condition','procedure','condition','procedure','condition','condition','procedure','condition','procedure','procedure','procedure','condition','procedure','condition','procedure')

df <- cbind(ID, date, domain)

Example dateset:

ID  date        domain
1   2018-06-29  condition
1   2018-08-29  condition
1   2018-08-29  procedure
2   2020-12-11  condition
2   2020-12-12  procedure
2   2021-01-21  condition
2   2021-12-21  condition
3   2020-11-29  procedure
4   2019-07-12  condition
4   2018-05-13  procedure
4   2021-06-08  procedure
5   2017-11-14  procedure
5   2018-07-15  condition
5   2019-01-15  procedure
5   2020-02-19  condition
5   2021-11-09  procedure

As you can see, some IDs are with one row in domain 'procedure', while others are with two or three rows in domain 'procedure'. May I ask how to separate the IDs with different number of rows in domain 'procedure'?

Thank you!!

CodePudding user response：

We can use split to split the 'ID' based on the number of 'procedure' elements in domain per 'ID' into a list of vectors

with(df, split(ID, ave(domain == "procedure", ID, FUN = sum)))

-output

$`1`
[1] 1 1 1 2 2 2 2 3

$`2`
[1] 4 4 4

$`3`
[1] 5 5 5 5 5

If we want unique IDs

with(with(df, stack(tapply(domain == "procedure", ID, FUN = sum))), 
      split(as.character(ind), values))

Update

If we wanted the separate the data.frame into subset of data.frame based on the number of occurrence of 'procedure' for each 'ID', then split on the dataset instead of the column

 split(df, ave(domain == "procedure", ID, FUN = sum))
$`1`
  ID       date    domain
1  1 2018-06-29 condition
2  1 2018-08-29 condition
3  1 2018-08-29 procedure
4  2 2020-12-11 condition
5  2 2020-12-12 procedure
6  2 2021-01-21 condition
7  2 2021-12-21 condition
8  3 2020-11-29 procedure

$`2`
   ID       date    domain
9   4 2019-07-12 condition
10  4 2018-05-13 procedure
11  4 2021-06-08 procedure

$`3`
   ID       date    domain
12  5 2017-11-14 procedure
13  5 2018-07-15 condition
14  5 2019-01-15 procedure
15  5 2020-02-19 condition
16  5 2021-11-09 procedure

data

df <- data.frame(ID, date, domain)

CodePudding user response：

If you want a table of counts, you can do this:

df %>% 
  group_by(ID) %>% 
  summarise(nProcedure = sum(domain == "procedure")) %>% 
  group_by(nProcedure) %>% 
  summarise(n = length(ID),
            ID = paste(ID, collapse = ", "))

  nProcedure     n ID     
       <int> <int> <chr>  
1          1     3 1, 2, 3
2          2     1 4      
3          3     1 5