Data updated!
I have an example data set:
| Target | Start | sequence |
|---|---|---|
| A | y1 | ccc |
| A | y2 | cct |
| A | y3 | aag |
| A | y3 | act |
| B | y1 | aaa |
| B | y4 | aat |
and I am trying to get a data set like this in R:
| Target | Start | Start | sequence |
|---|---|---|---|
| A | y1 | y2 | ccc,cct |
| A | y1 | y3 | ccc,aag,act |
| A | y2 | y3 | cct,aag,act |
| B | y1 | y4 | aaa,aat |
Every Start value always has a Target. For each Target, I am looking for every combination of two different Start values, without repeating a pair, together with the list of sequences that belong to those two Start values. I have tried mutate() and combn() with help from a linked post, but did not get the result I wanted.
Can anyone help me and give me a chance to learn further?
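For anyone who wants to reproduce this, the first table can be built as a data frame. This is only a minimal sketch: the name df is an assumption, chosen because the answers below refer to the data by that name.

```r
# Example data from the first table.
# The object name `df` is an assumption, matching the name used in the answers.
df <- data.frame(
  Target   = c("A", "A", "A", "A", "B", "B"),
  Start    = c("y1", "y2", "y3", "y3", "y1", "y4"),
  sequence = c("ccc", "cct", "aag", "act", "aaa", "aat"),
  stringsAsFactors = FALSE  # keep the columns as character on older R versions
)
```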
CodePudding user response:
You may achieve this by using combn for each group.
library(dplyr)
library(tidyr)
df %>%
  group_by(Target) %>%
  summarise(Start = combn(Start, 2, function(x)
              list(setNames(x, c('start', 'end')))),
            Sequence = combn(sequence, 2, toString), .groups = 'drop') %>%
  unnest_wider(Start)
# Target start end Sequence
# <chr> <chr> <chr> <chr>
#1 A y1 y2 ccc, cct
#2 A y1 y3 ccc, aag
#3 A y2 y3 cct, aag
#4 B y1 y4 aaa, aat
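The output above pairs single sequences. If a Target has several rows with the same Start (as A does for y3 in the updated data), one option is to collapse those rows first and only then form the pairs. The following is a minimal sketch, not a drop-in replacement. It assumes df holds the example data and that every Target has at least two distinct Start values (combn() fails otherwise). It uses group_modify() in place of the summarise() call, and it keeps the start/end/Sequence column names from the code above.

```r
library(dplyr)

# Example data (assumed to match the question's first table)
df <- data.frame(
  Target   = c("A", "A", "A", "A", "B", "B"),
  Start    = c("y1", "y2", "y3", "y3", "y1", "y4"),
  sequence = c("ccc", "cct", "aag", "act", "aaa", "aat"),
  stringsAsFactors = FALSE
)

# Step 1: one row per Target/Start, with all of its sequences joined by commas
collapsed <- df %>%
  group_by(Target, Start) %>%
  summarise(sequence = paste(sequence, collapse = ","), .groups = "drop")

# Step 2: within each Target, pair up the rows with combn()
result <- collapsed %>%
  group_by(Target) %>%
  group_modify(~ {
    idx <- combn(nrow(.x), 2)  # 2 x k matrix of row-index pairs
    data.frame(
      start    = .x$Start[idx[1, ]],
      end      = .x$Start[idx[2, ]],
      Sequence = paste(.x$sequence[idx[1, ]], .x$sequence[idx[2, ]], sep = ","),
      stringsAsFactors = FALSE
    )
  }) %>%
  ungroup()

result
```

Collapsing first means combn() only ever sees distinct Start values, so no y3/y3 pair is produced for Target A.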
CodePudding user response:
Here is another tidyverse approach without the use of combn().
1. group_by(Target, Start) so that any sequence with the same Target and Start can be collapsed to a single row
2. Drop the Start column in group_by()
3. Change the Start column into numeric, so that we can directly compare the Start values
4. Create a Start2 column containing the Start values greater than itself, and extract the corresponding sequence string and store it in a sequence2 column
5. Expand the dataframe based on Start2 and sequence2 (since there would be multiple outputs per row by sapply)
6. group_by(Target, Start, Start2) so that we can paste sequence with sequence2
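library(tidyverse)

df %>%
  # collapse sequences that share the same Target and Start into one row
  group_by(Target, Start) %>%
  summarize(sequence = paste0(sequence, collapse = ","), .groups = "drop_last") %>%
  # still grouped by Target: for each row, collect the later Start values and their sequences
  # (x is the numeric Start value itself, so compare against x directly)
  mutate(Start_num = as.numeric(str_extract(Start, "\\d+")),
         Start2 = sapply(Start_num, function(x) Start[which(Start_num > x)]),
         sequence2 = sapply(Start_num, function(x) sequence[which(Start_num > x)])) %>%
  # one row per (Start, Start2) pair; rows with no later Start are dropped
  unnest(cols = c(Start2, sequence2)) %>%
  group_by(Target, Start, Start2) %>%
  summarize(sequence = paste0(c(sequence, sequence2), collapse = ","), .groups = "drop")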
# A tibble: 4 × 4
Target Start Start2 sequence
<chr> <chr> <chr> <chr>
1 A y1 y2 ccc,cct
2 A y1 y3 ccc,aag,act
3 A y2 y3 cct,aag,act
4 B y1 y4 aaa,aat
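Converting Start to numeric matters once the labels have more than one digit, because character strings are compared position by position rather than by numeric value. A small illustration, using hypothetical labels (y10 does not appear in the example data):

```r
library(stringr)

starts <- c("y2", "y10")  # hypothetical labels; "y10" is not in the example data

# Compared as strings, "y10" sorts before "y2" because "1" comes before "2"
starts[1] < starts[2]

# Compared as numbers after extracting the digits, 2 < 10 as intended
start_num <- as.numeric(str_extract(starts, "\\d+"))
start_num[1] < start_num[2]
```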
