I have an R data.frame of linear intervals:
df <- data.frame(id = paste0("i",1:15),
start = c(6575,7156,7949,45835,46347,47168,126804,127276,128127,157597,158074,158902,199129,199704,200507),
end = c(6928,7392,8260,46104,46610,47485,127079,127542,128417,157872,158340,159219,199374,199951,200938))
I also have an inter-interval distance cutoff:
inter.interval.distance.cutoff <- 3243
df is sorted by start and end. The first interval is assigned to group g1. From there on, the distance between an interval and the one preceding it is defined as the start of the current interval minus the end of the preceding interval. If that distance is less than or equal to inter.interval.distance.cutoff, the interval is assigned to the group of the preceding interval; otherwise it starts a new group (the group index is incremented by 1, which gives the new group label).
Here's my desired outcome:
df$group <- c(rep("g1",3), rep("g2",3), rep("g3",3), rep("g4",3), rep("g5",3))
Is there a fast way to do this?
CodePudding user response:
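# Gap = start of the current interval minus end of the preceding one, as defined in the question.
# (diff(df$start) would measure start-to-start distance instead, which only happens to agree on this data.)
df$group <- paste0('g', cumsum(c(1, df$start[-1] - df$end[-nrow(df)] > inter.interval.distance.cutoff)))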
    id  start    end group
1   i1   6575   6928    g1
2   i2   7156   7392    g1
3   i3   7949   8260    g1
4   i4  45835  46104    g2
5   i5  46347  46610    g2
6   i6  47168  47485    g2
7   i7 126804 127079    g3
8   i8 127276 127542    g3
9   i9 128127 128417    g3
10 i10 157597 157872    g4
11 i11 158074 158340    g4
12 i12 158902 159219    g4
13 i13 199129 199374    g5
14 i14 199704 199951    g5
15 i15 200507 200938    g5
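
The question defines the distance as the start of the current interval minus the end of the preceding one, which is not the same as the difference between consecutive starts. Below is a minimal sketch of that rule wrapped in a function. The name group_intervals and its arguments are hypothetical (not from the question), and it assumes the input is non-empty and already sorted:

```r
# Hypothetical helper: name and signature are illustrative only.
# Assumes start/end are non-empty, equal-length, and sorted by start.
group_intervals <- function(start, end, cutoff) {
  n <- length(start)
  # Gap from the end of the preceding interval to the start of the current one
  gaps <- start[-1] - end[-n]
  # The first interval opens g1; every gap above the cutoff opens a new group
  paste0("g", cumsum(c(TRUE, gaps > cutoff)))
}

# Two long intervals whose starts are 100 apart but whose gap is only 10
toy_groups <- group_intervals(start = c(0, 100), end = c(90, 110), cutoff = 50)
print(toy_groups)
```

With a cutoff of 50, both toy intervals should land in the same group, because the gap (100 - 90 = 10) is under the cutoff even though the starts are 100 apart. On the data in the question, group_intervals(df$start, df$end, inter.interval.distance.cutoff) should reproduce the desired g1 to g5 labels.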
