I have a dataset like the one below (actual dataset has 5M rows with no gaps), where I am trying to filter out rows where the sum of all numeric columns for the row itself and its previous and next rows is equal to zero.
N.B.
Timeis adttmcolumn in the actual data.- Number of consecutive zeros can be more than 3 rows and in that case multiple rows will be filtered out.
# A tibble: 13 x 4
group Time Val1 Val2
<chr> <int> <dbl> <dbl>
1 A 1 0 0
2 B 1 0.1 0
3 A 3 0 0
4 B 3 0 0
5 A 2 0 0
6 B 2 0.2 0.2
7 B 4 0 0
8 A 4 0 0.1
9 A 5 0 0
10 A 6 0 0
11 B 6 0.1 0.5
12 B 5 0.1 0.2
13 A 7 0 0
See the example below for what is desired:
# A tibble: 13 x 8
group Time Val1 Val2 rowsum leadsum lagsum sum
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 0 0 0 0 NA NA
2 A 2 0 0 0 0 0 0 This will get filtered out!
3 A 3 0 0 0 0.1 0 0.1
4 A 4 0 0.1 0.1 0 0 0.1
5 A 5 0 0 0 0 0.1 0.1
6 A 6 0 0 0 0 0 0 This will get filtered out!
7 A 7 0 0 0 NA 0 NA
8 B 1 0.1 0 0.1 0.4 NA NA
9 B 2 0.2 0.2 0.4 0 0.1 0.5
10 B 3 0 0 0 0 0.4 0.4
11 B 4 0 0 0 0.3 0 0.3
12 B 5 0.1 0.2 0.3 0.6 0 0.9
13 B 6 0.1 0.5 0.6 NA 0.3 NA
So far I have tried to do this simply by using dplyr::lag() and dplyr::lead(); but this is extremely inefficient and throws a memory allocation error for the actual dataset:
> Error in Sys.getenv("TESTTHAT") : > could not allocate memory (0 Mb) in C function 'R_AllocStringBuffer'
This is what I have so far; I can first get the sum of Val1 and Val2 and then perform lead and lag but that won't resolve the issue.
df0 %>%
##arrange by group is not necessary since we're grouping by that var
arrange(group, Time) %>%
group_by(group) %>%
mutate(sum = Val1 Val2 lag(Val1) lag(Val2) lead(Val1) lead(Val2)) # %>%
# filter(is.na(sum) | sum != 0)
## commenting out filter to show the full results
# > # A tibble: 13 x 5
# > # Groups: group [2]
# > group Time Val1 Val2 sum
# > <chr> <int> <dbl> <dbl> <dbl>
# > 1 A 1 0 0 NA
# ! - A 2 0 0 0
# > 2 A 3 0 0 0.1
# > 3 A 4 0 0.1 0.1
# > 4 A 5 0 0 0.1
# ! - A 6 0 0 0
# > 5 A 7 0 0 NA
# > 6 B 1 0.1 0 NA
# > 7 B 2 0.2 0.2 0.5
# > 8 B 3 0 0 0.4
# > 9 B 4 0 0 0.3
# > 10 B 5 0.1 0.2 0.9
# > 11 B 6 0.1 0.5 NA
Toy dataset:
df0 <- structure(list(group = c("A", "B", "A", "B", "A", "B",
"B", "A", "A", "A", "B", "B", "A"),
Time = c(1L, 1L, 3L, 3L, 2L, 2L, 4L, 4L, 5L, 6L, 6L, 5L, 7L),
Val1 = c(0, 0.1, 0, 0, 0, 0.2, 0, 0, 0, 0, 0.1, 0.1, 0),
Val2 = c(0, 0, 0, 0, 0, 0.2, 0, 0.1, 0, 0, 0.5, 0.2, 0)),
row.names = c(NA, -13L),
class = c("tbl_df", "tbl", "data.frame"))
CodePudding user response:
We can use base rle, or its faster implementation, rlenc implemented in the purler package.
library(tidyverse)
library(purler)
subsetter <- function(df){
df %>%
select(where(is.double)) %>%
rowSums() %>%
purler::rlenc() %>%
filter(lengths >= 3L & values == 0L) %>%
transmute(ids = map2(start, start lengths, ~ (.x 1) : (.y - 2))) %>%
unlist(use.names = F)
}
# to get data as shown in example
df0 <- df0 %>%
mutate(Time = as.character(Time)) %>%
arrange(group, Time)
edge_cases <- tribble(
~group, ~Time, ~Val1, ~Val2,
"C", "1", 0, 0,
"C", "2", 0, 0,
"C", "3", 0, 0,
"C", "4", 0, 0,
)
df1 <- rbind(df0, edge_cases)
df1 %>%
`[`(-subsetter(.),)
# A tibble: 13 x 4
group Time Val1 Val2
<chr> <chr> <dbl> <dbl>
1 A 1 0 0
2 A 3 0 0
3 A 4 0 0.1
4 A 5 0 0
5 A 7 0 0
6 B 1 0.1 0
7 B 2 0.2 0.2
8 B 3 0 0
9 B 4 0 0
10 B 5 0.1 0.2
11 B 6 0.1 0.5
12 C 1 0 0
13 C 4 0 0
bench::mark(df1 %>% `[`(-subsetter(.),))[,c(3,5,7)]
# A tibble: 1 x 3
median mem_alloc n_itr
<bch:tm> <bch:byt> <int>
1 3.91ms 9.38KB 93
CodePudding user response:
Since you tagged data.table, here's a data.table-native solution:
library(data.table)
dt0 <- as.data.table(df0)
setorder(dt0, Time) # add 'group' if you want
isnum <- names(which(sapply(dt0, function(z) is.numeric(z) & !is.integer(z))))
isnum
# [1] "Val1" "Val2"
dt0[, sum0 := abs(rowSums(.SD)) < 1e-9, .SDcols = isnum
][, .SD[(c(0,sum0[-.N]) sum0 c(sum0[-1],0)) < 3,], by = .(group)
][, sum0 := NULL ]
# group Time Val1 Val2
# <char> <int> <num> <num>
# 1: A 1 0.0 0.0
# 2: A 3 0.0 0.0
# 3: A 4 0.0 0.1
# 4: A 5 0.0 0.0
# 5: A 7 0.0 0.0
# 6: B 1 0.1 0.0
# 7: B 2 0.2 0.2
# 8: B 3 0.0 0.0
# 9: B 4 0.0 0.0
# 10: B 5 0.1 0.2
# 11: B 6 0.1 0.5
Per your comment, both A-2 and A-6 have been removed.
Efficiencies:
rowSumsis fast and efficient;- We shift using direct indexing with a default of
0; indata.table, this is handled very efficiently, and does not incur the (admittedly small) overhead oflead/lag/shiftcalls; - After we sum a row, we only row-shift this one value instead of four row-shifts per row.
CodePudding user response:
library(tidyverse)
df0 %>%
arrange(group, Time) %>% # EDIT to arrange by time (and group for clarity)
rowwise() %>%
mutate(sum = sum(c_across(Val1:Val2))) %>%
group_by(group) %>%
filter( !(sum == 0 & lag(sum, default = 1) == 0 & lead(sum, default = 1) == 0)) %>%
ungroup()
# A tibble: 11 x 5
group Time Val1 Val2 sum
<chr> <int> <dbl> <dbl> <dbl>
1 A 1 0 0 0
2 A 3 0 0 0
3 A 4 0 0.1 0.1
4 A 5 0 0 0
5 A 7 0 0 0
6 B 1 0.1 0 0.1
7 B 2 0.2 0.2 0.4
8 B 3 0 0 0
9 B 4 0 0 0
10 B 5 0.1 0.2 0.3
11 B 6 0.1 0.5 0.6
