I have a dataframe with 1000 IDs, each with > 100 rows of data. I want to remove all IDs that meet a criteria based on another column at least once.
As an example with the dummy data below, I want to remove all IDs, where var2 is <20 at least once.
How do I do this without spelling out each individual ID to be dropped?
dummy data of similar structure:
data <- data.frame(ID = rep(c('B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'B10'), each = 5),
var1 = rep(c('a', 'b', 'b', 'c', 'd','a', 'c', 'c', 'b', 'a' ), times = 5),
var2 = sample(1:100, 50))
I have tried using the function droplevel, but I do not want to spell out every individual ID to be dropped.
CodePudding user response:
tidyverse
df <- data.frame(ID = rep(c('B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'B10'), each = 5),
var1 = rep(c('a', 'b', 'b', 'c', 'd','a', 'c', 'c', 'b', 'a' ), times = 5),
var2 = sample(1:100, 50))
library(tidyverse)
df %>%
group_by(ID) %>%
filter(!any(var2 < 20)) %>%
ungroup()
#> # A tibble: 25 x 3
#> ID var1 var2
#> <chr> <chr> <int>
#> 1 B2 a 100
#> 2 B2 c 67
#> 3 B2 c 64
#> 4 B2 b 78
#> 5 B2 a 73
#> 6 B3 a 83
#> 7 B3 b 32
#> 8 B3 b 23
#> 9 B3 c 65
#> 10 B3 d 96
#> # ... with 15 more rows
Created on 2022-01-14 by the reprex package (v2.0.1)
data.table
library(data.table)
setDT(df)[, .SD[!any(var2 < 20)], by = ID]
#> ID var1 var2
#> 1: B1 a 47
#> 2: B1 b 81
#> 3: B1 b 95
#> 4: B1 c 48
#> 5: B1 d 43
#> 6: B4 a 77
#> 7: B4 c 54
#> 8: B4 c 23
#> 9: B4 b 55
#> 10: B4 a 25
#> 11: B6 a 98
#> 12: B6 c 99
#> 13: B6 c 86
#> 14: B6 b 92
#> 15: B6 a 33
#> 16: B7 a 73
#> 17: B7 b 94
#> 18: B7 b 62
#> 19: B7 c 40
#> 20: B7 d 49
#> 21: B10 a 66
#> 22: B10 c 44
#> 23: B10 c 35
#> 24: B10 b 76
#> 25: B10 a 38
#> ID var1 var2
Created on 2022-01-14 by the reprex package (v2.0.1)
CodePudding user response:
I just found the answer here: How to remove all rows belonging to a particular group when only one row fulfills the condition in R?
This does the trick:
new.data <- subset(data, ave(var2 >=20, ID, FUN = all))
