I have a very large dataframe with almost 502493 rows and 261 columns. I want to filter it and need IDs with specific codes (codes starting with 'E'). This is how my data looks like,
| IDs | code1 | code2 |
|---|---|---|
| 1 | C443 | E109 |
| 2 | AX31 | M223 |
| 1 | E341 | QWE1 |
| 3 | E131 | M223 |
My required output is IDs with codes starting with 'E' only.
| IDs | code |
|---|---|
| 1 | E109 |
| 1 | E341 |
| 3 | E131 |
I am trying to use the 'filter' of dplyr package but not getting the required output. Thanks in advance
CodePudding user response:
We can reshape to 'long' format with pivot_longer and filter by creating a logical vector from the first character extracted (with substr)
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = starts_with("code"),
values_to = 'code', names_to = NULL) %>%
filter(substr(code, 1, 1) == "E")
-output
# A tibble: 3 × 2
IDs code
<int> <chr>
1 1 E109
2 1 E341
3 3 E131
If the data is really big, we may do a filter before the pivot_longer to keep only rows having at least one 'E' in the column
df1 %>%
filter(if_any(starts_with('code'), ~ substr(., 1, 1) == 'E')) %>%
pivot_longer(cols = starts_with("code"),
values_to = 'code', names_to = NULL) %>%
filter(substr(code, 1, 1) == "E")
If it is a very big data, another option is data.table. Convert the data.frame to 'data.table' (setDT), loop across the columns of interest (.SDcols) with lapply, replace the elements that are not starting with "E" to NA, then use fcoalesce to get the first non-NA element for each row using do.call
library(data.table)
na.omit(setDT(df1)[, .(IDs, code = do.call(fcoalesce,
lapply(.SD, function(x) replace(x, substr(x, 1, 1) != "E",
NA)))), .SDcols = patterns("code")])
-output
IDs code
1: 1 E109
2: 1 E341
3: 3 E131
data
df1 <- structure(list(IDs = c(1L, 2L, 1L, 3L), code1 = c("C443", "AX31",
"E341", "E131"), code2 = c("E109", "M223", "QWE1", "M223")),
class = "data.frame", row.names = c(NA,
-4L))
