Filtering a large data frame based on column values using R-CodePudding

I have a very large dataframe with almost 502493 rows and 261 columns. I want to filter it and need IDs with specific codes (codes starting with 'E'). This is how my data looks like,

IDs	code1	code2
1	C443	E109
2	AX31	M223
1	E341	QWE1
3	E131	M223

My required output is IDs with codes starting with 'E' only.

IDs	code
1	E109
1	E341
3	E131

I am trying to use the 'filter' of dplyr package but not getting the required output. Thanks in advance

CodePudding user response：

We can reshape to 'long' format with pivot_longer and filter by creating a logical vector from the first character extracted (with substr)

library(dplyr)
library(tidyr)
df1 %>%
   pivot_longer(cols = starts_with("code"), 
       values_to = 'code', names_to = NULL) %>%   
   filter(substr(code, 1, 1) == "E")

-output

# A tibble: 3 × 2
    IDs code 
  <int> <chr>
1     1 E109 
2     1 E341 
3     3 E131

If the data is really big, we may do a filter before the pivot_longer to keep only rows having at least one 'E' in the column

df1 %>%
   filter(if_any(starts_with('code'), ~ substr(., 1, 1) == 'E')) %>% 
   pivot_longer(cols = starts_with("code"), 
       values_to = 'code', names_to = NULL) %>%   
   filter(substr(code, 1, 1) == "E")

If it is a very big data, another option is data.table. Convert the data.frame to 'data.table' (setDT), loop across the columns of interest (.SDcols) with lapply, replace the elements that are not starting with "E" to NA, then use fcoalesce to get the first non-NA element for each row using do.call

library(data.table)
na.omit(setDT(df1)[, .(IDs, code = do.call(fcoalesce, 
    lapply(.SD, function(x) replace(x, substr(x, 1, 1) != "E", 
      NA)))), .SDcols = patterns("code")])

-output

   IDs code
1:   1 E109
2:   1 E341
3:   3 E131

data

df1 <- structure(list(IDs = c(1L, 2L, 1L, 3L), code1 = c("C443", "AX31", 
"E341", "E131"), code2 = c("E109", "M223", "QWE1", "M223")), 
class = "data.frame", row.names = c(NA, 
-4L))