Use a character vector to build a new dataframe containing rows in which strings from that vector ar-CodePudding

I want to query a database using a character vector query_list, and return a dataframe query_output. In this two-column output dataframe, each row corresponds to a single string from the query vector. First of the two columns in the output dataframe (called term) names this string, and the second column (called enzyme) lists all rows of the database in which the query string was found, defined by the column enzyme from the database).

My query and database look as follows:

query_list <- c("term(A)", "term(B)", "term(C)", "term(D15)")
database <- data.frame(enzyme = c("A1", "B1", "C1", "D1", "E1")
                       ,term = c("term(A);term(K);term(Y);term(G);term(F);"
                                 ,"term(A);term(K);term(Y);term(G);term(F);"
                                 ,"term(H);term(K);term(Y);term(C);term(F);"
                                 ,"term(H);term(B);term(Y);term(C);term(F);"
                                 ,"term(H);term(K);term(D15);term(G);term(F);"))

the database looks like this:

  enzyme                                       term
1     A1   term(A);term(K);term(Y);term(G);term(F);
2     B1   term(A);term(K);term(Y);term(G);term(F);
3     C1   term(H);term(K);term(Y);term(C);term(F);
4     D1   term(H);term(B);term(Y);term(C);term(F);
5     E1   term(H);term(K);term(D15);term(G);term(F);

The resulting dataframe query_output:

> query_output 
       term  enzyme
1   term(A)  A1, B1
2   term(B)      D1
3   term(C)  C1, D1
4 term(D15)      E1

Optimally, the solution would be pipeable, and not a loop (although anything will be appreciated). I don't say what I have tried because I don't really know how to go about it in a concise way.

CodePudding user response：

Try this:

library(tidyr)
database %>%
  mutate(term = sub(";$", "", term)) %>%
  separate_rows(term, sep = ";") %>%
  filter(term %in% query_list) %>%
  group_by(term) %>%
  summarise(enzyme = toString(enzyme))
# A tibble: 4 × 2
  term      enzyme
  <chr>     <chr> 
1 term(A)   A1, B1
2 term(B)   D1    
3 term(C)   C1, D1
4 term(D15) E1

CodePudding user response：

Using separate_rows() from tidyr package you can separate the values in term. Then, just filter by your query_list, group by term and use paste0(..., collapse=';') to collapse all values for each term in the same row.

database %>% 
  tidyr::separate_rows(term,sep=";") %>% 
  filter(term %in% query_list) %>%
  group_by(term) %>% 
  summarise(enzyme = paste0(enzyme,collapse=', '))

Output:

# A tibble: 4 x 2
  term      enzyme
  <chr>     <chr> 
1 term(A)   A1, B1 
2 term(B)   D1    
3 term(C)   C1, D1 
4 term(D15) E1

CodePudding user response：

You can transform your database to your desired format

library(dplyr)
library(tidyr)

transformed_database <- database %>%
  separate_rows(term, sep = ';') %>%
  filter(term != '') %>%
  group_by(term) %>%
  summarise(enzyme = paste0(enzyme, collapse = ', '))

transformed_database
#> # A tibble: 9 × 2
#>   term      enzyme            
#>   <chr>     <chr>             
#> 1 term(A)   A1, B1            
#> 2 term(B)   D1                
#> 3 term(C)   C1, D1            
#> 4 term(D15) E1                
#> 5 term(F)   A1, B1, C1, D1, E1
#> 6 term(G)   A1, B1, E1        
#> 7 term(H)   C1, D1, E1        
#> 8 term(K)   A1, B1, C1, E1    
#> 9 term(Y)   A1, B1, C1, D1

Then, querying it is as simple as

transformed_database %>%
  filter(term %in% query_list)

#> # A tibble: 4 × 2
#>   term      enzyme
#>   <chr>     <chr> 
#> 1 term(A)   A1, B1
#> 2 term(B)   D1    
#> 3 term(C)   C1, D1
#> 4 term(D15) E1

CodePudding user response：

We can iterate through the query_list in base R, and use enflame from tibble to make it a dataframe.

library(tibble)

enframe(sapply(query_list, function(x)
  paste(database[grepl(x, strsplit(database$term, ";"), fixed = T), 1], collapse = ", ")),
  name = "term",
  value = "enzyme")

# A tibble: 4 × 2
  term      enzyme
  <chr>     <chr> 
1 term(A)   A1, B1
2 term(B)   D1    
3 term(C)   C1, D1
4 term(D15) E1

CodePudding user response：

My solution is:

database %>% 
  separate_rows(term, sep = ";") %>% 
  filter(term != "") %>%
  filter(term %in% query_list) %>% 
  print() %>% 
  group_by(term) %>% 
  summarise(enzyme = str_c(enzyme, collapse = ", ")) %>% 
  ungroup()

Which results in

# A tibble: 4 × 2
  term      enzyme
  <chr>     <chr> 
1 term(A)   A1, B1
2 term(B)   D1    
3 term(C)   C1, D1
4 term(D15) E1

CodePudding user response：

Other answers already satisfy the OP's condition. An alternative approach, rethinking the input and output, might be to use environments or hashtab and a memoised function. This takes some setting up and wrangling.

x <- unlist(strsplit(database$term, ";"))
keys <- unique(x)
value <- vector("list", length(keys))
ind <- rep(paste0(LETTERS[1:5], "1"), each = 5)
for (i in seq_along(keys)) {
  value[[i]] = ind[which(x == keys[i])]
}
db <- value |> setNames(keys) |> list2env()

grab <- memoise::memoise(Vectorize(function(db, key) db[[key]], c("key")))

The outcome is a pipeable function that will scale quite well for similar queries.

db |>
  grab(query_list)

#> $`term(A)`
#> [1] "A1" "B1"
#> 
#> $`term(B)`
#> [1] "D1"
#> 
#> $`term(C)`
#> [1] "C1" "D1"
#>
#> $`term(D15)`
#> [1] "E1"

For example, what if there were a million items in the query?

large_query <- rep(query_list, 1e6)

bench::mark(
  x = grab(db, large_query)
)

#>  expression   median mem_alloc n_itr total_time
#>  <bch:expr> <bch:tm> <bch:byt> <int>   <bch:tm>
#> 1 x             160ms     140MB     4      643ms