I want to query a database using a character vector query_list, and return a dataframe query_output. In this two-column output dataframe, each row corresponds to a single string from the query vector. First of the two columns in the output dataframe (called term) names this string, and the second column (called enzyme) lists all rows of the database in which the query string was found, defined by the column enzyme from the database).
My query and database look as follows:
query_list <- c("term(A)", "term(B)", "term(C)", "term(D15)")
database <- data.frame(enzyme = c("A1", "B1", "C1", "D1", "E1")
,term = c("term(A);term(K);term(Y);term(G);term(F);"
,"term(A);term(K);term(Y);term(G);term(F);"
,"term(H);term(K);term(Y);term(C);term(F);"
,"term(H);term(B);term(Y);term(C);term(F);"
,"term(H);term(K);term(D15);term(G);term(F);"))
the database looks like this:
enzyme term
1 A1 term(A);term(K);term(Y);term(G);term(F);
2 B1 term(A);term(K);term(Y);term(G);term(F);
3 C1 term(H);term(K);term(Y);term(C);term(F);
4 D1 term(H);term(B);term(Y);term(C);term(F);
5 E1 term(H);term(K);term(D15);term(G);term(F);
The resulting dataframe query_output:
> query_output
term enzyme
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
Optimally, the solution would be pipeable, and not a loop (although anything will be appreciated). I don't say what I have tried because I don't really know how to go about it in a concise way.
CodePudding user response:
Try this:
library(tidyr)
database %>%
mutate(term = sub(";$", "", term)) %>%
separate_rows(term, sep = ";") %>%
filter(term %in% query_list) %>%
group_by(term) %>%
summarise(enzyme = toString(enzyme))
# A tibble: 4 × 2
term enzyme
<chr> <chr>
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
CodePudding user response:
Using separate_rows() from tidyr package you can separate the values in term. Then, just filter by your query_list, group by term and use paste0(..., collapse=';') to collapse all values for each term in the same row.
database %>%
tidyr::separate_rows(term,sep=";") %>%
filter(term %in% query_list) %>%
group_by(term) %>%
summarise(enzyme = paste0(enzyme,collapse=', '))
Output:
# A tibble: 4 x 2
term enzyme
<chr> <chr>
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
CodePudding user response:
You can transform your database to your desired format
library(dplyr)
library(tidyr)
transformed_database <- database %>%
separate_rows(term, sep = ';') %>%
filter(term != '') %>%
group_by(term) %>%
summarise(enzyme = paste0(enzyme, collapse = ', '))
transformed_database
#> # A tibble: 9 × 2
#> term enzyme
#> <chr> <chr>
#> 1 term(A) A1, B1
#> 2 term(B) D1
#> 3 term(C) C1, D1
#> 4 term(D15) E1
#> 5 term(F) A1, B1, C1, D1, E1
#> 6 term(G) A1, B1, E1
#> 7 term(H) C1, D1, E1
#> 8 term(K) A1, B1, C1, E1
#> 9 term(Y) A1, B1, C1, D1
Then, querying it is as simple as
transformed_database %>%
filter(term %in% query_list)
#> # A tibble: 4 × 2
#> term enzyme
#> <chr> <chr>
#> 1 term(A) A1, B1
#> 2 term(B) D1
#> 3 term(C) C1, D1
#> 4 term(D15) E1
CodePudding user response:
We can iterate through the query_list in base R, and use enflame from tibble to make it a dataframe.
library(tibble)
enframe(sapply(query_list, function(x)
paste(database[grepl(x, strsplit(database$term, ";"), fixed = T), 1], collapse = ", ")),
name = "term",
value = "enzyme")
# A tibble: 4 × 2
term enzyme
<chr> <chr>
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
CodePudding user response:
My solution is:
database %>%
separate_rows(term, sep = ";") %>%
filter(term != "") %>%
filter(term %in% query_list) %>%
print() %>%
group_by(term) %>%
summarise(enzyme = str_c(enzyme, collapse = ", ")) %>%
ungroup()
Which results in
# A tibble: 4 × 2
term enzyme
<chr> <chr>
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
CodePudding user response:
Other answers already satisfy the OP's condition. An alternative approach, rethinking the input and output, might be to use environments or hashtab and a memoised function. This takes some setting up and wrangling.
x <- unlist(strsplit(database$term, ";"))
keys <- unique(x)
value <- vector("list", length(keys))
ind <- rep(paste0(LETTERS[1:5], "1"), each = 5)
for (i in seq_along(keys)) {
value[[i]] = ind[which(x == keys[i])]
}
db <- value |> setNames(keys) |> list2env()
grab <- memoise::memoise(Vectorize(function(db, key) db[[key]], c("key")))
The outcome is a pipeable function that will scale quite well for similar queries.
db |>
grab(query_list)
#> $`term(A)`
#> [1] "A1" "B1"
#>
#> $`term(B)`
#> [1] "D1"
#>
#> $`term(C)`
#> [1] "C1" "D1"
#>
#> $`term(D15)`
#> [1] "E1"
For example, what if there were a million items in the query?
large_query <- rep(query_list, 1e6)
bench::mark(
x = grab(db, large_query)
)
#> expression median mem_alloc n_itr total_time
#> <bch:expr> <bch:tm> <bch:byt> <int> <bch:tm>
#> 1 x 160ms 140MB 4 643ms
