Find specific value in nested data frames and get position and/or value

I have a nested list sampleList that can contain a variable number of data frames. In this example there are 3 data frames:

df1 <- data.frame(id = as.integer(c(1, 6)), key = c('apple', 'apple.green'), stringsAsFactors=FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c('apple', 'apple.red', 'apple.red.rotten'), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(c(17)), key = c('orange'), stringsAsFactors = FALSE)
sampleList <- list(df1, df2, df3)

I want to search for specific integers, e.g. 6, in the id column across all data frames contained in sampleList. As a result, I need the position and, if possible, the associated value from the key column.

The closest I got was the position within one specific data frame, e.g. the first:

which(sampleList[[1]] == 6)
[1] 2

Since the number of data frames can be different each time, I need a more dynamic query.

Thanks a lot for your help.

CodePudding user response:

If you work with nested data, I recommend watching "Hadley Wickham: Managing many models with R" on YouTube; you'll be impressed with how useful the approach is. After that, look at the example by Laurens Geffert (search for "Nesting Birds and Models in R Dataframes").

I recommend using tibbles for nicer output, but since the question uses the data.frame format, I have commented out the coercion to tibbles.

Explanation 1: using dplyr logic with the pipe, map() takes each object (data frame) from the list and applies the same filter you would apply to a single data frame. The tilde (~) is purrr's shorthand for an anonymous function, with . standing for the current list element. This approach is more practical if your goal is to operate on the data frames while keeping them as separate objects.

library(tidyr)
library(dplyr)
library(purrr)

df1 <- data.frame(id = as.integer(c(1, 6)), key = c('apple', 'apple.green'), stringsAsFactors=FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c('apple', 'apple.red', 'apple.red.rotten'), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(c(17)), key = c('orange'), stringsAsFactors = FALSE)

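# lst() names each element after the object it was built from
lt <- lst(df1, # %>% as_tibble(.),
          df2, # %>% as_tibble(.),
          df3  # %>% as_tibble(.)
          )

# keep only the rows with id == 6 in every data frame
lt %>% map(~ filter(., id == 6))
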
# $df1
# id         key
# 1  6 apple.green
# 
# $df2
# [1] id  key
# <0 rows> (or 0-length row.names)
# 
# $df3
# [1] id  key
# <0 rows> (or 0-length row.names)
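
For comparison, here is a minimal base R sketch of the same per-data-frame filter, assuming the three data frames from the question. Note that in the output above the match is printed as row 1 even though it is row 2 of df1; base subsetting keeps the original row name, so this version also reveals the position of the match.

```r
# Data from the question, collected in a named list
df1 <- data.frame(id = as.integer(c(1, 6)), key = c("apple", "apple.green"), stringsAsFactors = FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c("apple", "apple.red", "apple.red.rotten"), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(17), key = "orange", stringsAsFactors = FALSE)
lt <- list(df1 = df1, df2 = df2, df3 = df3)

# Keep only the rows with id == 6 in every data frame;
# drop = FALSE makes sure the result is always a data frame
hits <- lapply(lt, function(d) d[d$id == 6, , drop = FALSE])

hits$df1            # the matching row, still labelled with its original row name
rownames(hits$df1)  # the position of the match within df1
nrow(hits$df2)      # 0 when a data frame has no match
```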

The next example answers your question about getting the positions and values out.

Explanation 2: using lapply, we can get the matching positions in each data frame, or the corresponding values of the key column. I suspect you are looking to manipulate multiple data frames simultaneously; if not, and you just want the locations per data frame (i.e., getting your hands dirty), grab the positions with classic base R logic inside lapply.

# row positions, per list element, where id == 6
lapply(lt, function(x) which(x$id == 6))

# values of column key, per list element, where id == 6
lapply(lt, function(x) x$key[which(x$id == 6)])
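
If both the position and the key are wanted at once, the two lapply calls can be folded into one. The sketch below assumes the named list of data frames from the question; the helper name find_id is made up for illustration and is not part of any package.

```r
df1 <- data.frame(id = as.integer(c(1, 6)), key = c("apple", "apple.green"), stringsAsFactors = FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c("apple", "apple.red", "apple.red.rotten"), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(17), key = "orange", stringsAsFactors = FALSE)
lt <- list(df1 = df1, df2 = df2, df3 = df3)

# find_id() is a hypothetical helper: for each data frame it returns the
# row positions where id equals `value`, together with the matching keys
find_id <- function(dfs, value) {
  lapply(dfs, function(d) {
    pos <- which(d$id == value)
    data.frame(pos = pos, key = d$key[pos], stringsAsFactors = FALSE)
  })
}

res <- find_id(lt, 6)
res$df1  # position and key of the match in df1
res$df2  # a zero-row data frame, since df2 has no match
```

Returning a zero-row data frame for non-matching elements keeps the result the same length as the input list, so element names still line up with the original data frames.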

CodePudding user response:

I have slightly altered the data, adding 6 to df3.

df1 <- data.frame(id = as.integer(c(1, 6)), key = c('apple', 'apple.green'), stringsAsFactors=FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c('apple', 'apple.red', 'apple.red.rotten'), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(c(6, 17)), key = c('orange', 'blue'), stringsAsFactors = FALSE)
sampleList <- list(df1, df2, df3)

tidyverse

library(tidyverse)
imap_dfr(sampleList,
         ~ mutate(.x, pos = 1:n(), dfr = .y) %>%
           filter(id == 6)) %>%
  when(!!nrow(.) ~., ~0)


#>  id         key pos dfr
#> 1  6 apple.green   2   1
#> 2  6      orange   1   3

Explanation: using purrr, we can access the list indices within the lambda function through .y. The _dfr suffix row-binds the per-element results into a single data frame. when or {if (!nrow(.)) 0 else .} can be used to conditionally return 0 if no values were found. The . is the placeholder dot in the magrittr pipe.

base R

Filter(nrow,
       lapply(sampleList, subset, id == 6)
)

#> [[1]]
#>   id         key
#> 2  6 apple.green
#>
#> [[2]]
#>   id    key
#> 1  6 orange

Explanation: we first subset each list element by the criterion, then use Filter to drop the elements with zero rows. Filter coerces the result of nrow to logical, so 0 becomes FALSE and any positive count becomes TRUE.
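
The coercion that Filter relies on can be checked directly. This small sketch uses the altered sampleList from this answer and suggests that passing nrow is equivalent to an explicit nrow(d) > 0 test; the variable names are only illustrative.

```r
df1 <- data.frame(id = as.integer(c(1, 6)), key = c("apple", "apple.green"), stringsAsFactors = FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c("apple", "apple.red", "apple.red.rotten"), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(c(6, 17)), key = c("orange", "blue"), stringsAsFactors = FALSE)
sampleList <- list(df1, df2, df3)

subsets <- lapply(sampleList, subset, id == 6)

# Row counts per data frame; a count of 0 is coerced to FALSE,
# any positive count to TRUE
counts <- vapply(subsets, nrow, integer(1))
as.logical(counts)

# Filter(nrow, ...) therefore keeps the same elements as an explicit test
implicit <- Filter(nrow, subsets)
explicit <- Filter(function(d) nrow(d) > 0, subsets)
identical(implicit, explicit)
```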

To extract the positions (subset keeps the original row names of the data.frames, so with default row names they record the row positions):

Filter(nrow, 
       lapply(sampleList, subset, id == 6)
) |>
  lapply(\(x) as.integer(rownames(x)))

To make it clear in which data.frame matches were found:

Filter(nrow,
       lapply(sampleList, subset, id == 6) |>
         setNames(seq_along(sampleList)) # swap to appropriate naming policy
) |>
  lapply(\(x) as.integer(rownames(x)))
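
If a single table is easier to work with than a list, the matches can also be stacked into one data frame in base R. The sketch below assumes the altered sampleList from this answer; the column names pos and dfr simply mirror the tidyverse output above, and find_matches is a hypothetical helper name.

```r
df1 <- data.frame(id = as.integer(c(1, 6)), key = c("apple", "apple.green"), stringsAsFactors = FALSE)
df2 <- data.frame(id = as.integer(c(1, 3, 5)), key = c("apple", "apple.red", "apple.red.rotten"), stringsAsFactors = FALSE)
df3 <- data.frame(id = as.integer(c(6, 17)), key = c("orange", "blue"), stringsAsFactors = FALSE)
sampleList <- list(df1, df2, df3)

# find_matches() is a hypothetical helper: it walks over the list together
# with its indices and builds one row per match
find_matches <- function(dfs, value) {
  pieces <- Map(function(d, i) {
    pos <- which(d$id == value)
    data.frame(id  = d$id[pos],
               key = d$key[pos],
               pos = pos,                    # row position within the data frame
               dfr = rep(i, length(pos)),    # index of the data frame in the list
               stringsAsFactors = FALSE)
  }, dfs, seq_along(dfs))
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

res <- find_matches(sampleList, 6)
res
```

Because the list is traversed with seq_along, the helper works for any number of data frames, which is the dynamic behaviour the question asks for.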