I am trying to create a data frame in R containing indicator variables for whether or not each of a series of data frames contains certain variables.
For instance, suppose I have these three data frames:
lombok:
name color year attend approval
bali:
name color purchases
papua:
name color attend
The resulting data frame would appear as follows (df1, df2, and df3 stand for lombok, bali, and papua):
dataframe name color year attend
df1 TRUE TRUE TRUE TRUE
df2 TRUE TRUE FALSE FALSE
df3 TRUE TRUE FALSE TRUE
In this case, I have selected name, color, year, and attend as the four variables that I want this data frame to report on.
How do I do this?
CodePudding user response:
Collect the column names of each data frame in a list, stack the list into a two-column data.frame, and tabulate it with table:
lst1 <- lapply(mget(ls(pattern = "^(bali|lombok|papua)\\d*$")), names)
table(stack(lst1)[2:1]) > 0
-output
ind approval attend color name purchases year
bali FALSE FALSE TRUE TRUE TRUE FALSE
lombok TRUE TRUE TRUE TRUE FALSE TRUE
papua FALSE TRUE TRUE TRUE FALSE FALSE
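If only the four variables chosen in the question should be reported, one option is to index the logical matrix by column name. This is a minimal sketch, assuming the sample data from this answer. The names `wanted`, `tab`, and `res` are illustrative and not part of the original answer, and the list is built explicitly instead of with mget() so that the block runs on its own:

```r
# Sample data, repeated here so the sketch is self-contained.
lombok <- data.frame(name = "a", color = "red", year = 2015,
                     attend = "yes", approval = "yes")
bali <- data.frame(name = "b", color = "red", purchases = 10)
papua <- data.frame(name = "c", color = "yellow", attend = "yes")

# Column names of each data frame, kept in a named list.
lst1 <- lapply(list(lombok = lombok, bali = bali, papua = papua), names)

# Hypothetical vector of the variables to report on.
wanted <- c("name", "color", "year", "attend")

# Logical matrix: one row per data frame, one column per variable seen.
tab <- table(stack(lst1)[2:1]) > 0

# Keep only the variables of interest, in the requested order.
res <- tab[, wanted]
res
```

Indexing by name fails with a "subscript out of bounds" error if a variable in `wanted` appears in none of the data frames, so this sketch assumes each one occurs at least once.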
Or, using the tidyverse:
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
mget(ls(pattern = "^(bali|lombok|papua)\\d*$")) %>%
map(names) %>%
enframe(name = 'dataframe') %>%
unnest(value) %>%
pivot_wider(names_from = value, values_from = value,
values_fn = list(value = ~ length(.x) > 0), values_fill = FALSE)
-output
# A tibble: 3 × 7
dataframe name color purchases year attend approval
<chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 bali TRUE TRUE TRUE FALSE FALSE FALSE
2 lombok TRUE TRUE FALSE TRUE TRUE TRUE
3 papua TRUE TRUE FALSE FALSE TRUE FALSE
data
lombok <- data.frame(name = 'a', color = 'red', year = 2015,
attend = 'yes', approval = 'yes')
bali <- data.frame(name = 'b', color = 'red', purchases = 10)
papua <- data.frame(name = 'c', color= 'yellow', attend = 'yes')
CodePudding user response:
This works:
# Generating data.
df1 = data.frame("name" = letters, "color" = "blue", "year" = 1986, "attend" = "yes")
df2 = data.frame("name" = letters, color = "blue")
df3 = data.frame("name" = letters, color = "blue", attend = "yes")
# Defining useful list and matrix.
dfs = list(df1, df2, df3) # List storing data frames.
mat = matrix(NA, nrow = length(dfs), ncol = max(sapply(dfs, ncol)))
colnames(mat) = colnames(dfs[[which.max(sapply(dfs, ncol))]])
# Defining useful function.
store.vars = function(dta)
{
# This function takes a data frame as input and checks whether it has certain
# variables (those named in colnames(mat)).
# To be used within sapply().
return(colnames(mat) %in% colnames(dta))
}
final.df = data.frame(t(sapply(dfs, store.vars)))
rownames(final.df) = c("df1", "df2", "df3")
The idea is to put all the data frames into a list and use that list to define a matrix whose column names are the variables of interest. Here those names are taken from the data frame with the most columns, so the approach assumes that this data frame contains every variable you want to report on.
Then we define store.vars(), which uses those column names to check a single data frame and returns a logical vector indicating which of the variables it contains. Applying store.vars() to every data frame in the list with sapply() and transposing the result yields the desired table.
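When the variables of interest are known in advance, as in the question, the helper matrix can be skipped and the names listed directly. The following is a rough sketch of that variant, not the code from the answer above. It uses the question's data frame names with made-up values, and `wanted`, `flags`, and `result` are illustrative names:

```r
# Example data frames with the columns described in the question.
lombok <- data.frame(name = "a", color = "red", year = 2015,
                     attend = "yes", approval = "yes")
bali <- data.frame(name = "b", color = "red", purchases = 10)
papua <- data.frame(name = "c", color = "yellow", attend = "yes")

# Named list, so the data frame names carry through to the result.
dfs <- list(lombok = lombok, bali = bali, papua = papua)

# Hypothetical vector of the variables to report on.
wanted <- c("name", "color", "year", "attend")

# For each data frame, flag which wanted variables it contains.
# sapply() returns one column per data frame, hence the transpose.
flags <- t(sapply(dfs, function(d) wanted %in% names(d)))
colnames(flags) <- wanted

# Assemble the indicator data frame with an explicit name column.
result <- data.frame(dataframe = names(dfs), flags, row.names = NULL)
result
```

Because each variable is tested with %in%, a variable that appears in none of the data frames simply yields a column of FALSE values.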
