Creating a data frame to track variable names in other data frames

I am trying to create a data frame in R that holds indicator variables showing whether each of a series of data frames contains certain variables.

For instance, suppose I have these three data frames:

lombok:

name     color     year     attend   approval

bali:

name     color    purchases

papua:

name     color   attend

The resulting data frame would appear as follows:

dataframe   name    color    year    attend
lombok      TRUE    TRUE     TRUE    TRUE
bali        TRUE    TRUE     FALSE   FALSE
papua       TRUE    TRUE     FALSE   TRUE

In this case, I have selected name, color, year, and attend as the four variables that I want the data frame to report on.

How do I do this?

CodePudding user response：

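Collect the column names of each data frame in a list, stack() that list into a two-column data.frame, and tabulate it with table(). Comparing the counts with 0 turns them into TRUE/FALSE indicators.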

lst1 <- lapply(mget(ls(pattern = "^(bali|lombok|papua)\\d*$")), names)
table(stack(lst1)[2:1]) > 0

-output

ind      approval attend color name purchases  year
  bali      FALSE  FALSE  TRUE TRUE      TRUE FALSE
  lombok     TRUE   TRUE  TRUE TRUE     FALSE  TRUE
  papua     FALSE   TRUE  TRUE TRUE     FALSE FALSE
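
The table above reports every variable found in any of the data frames, while the question asks about four selected ones only. A minimal sketch of narrowing the result down, assuming the three example data frames from the "data" section below; the names vars, sel and result are introduced here for illustration and are not part of the original answer. The list is built explicitly instead of through mget() so that the rows come out in a known order.

```r
lombok <- data.frame(name = 'a', color = 'red', year = 2015,
     attend = 'yes', approval = 'yes')
bali <- data.frame(name = 'b', color = 'red', purchases = 10)
papua <- data.frame(name = 'c', color = 'yellow', attend = 'yes')

# Column names per data frame, kept in a named list.
lst1 <- lapply(list(lombok = lombok, bali = bali, papua = papua), names)

# Variables of interest (taken from the question).
vars <- c("name", "color", "year", "attend")

# Same tabulation as above, then keep only the selected columns.
tab <- table(stack(lst1)[2:1]) > 0
sel <- unclass(tab)[, vars, drop = FALSE]

# Move the data frame names from the row names into a regular column.
result <- data.frame(dataframe = rownames(sel), sel)
rownames(result) <- NULL
result
```

Note that subsetting with tab[, vars] fails with a "subscript out of bounds" error if one of the selected variables does not occur in any data frame, since table() only creates columns for names it has seen.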

Or, using the tidyverse:

library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
mget(ls(pattern = "^(bali|lombok|papua)\\d*$")) %>% 
  map(names) %>% 
  enframe(name = 'dataframe') %>%
  unnest(value) %>%
  pivot_wider(names_from = value, values_from = value,
    values_fn = list(value = ~ length(.x) > 0), values_fill = FALSE)

-output

# A tibble: 3 × 7
  dataframe name  color purchases year  attend approval
  <chr>     <lgl> <lgl> <lgl>     <lgl> <lgl>  <lgl>   
1 bali      TRUE  TRUE  TRUE      FALSE FALSE  FALSE   
2 lombok    TRUE  TRUE  FALSE     TRUE  TRUE   TRUE    
3 papua     TRUE  TRUE  FALSE     FALSE TRUE   FALSE

data

lombok <- data.frame(name = 'a', color = 'red', year = 2015,
     attend = 'yes', approval = 'yes')
bali <- data.frame(name = 'b', color = 'red', purchases = 10)
papua <- data.frame(name = 'c', color = 'yellow', attend = 'yes')

CodePudding user response：

This works:

# Generating data.
df1 = data.frame("name" = letters, "color" = "blue", "year" = 1986, "attend" = "yes")
df2 = data.frame("name" = letters, color = "blue")
df3 = data.frame("name" = letters, color = "blue", attend = "yes")

# Defining useful list and matrix.
dfs = list(df1, df2, df3) # List storing data frames.
mat = matrix(NA, nrow = length(dfs), ncol = max(sapply(dfs, ncol)))
colnames(mat) = colnames(dfs[[which.max(sapply(dfs, ncol))]])

# Defining useful function.
store.vars = function(dta)
{
  # This function takes a data frame as input and checks whether it has
  # each of the variables named in colnames(mat).

  # To be used within sapply().

  return(colnames(mat) %in% colnames(dta))
}

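final.df = data.frame(t(sapply(dfs, store.vars)))
# sapply() returns an unnamed matrix, so restore the variable names;
# otherwise the columns would be called X1, X2, ...
colnames(final.df) = colnames(mat)
rownames(final.df) = c("df1", "df2", "df3")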

The idea is to put all the data frames into a list and use that list to define a matrix whose column names are the variables of interest. Here those names are taken from the data frame with the most columns, so this only works as intended when that data frame contains every variable you want to report on.

Then we define store.vars(), which uses the matrix's column names to check a single data frame and returns a logical vector saying which of those variables it contains. Applying store.vars() to every element of the list with sapply() and transposing the result gives one row per data frame.
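
If the variables of interest are known up front, as in the question, they can be listed directly instead of being derived from the widest data frame. A minimal sketch of that variant, assuming the same df1, df2 and df3 as above; vars, flags and final are names made up for this example, and vapply() is swapped in for sapply() so that the shape of each result is checked.

```r
df1 <- data.frame(name = letters, color = "blue", year = 1986, attend = "yes")
df2 <- data.frame(name = letters, color = "blue")
df3 <- data.frame(name = letters, color = "blue", attend = "yes")

# Variables to report on, stated explicitly (taken from the question).
vars <- c("name", "color", "year", "attend")

# A named list, so the data frame names carry through to the result.
dfs <- list(df1 = df1, df2 = df2, df3 = df3)

# One logical vector of length(vars) per data frame; vapply() raises an
# error if any call returns something of a different type or length.
flags <- vapply(dfs, function(dta) vars %in% colnames(dta),
                logical(length(vars)))

# flags has one column per data frame, so transpose it and label the columns.
final <- as.data.frame(t(flags))
colnames(final) <- vars
final
```

Because the list is named, the row names of final are df1, df2 and df3 without a separate rownames() assignment.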