Reduce a list of data.frames to a non-redundant data.frame-CodePudding

I have a list of data.frames where certain data.frame elements may have all their columns and their values contained within another data.frame element.

Here's an example list:

df.list <- list(df1 = data.frame(id = rep("i1",6), name = c("t1_1","t1_2","t1_3","b_1","b_2","b_3"), replicate = rep(1:3,2), condition = c(rep("t1",3),rep("b",3)), y = c(0.5,0.6,0.2,0.2,0.1,0.05)),
                df2 = data.frame(id = rep("i1",6), name = c("t2_1","t2_2","t2_3","b_1","b_2","b_3"), replicate = rep(1:3,2), condition = c(rep("t2",3),rep("b",3)), y = c(0.8,0.9,0.7,0.2,0.1,0.05)),
                df3 = data.frame(id = rep("i1",6), name = c("t1_1","t1_2","t1_3","b_1","b_2","b_3"), replicate = rep(1:3,2), age = rep(c(10,20,30),2), condition = c(rep("t1",3),rep("b",3)), y = c(0.5,0.6,0.2,0.2,0.1,0.05)),
                df4 = data.frame(id = rep("i1",6), name = c("t2_1","t2_2","t2_3","b_1","b_2","b_3"), replicate = rep(1:3,2), age = rep(c(10,20,30),2), condition = c(rep("t2",3),rep("b",3)), y = c(0.8,0.9,0.7,0.2,0.1,0.05)))

So df.list$df1 is contained within df.list$df3 and df.list$df2 is contained within df.list$df4, because df.list$df3 and df.list$df4 have the age column which df.list$df1 and df.list$df2 do not have.

I would like to Reduce this list into a non-redundant data.frame.

list(unique(Reduce(rbind,df.list)))

wouldn't work because the data.frame elements have different columns (the age column in df.list$df3 and df.list$df4), so I'm looking for something equivalent that will detect that df.list$df1 and df.list$df2 are contained (and therefore redundant) within df.list$df3 and df.list$df4, respectively.

In the case of the example above the resulting data.frame will be:

unique(rbind(df.list$df3, df.list$df4))

CodePudding user response：

If you are intending to reduce out the NA values in age, thinking that they will auto-update, then try this:

Reduce(function(a, b) {
  out <- merge(a, b,
    by = setdiff(intersect(names(a), names(b)), c("age","y")), suffixes = c("", ".z"),
    all = TRUE)
  dotz <- grep("\\.z$", names(out), value = TRUE)
  noz <- gsub("\\.z$", "", dotz)
  dotz <- dotz[noz %in% names(out)]
  noz <- noz[noz %in% names(out)]
  out[noz] <- Map(function(a, b) ifelse(is.na(a), b, a), out[noz], out[dotz])
  out[setdiff(names(out), dotz)]
}, df.list)
#   id name replicate condition    y age
# 1 i1  b_1         1         b 0.20  10
# 2 i1  b_2         2         b 0.10  20
# 3 i1  b_3         3         b 0.05  30
# 4 i1 t1_1         1        t1 0.50  10
# 5 i1 t1_2         2        t1 0.60  20
# 6 i1 t1_3         3        t1 0.20  30
# 7 i1 t2_1         1        t2 0.80  10
# 8 i1 t2_2         2        t2 0.90  20
# 9 i1 t2_3         3        t2 0.70  30

Note that when both values of y (for example) are non-NA, this process will silently drop the second (and subsequent) non-NA values.

CodePudding user response：

As r2evans mentioned in the coments, Reduce() is actually not the best way to go here.

df.list |> 
  seq_along() |>
  # for each data frame, check if every element is also element in one of the others
  lapply(\(x) lapply(df.list[-x], \(y) all(df.list[[x]] %in% y))) |> 
  sapply(\(x) unlist(x) |> Negate(any)()) |> 
  {\(.) df.list[.]}() |> 
  do.call(what = rbind)
#>       id name replicate age condition    y
#> df3.1 i1 t1_1         1  10        t1 0.50
#> df3.2 i1 t1_2         2  20        t1 0.60
#> df3.3 i1 t1_3         3  30        t1 0.20
#> df3.4 i1  b_1         1  10         b 0.20
#> df3.5 i1  b_2         2  20         b 0.10
#> df3.6 i1  b_3         3  30         b 0.05
#> df4.1 i1 t2_1         1  10        t2 0.80
#> df4.2 i1 t2_2         2  20        t2 0.90
#> df4.3 i1 t2_3         3  30        t2 0.70
#> df4.4 i1  b_1         1  10         b 0.20
#> df4.5 i1  b_2         2  20         b 0.10
#> df4.6 i1  b_3         3  30         b 0.05

^{Created on 2022-02-01 by the reprex package (v2.0.1)}