I have a list of data.frames where certain data.frame elements may have all their columns and their values contained within another data.frame element.
Here's an example list:
df.list <- list(df1 = data.frame(id = rep("i1",6), name = c("t1_1","t1_2","t1_3","b_1","b_2","b_3"), replicate = rep(1:3,2), condition = c(rep("t1",3),rep("b",3)), y = c(0.5,0.6,0.2,0.2,0.1,0.05)),
df2 = data.frame(id = rep("i1",6), name = c("t2_1","t2_2","t2_3","b_1","b_2","b_3"), replicate = rep(1:3,2), condition = c(rep("t2",3),rep("b",3)), y = c(0.8,0.9,0.7,0.2,0.1,0.05)),
df3 = data.frame(id = rep("i1",6), name = c("t1_1","t1_2","t1_3","b_1","b_2","b_3"), replicate = rep(1:3,2), age = rep(c(10,20,30),2), condition = c(rep("t1",3),rep("b",3)), y = c(0.5,0.6,0.2,0.2,0.1,0.05)),
df4 = data.frame(id = rep("i1",6), name = c("t2_1","t2_2","t2_3","b_1","b_2","b_3"), replicate = rep(1:3,2), age = rep(c(10,20,30),2), condition = c(rep("t2",3),rep("b",3)), y = c(0.8,0.9,0.7,0.2,0.1,0.05)))
So df.list$df1 is contained within df.list$df3 and df.list$df2 is contained within df.list$df4, because df.list$df3 and df.list$df4 have the age column which df.list$df1 and df.list$df2 do not have.
I would like to Reduce this list into a non-redundant data.frame.
list(unique(Reduce(rbind,df.list)))
wouldn't work because the data.frame elements have different columns (the age column in df.list$df3 and df.list$df4), so I'm looking for something equivalent that will detect that df.list$df1 and df.list$df2 are contained (and therefore redundant) within df.list$df3 and df.list$df4, respectively.
In the case of the example above the resulting data.frame will be:
unique(rbind(df.list$df3, df.list$df4))
CodePudding user response:
If you are intending to reduce out the NA values in age, thinking that they will auto-update, then try this:
Reduce(function(a, b) {
out <- merge(a, b,
by = setdiff(intersect(names(a), names(b)), c("age","y")), suffixes = c("", ".z"),
all = TRUE)
dotz <- grep("\\.z$", names(out), value = TRUE)
noz <- gsub("\\.z$", "", dotz)
dotz <- dotz[noz %in% names(out)]
noz <- noz[noz %in% names(out)]
out[noz] <- Map(function(a, b) ifelse(is.na(a), b, a), out[noz], out[dotz])
out[setdiff(names(out), dotz)]
}, df.list)
# id name replicate condition y age
# 1 i1 b_1 1 b 0.20 10
# 2 i1 b_2 2 b 0.10 20
# 3 i1 b_3 3 b 0.05 30
# 4 i1 t1_1 1 t1 0.50 10
# 5 i1 t1_2 2 t1 0.60 20
# 6 i1 t1_3 3 t1 0.20 30
# 7 i1 t2_1 1 t2 0.80 10
# 8 i1 t2_2 2 t2 0.90 20
# 9 i1 t2_3 3 t2 0.70 30
Note that when both values of y (for example) are non-NA, this process will silently drop the second (and subsequent) non-NA values.
CodePudding user response:
As r2evans mentioned in the coments, Reduce() is actually not the best way to go here.
df.list |>
seq_along() |>
# for each data frame, check if every element is also element in one of the others
lapply(\(x) lapply(df.list[-x], \(y) all(df.list[[x]] %in% y))) |>
sapply(\(x) unlist(x) |> Negate(any)()) |>
{\(.) df.list[.]}() |>
do.call(what = rbind)
#> id name replicate age condition y
#> df3.1 i1 t1_1 1 10 t1 0.50
#> df3.2 i1 t1_2 2 20 t1 0.60
#> df3.3 i1 t1_3 3 30 t1 0.20
#> df3.4 i1 b_1 1 10 b 0.20
#> df3.5 i1 b_2 2 20 b 0.10
#> df3.6 i1 b_3 3 30 b 0.05
#> df4.1 i1 t2_1 1 10 t2 0.80
#> df4.2 i1 t2_2 2 20 t2 0.90
#> df4.3 i1 t2_3 3 30 t2 0.70
#> df4.4 i1 b_1 1 10 b 0.20
#> df4.5 i1 b_2 2 20 b 0.10
#> df4.6 i1 b_3 3 30 b 0.05
Created on 2022-02-01 by the reprex package (v2.0.1)
