I have two data frames, df1 and df2 of identical structure. The first three columns, id, form, and instance identify the participant and form(s). The remaining variable columns, var1, var2, and var3, contain analytic data and are largely identical except for a few slight discrepancies.
I have a third data frame, map that identifies the id, form(s), and variables that have discrepancies, but does not contain the values of these discrepancies. I would like to use the map data to create a final data frame, df.final that appends two columns and places the discrepant values from df1 and df2.
Below I am providing sample data as well as a clunky for loop that creates the desired df.final, but it is very slow (takes several hours to run on the full dataset) - so slow in fact it is functionally unusable (this ideally needs to be updated in near real time). I am hoping someone better at coding than me can provide alternative, faster solutions.
(Note that given the simplicity of the example data, there are alternate ways to compare df1 and df2, but assume these don't work and using map is the only option.)
# Example data, df1
df1 <- data.frame(id = rep(sprintf("K00%s", 0:9), each = 3),
form = rep(c("A","B", "B"), times = 10),
instance = rep(c("None", "1", "2"), times = 10),
var1 = sample(LETTERS, 30, replace = TRUE),
var2 = rnbinom(30, mu = 1, size = 0.02),
var3 = sample(c("Apples", "Oranges", "Pears"), 30, replace = TRUE))
# Sample data df2, same as df1 but with slight discrepancies
df2 <- df1
df2[15, 4] <- "A"
df2[c(4, 6, 8), 5] <- c(11,15,16)
df2[27:28, 6] <- "Bannanas"
# Example "Map" that only indicates what ID, form, instance, and variable is discrepant
map <- data.frame(id = c("K004", "K001", "K001", "K002", "K008", "K009"),
form = c("B","A", "B", "B", "B", "A"),
instance = c("2", "None", "2", "1", "2", "None"),
variable = c("var1", rep("var2", 3), "var3", "var3"))
# id form instance variable
# 1 K004 B 2 var1
# 2 K001 A None var2
# 3 K001 B 2 var2
# 4 K002 B 1 var2
# 5 K008 B 2 var3
# 6 K009 A None var3
## - - - - - - - - - - - - - - - - - - -
# Attempt, but VERY slow in full data
## - - - - - - - - - - - - - - - - - - -
df.final <- data.frame(matrix(NA, ncol = 6))
for (i in 1:nrow(map)){
keepcols <- c("id", "form","instance", map[i,4])
m1 <- merge(map[i,], df1[ , keepcols], by = keepcols[-4])
m2 <- merge(m1, df2[, keepcols], by = keepcols[-4])
df.final[i,] <- m2
}
names(df.final) <- c(names(map), "df1_entry","df2_entry")
df.final
# id form instance variable df1_entry df2_entry
# 1 K004 B 2 var1 P A
# 2 K001 A None var2 0 11
# 3 K001 B 2 var2 0 15
# 4 K002 B 1 var2 0 16
# 5 K008 B 2 var3 Pears Bannanas
# 6 K009 A None var3 Apples Bannanas
CodePudding user response:
I didn't use your map frame, so my apologies if I have not answered your question, but I didn't think you needed it. I believe this does what you want:
inner_join(
df1 %>% mutate(across(starts_with("var"),as.character)) %>% pivot_longer(cols=var1:var3,values_to = "df1"),
df2 %>% mutate(across(starts_with("var"),as.character)) %>% pivot_longer(cols=var1:var3, values_to="df2"),
on=c("id","form","instance","name")
) %>% filter(df1!=df2)
Output:
id form instance name df1 df2
<chr> <chr> <chr> <chr> <chr> <chr>
1 K001 A None var2 0 11
2 K001 B 2 var2 0 15
3 K002 B 1 var2 0 16
4 K004 B 2 var1 J A
5 K008 B 2 var3 Apples Bannanas
6 K009 A None var3 Apples Bannanas
If you absolutely have to use map, but you have access to df1 and df2, you can do this:
map %>%
inner_join(
df1 %>%
mutate(across(starts_with("var"),as.character)) %>%
pivot_longer(cols=var1:var3,values_to = "df1",names_to="variable"),
by=c("id","form","instance","variable")
) %>%
inner_join(
df2 %>%
mutate(across(starts_with("var"),as.character)) %>%
pivot_longer(cols=var1:var3,values_to = "df2",names_to="variable"),
by=c("id","form","instance","variable")
)
)
CodePudding user response:
In case this might be useful, you can do the following in data.table:
library(data.table)
setDT(df1)
setDT(df2)
merge(
melt(df1, 1:3),
melt(df2, 1:3),
by = c("id", "form", "instance", "variable")
)[value.x != value.y]
Output
id form instance variable value.x value.y
1: K001 A None var2 1 11
2: K001 B 2 var2 22 15
3: K002 B 1 var2 0 16
4: K004 B 2 var1 S A
5: K008 B 2 var3 Oranges Bannanas
6: K009 A None var3 Pears Bannanas
