I have tried to look for a similar question and I´m sure other people encountered this problem but I still couldn´t find something that helped me. I have a dataset1 with 37.000 observations like this:
id hours
130 12
165 56
250 13
11 15
17 42
and another dataset2 with 38. 000 observations like this:
id hours
130 6
165 23
250 9
11 14
17 11
I want to do the following: if an id of dataset1 is in dataset2, the hours of dataset1 should override the hours of dataset2. For the id´s who are in dataset1 but not in dataset2, the value for dataset2$hours should be NA.
I tried the %in% operator, ifelse(), a loop, and some base R commands but I can´t figure it out. I always get the error that the vectors don´have the same length.
Thanks for any help!
CodePudding user response:
You can replace hours with NAs for id that don't match between df1 and df2. Since both your data sets had the same values for ids, I added one row in df1 with id = 123 and hours = 12.
df1$hours <- replace(df1$hours, is.na(match(df1$id,df2$id)), NA)
df1
id hours
1 130 12
2 165 56
3 250 13
4 11 15
5 17 42
6 123 NA
data
df1 <- structure(list(id = c(130L, 165L, 250L, 11L, 17L, 123L), hours = c(12L,
56L, 13L, 15L, 42L, NA)), row.names = c(NA, -6L), class = "data.frame")
id hours
1 130 12
2 165 56
3 250 13
4 11 15
5 17 42
6 123 12
df2 <- structure(list(id = c(130L, 165L, 250L, 11L, 17L), hours = c(6L,
23L, 9L, 14L, 11L)), class = "data.frame", row.names = c(NA,
-5L))
CodePudding user response:
First match ID's of replacement data with ID's of original data while using na.omit() for the case when replacement ID's are not contained in original data. Replace with replacement data whose ID's are in original ID's.
I expanded both data sets to fabricate cases with no matches.
dat1
# id hours
# 1 130 12
# 2 165 56
# 3 250 13
# 4 11 15
# 5 17 42
# 6 12 232
# 7 35 456
dat2
# id hours
# 1 11 14
# 2 17 11
# 3 165 23
# 4 999 99
# 5 130 6
# 6 250 9
Replacement
dat1[na.omit(match(dat2$id, dat1$id)), ]$hours <-
dat2[dat2$id %in% dat1$id, ]$hours
dat1
# id hours
# 1 130 6
# 2 165 23
# 3 250 9
# 4 11 14
# 5 17 11
# 6 12 232
# 7 35 456
Data:
dat1 <- structure(list(id = c(130L, 165L, 250L, 11L, 17L, 12L, 35L),
hours = c(12L, 56L, 13L, 15L, 42L, 232L, 456L)), class = "data.frame", row.names = c(NA,
-7L))
dat2 <- structure(list(id = c(11L, 17L, 165L, 999L, 130L, 250L), hours = c(14L,
11L, 23L, 99L, 6L, 9L)), class = "data.frame", row.names = c(NA,
-6L))
