I have two dataframes, and want to add values from the 2nd one to the 1st one according to string values, but use partial string matching if there is a space
df1:
cat
small dog
apple
df2:
cat 24
small 5
dog 400
apple 83
pear 55
I normally use "left_join" from tidyverse, which would be
df3 <- left_join(df1, df2, by="column_name")
df3:
cat 24
small dog NA
apple 83
but this means that "small dog" has a missing value. What I want to do this time is find the value for either "small" or "dog", and input whichever is bigger. I'm not able to find a function that will tell R to look separately before or after the space though
CodePudding user response:
Another possible solution, based on inner_join:
library(tidyverse)
df1 %>%
mutate(spaces = row_number()*str_detect(column_name, " ")) %>%
separate_rows(column_name, sep = " ") %>%
inner_join(df2, by="column_name") %>%
group_by(spaces) %>%
mutate(col2 = if_else(spaces > 0, max(col2), col2),
column_name = if_else(spaces > 0, str_c(column_name, collapse = " "),
column_name)) %>%
ungroup %>% distinct %>% select(-spaces)
#> # A tibble: 3 × 2
#> column_name col2
#> <chr> <dbl>
#> 1 cat 24
#> 2 small dog 400
#> 3 apple 83
CodePudding user response:
We may use regex_left_join from fuzzyjoin and then do a group by to summarise the second column with max values
library(dplyr)
library(fuzzyjoin)
regex_left_join(df1, df2, by = "column_name") %>%
group_by(column_name = column_name.x) %>%
summarise(col2 = max(col2))
-output
# A tibble: 3 × 2
column_name col2
<chr> <dbl>
1 apple 83
2 cat 24
3 small dog 400
data
df1 <- structure(list(column_name = c("cat", "small dog", "apple")),
class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(column_name = c("cat", "small", "dog", "apple",
"pear"), col2 = c(24, 5, 400, 83, 55)), class = "data.frame", row.names = c(NA,
-5L))
