How to match corresponding values to part of string (before and after space)?-CodePudding

I have two dataframes, and want to add values from the 2nd one to the 1st one according to string values, but use partial string matching if there is a space

df1:
cat
small dog
apple

df2:
cat  24
small 5
dog  400
apple 83
pear 55

I normally use "left_join" from tidyverse, which would be

df3 <- left_join(df1, df2, by="column_name")

df3:
cat 24
small dog NA
apple 83

but this means that "small dog" has a missing value. What I want to do this time is find the value for either "small" or "dog", and input whichever is bigger. I'm not able to find a function that will tell R to look separately before or after the space though

CodePudding user response：

Another possible solution, based on inner_join:

library(tidyverse)

df1 %>% 
  mutate(spaces = row_number()*str_detect(column_name, " ")) %>% 
  separate_rows(column_name, sep = " ") %>% 
  inner_join(df2, by="column_name") %>% 
  group_by(spaces) %>% 
  mutate(col2 = if_else(spaces > 0, max(col2), col2),
         column_name = if_else(spaces > 0, str_c(column_name, collapse = " "), 
                       column_name)) %>% 
  ungroup %>% distinct %>% select(-spaces)

#> # A tibble: 3 × 2
#>   column_name  col2
#>   <chr>       <dbl>
#> 1 cat            24
#> 2 small dog     400
#> 3 apple          83

CodePudding user response：

We may use regex_left_join from fuzzyjoin and then do a group by to summarise the second column with max values

library(dplyr)
library(fuzzyjoin)
regex_left_join(df1, df2, by = "column_name") %>% 
 group_by(column_name = column_name.x) %>% 
 summarise(col2 = max(col2))

-output

# A tibble: 3 × 2
  column_name  col2
  <chr>       <dbl>
1 apple          83
2 cat            24
3 small dog     400

data

df1 <- structure(list(column_name = c("cat", "small dog", "apple")), 
class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(column_name = c("cat", "small", "dog", "apple", 
"pear"), col2 = c(24, 5, 400, 83, 55)), class = "data.frame", row.names = c(NA, 
-5L))