Tidyverse: Match word in string from list of keywords

Time:01-06

I'm trying to write some code that checks whether a string contains any word from a list of terms, so that I can create a new column in the dataframe.

This is the list of terms: vehicles <- c('vehicle', 'mazda', 'nissan', 'ford', 'honda', 'chevrolet', 'toyota')

Examples of the strings I'm searching include: "2001 honda civic", "2003 nissan altima", "2005 mazda 5", etc. (these are the asset_name values in the code below).

My simplified code looks like this:

df %>%
  mutate(
    asset_type = case_when(
      vehicles %in% asset_name == TRUE ~ 'vehicle', # this doesn't work, obviously
      <CODE THAT DOES WORK HERE!!!>
      TRUE ~ asset_name
    )
  )

I've tried str_detect, str_extract, grepl & a custom function but can't seem to figure out how to make this work.

I know that for each asset_name entry, I need to loop through the list of vehicles to see if one of the vehicle models is in asset_name but I can't seem to make it work. grr...

Thanks in advance!!!

CodePudding user response:

One approach might be to build a regex alternation of the vehicle terms, and then use grepl to match:

vehicles <- c('vehicle', 'mazda', 'nissan', 'ford', 'honda', 'chevrolet', 'toyota')
regex <- paste0("\\b(?:", paste(vehicles, collapse="|"), ")\\b")

df %>%
    mutate(
        asset_type = case_when(
            grepl(regex, asset_name) ~ 'vehicle',
            # <any other conditions go here>
            TRUE ~ asset_name
        )
    )
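
For a version you can run end to end, here is a minimal sketch of the same alternation idea using stringr::str_detect instead of grepl. The sample data frame, the name `pattern`, and the ignore_case option are assumptions added for illustration; they are not from the question.

```r
library(dplyr)
library(stringr)

vehicles <- c("vehicle", "mazda", "nissan", "ford", "honda", "chevrolet", "toyota")

# One pattern that matches any keyword as a whole word:
# \b(?:vehicle|mazda|nissan|...)\b
pattern <- regex(
  paste0("\\b(?:", paste(vehicles, collapse = "|"), ")\\b"),
  ignore_case = TRUE  # assumption: "Nissan" should match as well as "nissan"
)

# Hypothetical sample data, for illustration only
df <- tibble(
  asset_name = c("2001 honda civic", "2003 Nissan altima",
                 "office chair", "affordable desk")
)

result <- df %>%
  mutate(
    asset_type = case_when(
      str_detect(asset_name, pattern) ~ "vehicle",
      TRUE ~ asset_name  # no keyword found: keep the original name
    )
  )
```

The word boundaries (\b) are what keep "affordable desk" from being tagged as a vehicle just because it contains "ford". This also shows why the original `vehicles %in% asset_name` could not work: %in% tests whether whole elements are equal, not whether one string occurs inside another.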

CodePudding user response:

Adapted from this answer:

library(tidyverse)

vehicles <- c('vehicle', 'mazda', 'nissan', 'ford', 'honda', 'chevrolet', 'toyota')
asset_name <- c("2001 honda civic", "2003 nissan altima", "2005 mazda 5", 
                "unmatched1", "unmatched2") # added unmatched strings
x <- seq_along(asset_name) # dummy variable to make df

df <- data.frame(x, asset_name)

df %>%
  mutate(
    asset_type = case_when(
      asset_name %in% unlist(lapply(vehicles, grep, asset_name, value = TRUE)) ~ 'vehicle',
      TRUE ~ asset_name
    )
  )

Output:

  x         asset_name asset_type
1 1   2001 honda civic    vehicle
2 2 2003 nissan altima    vehicle
3 3       2005 mazda 5    vehicle
4 4         unmatched1 unmatched1
5 5         unmatched2 unmatched2
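
If it helps to see the "loop through the list of vehicles" idea spelled out, the following is a rough base R sketch of what the lapply/grep expression is doing. The names `hits` and `is_vehicle` are made up for illustration, and fixed = TRUE is an assumption that the keywords should be treated as literal text rather than regular expressions.

```r
vehicles <- c("vehicle", "mazda", "nissan", "ford", "honda", "chevrolet", "toyota")
asset_name <- c("2001 honda civic", "2003 nissan altima", "2005 mazda 5",
                "unmatched1", "unmatched2")

# Check every keyword against every asset name.
# Result: a logical matrix with one row per asset name and one column per keyword.
hits <- sapply(vehicles, grepl, x = asset_name, fixed = TRUE)

# An asset counts as a vehicle if at least one keyword matched in its row
is_vehicle <- rowSums(hits) > 0

asset_type <- ifelse(is_vehicle, "vehicle", asset_name)
```

Note that this matches substrings, so a keyword like "ford" would also hit a name such as "affordable desk"; a pattern with word boundaries avoids that. The sketch also assumes asset_name has more than one element, since sapply returns a plain vector instead of a matrix for a single string.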