I have a vector, species, containing the names of different species. Normally each name consists of the usual two words; occasionally, however, an element also includes the subspecies (i.e. a third word), or sometimes other alphanumeric characters.

species<-c("Abia americana", "Abia americana","Abia americana primaria","Abia aurulenta", "Leptomyrmex fragilis","Leptomyrmex fragilis sapana", "Parasinilabeo longiventralis laran", "Rhizorhagium arenosum","Rhizorhagium arenosum","Rhizorhagium arenosum COMPLX. 18","Stelis phaeoptera phaeo")

I want to remove the last word or words (or expressions, as in the case of Rhizorhagium arenosum COMPLX. 18) from each string, but only if the entirety of the rest of the string has at least one duplicate in my species vector.

Basically, my output would be something like this:

c("Abia americana", "Abia americana","Abia americana","Abia aurulenta", "Leptomyrmex fragilis","Leptomyrmex fragilis", "Parasinilabeo longiventralis laran", "Rhizorhagium arenosum","Rhizorhagium arenosum","Rhizorhagium arenosum","Stelis phaeoptera phaeo")

Thank you in advance for any answers.

CodePudding user response：

We can extract the first two words of each string and then test whether the extracted string occurs in the original vector. If it does not, we fall back to the original string.

Original answer:

trial <- vapply(strsplit(species, " "), \(x) paste0(x[1:2], collapse = " "), character(1L))
ifelse(trial %in% species, trial, species)

Based on @Onyambu's suggestion:

trial <- sub('^(\\S+\\s\\S+).*', "\\1", species)
ifelse(trial %in% species, trial, species)

Output

 [1] "Abia americana"                     "Abia americana"                     "Abia americana"                    
 [4] "Abia aurulenta"                     "Leptomyrmex fragilis"               "Leptomyrmex fragilis"              
 [7] "Parasinilabeo longiventralis laran" "Rhizorhagium arenosum"              "Rhizorhagium arenosum"             
[10] "Rhizorhagium arenosum"              "Stelis phaeoptera phaeo"
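
One edge case worth checking: if a name has only one word (a bare genus, say), `x[1:2]` is padded with `NA`, so the `strsplit` version builds a string such as "Abia NA" before the `%in%` test falls back to the original. The result is still correct, but a small wrapper around the regex version makes the intent explicit. This is a minimal sketch; `trim_to_binomial` is a hypothetical helper name, not part of either snippet above.

```r
# Hypothetical helper: keep the first two words of each name, but only when
# that two-word form itself occurs somewhere in the input vector.
trim_to_binomial <- function(x) {
  # Capture "word, whitespace, word" at the start of the string. sub() returns
  # its input unchanged when the pattern does not match (single-word names).
  trial <- sub("^(\\S+\\s+\\S+).*", "\\1", x)
  ifelse(trial %in% x, trial, x)
}

trim_to_binomial(c("Abia americana", "Abia americana primaria",
                   "Stelis phaeoptera phaeo", "Abia"))
```

Here "Abia americana primaria" is shortened because "Abia americana" is present, while "Stelis phaeoptera phaeo" and the single-word "Abia" are left as they are.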

CodePudding user response：

Another possible solution, based on stringr::str_split and purrr::map_chr:

library(tidyverse)

species %>% 
  str_split(" ") %>% 
  map_chr(~ str_c(.x[[1]], .x[[2]], sep=" ")) %>% 
  if_else(. %in% species, ., species)

#>  [1] "Abia americana"                     "Abia americana"                    
#>  [3] "Abia americana"                     "Abia aurulenta"                    
#>  [5] "Leptomyrmex fragilis"               "Leptomyrmex fragilis"              
#>  [7] "Parasinilabeo longiventralis laran" "Rhizorhagium arenosum"             
#>  [9] "Rhizorhagium arenosum"              "Rhizorhagium arenosum"             
#> [11] "Stelis phaeoptera phaeo"

CodePudding user response：

Here's a solution in the tidyverse, which maps each name to the shortest name in the vector that it starts with.

Solution

First import the tidyverse and generate the species vector.

library(tidyverse)


# ...
# Code to generate 'species' vector.
# ...

Then use the following workflow to map each word in species to its simplest duplicate; this effectively "removes" the "rest of the string" from each word.

species <- species %>%
  # Get every combination of words within the 'species' vector.
  # The braces keep the pipe from also passing '.' as an unnamed first argument.
  {
    expand_grid(
      First = unique(.),
      Next = unique(.)
    )
  } %>%
  # Flag each combination where one word "matches" another: the one is a substring of
  # the other and starts the same way. 
  mutate(
    First_in_Next = str_starts(Next,  fixed(First)),
    Next_in_First = str_starts(First, fixed(Next))
  ) %>%
  # Ignore nonmatches.
  filter(First_in_Next | Next_in_First) %>%
  # Ensure each combination maps from its longER word to its shortER.
  mutate(
    From = pmax(First, Next),
    To = pmin(First, Next)
  ) %>%
  # Map each word to its shortEST match across all combinations.
  group_by(From) %>% summarize(To = min(To)) %>%
  # Apply the mapping to the 'species' vector, keeping its original order.
  left_join(
    tibble(Species = species),
    .,
    by = c(Species = "From")
  ) %>%
  # Extract the mapped results as a new vector.
  .$To


# View the result.
species

Result

Given a species vector like the one you reproduced in your question

species <- c(
  "Abia americana",
  "Abia americana",
  "Abia americana primaria",
  "Abia aurulenta",
  "Leptomyrmex fragilis",
  "Leptomyrmex fragilis sapana",
  "Parasinilabeo longiventralis laran",
  "Rhizorhagium arenosum",
  "Rhizorhagium arenosum",
  "Rhizorhagium arenosum COMPLX. 18",
  "Stelis phaeoptera phaeo"
)

this workflow should yield the following result for species

 [1] "Abia americana"                     "Abia americana"                    
 [3] "Abia americana"                     "Abia aurulenta"                    
 [5] "Leptomyrmex fragilis"               "Leptomyrmex fragilis"              
 [7] "Parasinilabeo longiventralis laran" "Rhizorhagium arenosum"             
 [9] "Rhizorhagium arenosum"              "Rhizorhagium arenosum"             
[11] "Stelis phaeoptera phaeo"

just as you requested.

Note

Because it can match prefixes of any length, this solution is more extensible than others that split by a delimiter (like " ") and match on only the first two words.

Nonetheless, if you include a string like "Abia" (i.e. only the genus) in species, then "Abia americana primaria" would be (wrongly) mapped to "Abia" rather than to "Abia americana". So make sure your species data is clean before implementing this solution.
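
One possible guard against the bare-genus case, sketched in base R rather than with the tidyverse workflow above: accept a shorter name as a target only when it has at least a minimum number of words and ends at a word boundary. The names `map_to_prefix` and `min_words` are hypothetical, introduced here for illustration, and the sketch assumes that a valid target always has at least two words.

```r
# Hypothetical helper: map each name to the shortest name in the vector that
# is a whole-word prefix of it and has at least `min_words` words.
map_to_prefix <- function(x, min_words = 2L) {
  candidates <- unique(x)
  n_words <- lengths(strsplit(candidates, "\\s+"))
  candidates <- candidates[n_words >= min_words]
  vapply(x, function(name) {
    # A candidate matches if it equals the name, or if the name continues
    # with a space right after it (so "Abia" does not match "Abiax y").
    hits <- candidates[name == candidates |
                         startsWith(name, paste0(candidates, " "))]
    # No acceptable prefix: keep the name as it is.
    if (length(hits) == 0L) name else hits[which.min(nchar(hits))]
  }, character(1L), USE.NAMES = FALSE)
}

map_to_prefix(c("Abia", "Abia americana", "Abia americana primaria"))
```

With the default `min_words = 2L`, "Abia americana primaria" maps to "Abia americana" and the bare "Abia" is left alone; setting `min_words = 1L` reproduces the pitfall described above, with every name collapsing to "Abia".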