Assigning new label / group by partial string matching with vector of shortened labels

Time:11-14

I am trying to group data together in R. I'm using data from a Tidy Tuesday challenge (global seafood, stock) and want to group the data by ocean. Currently, the data is split into ocean segments (e.g. Eastern Central Atlantic and Northeast Central Atlantic).

   Ocean                      code   year    bio_sus    bio_nonsus
 1 Eastern Central Atlantic   NA     2015    57.1       42.9
 2 Eastern Central Atlantic   NA     2017    57.1       42.9
 3 Southeast Central Atlantic NA     2015    67.6       32.4
 4 Southeast Central Atlantic NA     2017    67.6       32.4

Is there a way to combine the data for the different ocean segments (the bio_sus and bio_nonsus columns) into one larger group (e.g. all the segments of the Atlantic into a single Atlantic entry for 2015, 2017)?

I have four different oceans in total (Pacific, Atlantic, Indian and Mediterranean), each of which is segmented like this.

#This is the data: 

stock <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-12/fish-stocks-within-sustainable-levels.csv')

CodePudding user response:

Why not use str_split() from the stringr package to extract the ocean, then make one column for the ocean and one for the sub-segment?
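A minimal sketch of that idea, with one swap: it uses stringr's str_extract() rather than str_split(), on the assumption that the ocean name does not always sit at the same word position in the segment name. The rows are copied from the table in the question. The column name ocean_group and the choice of mean() to combine percentage shares are assumptions made for illustration, not something the question specifies.

```r
library(dplyr)
library(stringr)

# Toy rows copied from the question's table (not the full dataset)
toy <- data.frame(
  Ocean      = c("Eastern Central Atlantic", "Eastern Central Atlantic",
                 "Southeast Central Atlantic", "Southeast Central Atlantic"),
  year       = c(2015, 2017, 2015, 2017),
  bio_sus    = c(57.1, 57.1, 67.6, 67.6),
  bio_nonsus = c(42.9, 42.9, 32.4, 32.4)
)

# One pattern with all four ocean names as alternatives, case-insensitive
ocean_pattern <- regex("pacific|atlantic|indian|mediterranean",
                       ignore_case = TRUE)

toy_by_ocean <- toy %>%
  # ocean_group is a hypothetical column name; rows with no match get NA
  mutate(ocean_group = str_to_title(str_extract(Ocean, ocean_pattern))) %>%
  group_by(ocean_group, year) %>%
  # mean() is an assumption; swap in sum() if totals are what you want
  summarise(across(c(bio_sus, bio_nonsus), mean), .groups = "drop")

toy_by_ocean
```

Here the four segment rows collapse into one Atlantic row per year. Entities that match none of the four names would end up with NA in ocean_group, so they may need separate handling.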

CodePudding user response:

This is essentially a "multiple partial string matching" problem. Here is one approach: loop over your partial strings to get the indices of each partial match, then overwrite those positions in a copy of the original vector with the matching label. Finally, summarise by the new column.

library(dplyr)

stock <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-12/fish-stocks-within-sustainable-levels.csv')

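# Partial strings to look for; these also become the new group labels
oceans <- c("pacific", "atlantic", "indian", "mediterranean")

# For each label, get the row indices whose Entity contains it, then
# stack the named list into a lookup table (values = row index, ind = label).
# simplify = FALSE keeps the result a list even if all matches have equal length.
lu <- stack(sapply(oceans, grep, x = stock$Entity, ignore.case = TRUE, simplify = FALSE))

# Start from the original names and overwrite the matched rows with the label
stock$oceans <- stock$Entity
stock$oceans[lu$values] <- as.character(lu$ind)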

stock %>%
  group_by(oceans) %>%
  summarise(across(matches("^share"), sum))
#> # A tibble: 5 × 3
#>   oceans        `Share of fish stocks within biologi… `Share of fish stocks tha…
#>   <chr>                                         <dbl>                      <dbl>
#> 1 atlantic                                      742.                        458.
#> 2 indian                                        277.                        123.
#> 3 mediterranean                                  75.3                       125.
#> 4 pacific                                       894.                        306.
#> 5 World                                        1609.                        491.

Created on 2021-11-13 by the reprex package (v2.0.1)
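To make the lookup step easier to follow, here is a small sketch of what the stacked table holds, run on a toy vector instead of the downloaded data. The four names in entity are illustrative, and simplify = FALSE is added here on the assumption that you want sapply() to return a list even when every pattern matches the same number of rows.

```r
# Toy vector of segment names; "World" matches no ocean and is left as-is
entity <- c("Eastern Central Atlantic", "Northeast Pacific",
            "Southeast Central Atlantic", "World")
oceans <- c("pacific", "atlantic")

# Named list: one integer vector of matching positions per partial string
hits <- sapply(oceans, grep, x = entity, ignore.case = TRUE, simplify = FALSE)

# stack() flattens the list into two columns:
# values (the matching positions) and ind (the label they belong to)
lu <- stack(hits)

# Copy the original names, then overwrite matched positions with the label
grouped <- entity
grouped[lu$values] <- as.character(lu$ind)

# grouped is now c("atlantic", "pacific", "atlantic", "World")
grouped
```

Unmatched entries keep their original name, which is why "World" shows up as its own group in the summary above.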
