Sum numbers that occur before a character in a string, except the immediately preceding number

I have a number of strings (CIGARs), and for each one I am trying to sum the numbers that occur before the number immediately preceding "I". The position of "I" is highly variable, but it always has a number before it. Here is a sample df:

df <- data.frame(String = c("220M1I","10I200M","5M2D1I20M","22M5D2M3I5M"))

My desired output looks like:

       String Sum_prior
1      220M1I       220
2     10I200M         0
3   5M2D1I20M         7
4 22M5D2M3I5M        29

I have a partial solution, but it can't handle numbers with more than one digit before "I", which is a problem.

sum_fun <- function(x) {
  str_match_all(x, "\\d+(?!I)") %>%
    unlist() %>%
    as.numeric() %>%
    sum()
}

then applying to df:

df <- df %>% rowwise() %>% mutate(output = sum_fun(String))
df



  String      output
  <chr>        <dbl>
1 220M1I         220 #Good
2 10I200M        201 #The 1 in 10 is being included
3 5M2D1I20M       27 #Don't want last 20 included
4 22M5D2M3I5M     34 #Don't want last 5 included

But I can't figure out how to adapt the regex to ignore the whole number immediately prior to "I" and sum all other numbers before "I".

A more advanced case that I need (but that is less important) is a cumulative sum when there is more than one "I". The first occurrence is handled as above (Output_1), but for the second and later occurrences (Output_2) the sum also includes the numbers attached to any preceding "I".

df2 <- data.frame(String = c("5M10I200M20I","100M2D3I105M1I10M"))


             String Output_1 Output_2
1      5M10I200M20I        5      215
2 100M2D3I105M1I10M      102      210

Any help is appreciated.

CodePudding user response：

Here is a base R approach:

df <- data.frame(String = c("220M1I","10I200M","5M2D1I20M","22M5D2M3I5M"))
x <- sub("\\d+I.*$", "", df$String)
df$Sum_prior <- sapply(strsplit(x, "\\D"), function(y) sum(as.numeric(y)))
df

       String Sum_prior
1      220M1I       220
2     10I200M         0
3   5M2D1I20M         7
4 22M5D2M3I5M        29

The strategy here is to first strip everything from the number immediately followed by I through the end of the string. Then we split the remainder on non-digit characters to get a vector of number strings. Finally, we sum those numbers to get the result.
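To make the three steps concrete, here is a small trace on a single string. The variable names (s, stripped, parts, total) are only for illustration, and the pattern assumes the `+` quantifier so that multi-digit numbers before I are removed as a whole.

```r
s <- "22M5D2M3I5M"

# Step 1: drop the number before the first "I", the "I", and everything after it
stripped <- sub("\\d+I.*$", "", s)
stripped
# [1] "22M5D2M"

# Step 2: split on non-digit characters to get the remaining numbers as strings
parts <- strsplit(stripped, "\\D")[[1]]
parts
# [1] "22" "5"  "2"

# Step 3: convert to numeric and sum
total <- sum(as.numeric(parts))
total
# [1] 29
```

For a string such as "10I200M", step 1 leaves an empty string, so step 2 yields no numbers and the sum is 0.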

CodePudding user response：

Another approach is to extract all the numbers followed by characters and sum the numbers before the occurrence of 'I'.

library(dplyr)
library(stringr)

sum_fun <- function(x) {
  tmp <- str_match_all(x, "(\\d+)[A-Z]+")[[1]]
  sum(as.numeric(tmp[, 2])[seq_len(grep('I', tmp[, 1]) - 1)])
}

df %>% 
  rowwise() %>% 
  mutate(output = sum_fun(String)) %>%
  ungroup

#  String      output
#  <chr>        <dbl>
#1 220M1I         220
#2 10I200M          0
#3 5M2D1I20M        7
#4 22M5D2M3I5M     29
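The same token-based idea might be extended to the multiple-"I" case from the question. The following is only a sketch in base R: the function name sums_before_each_I is hypothetical, and it assumes every operation is a number followed by a single letter (or "="), as in standard CIGAR strings. It returns one value per "I", each being the total of all numbers before the number attached to that "I".

```r
# Hypothetical helper (not from the original answers):
# returns one running total per "I" in a CIGAR string.
sums_before_each_I <- function(cigar) {
  # Split into <number><op> tokens, e.g. "5M" "10I" "200M" "20I"
  tokens <- regmatches(cigar, gregexpr("[0-9]+[A-Z=]", cigar))[[1]]
  lens <- as.numeric(sub("[A-Z=]$", "", tokens))  # numeric part of each token
  ops  <- sub("^[0-9]+", "", tokens)              # operation letter of each token
  # Total of everything strictly before each token
  before <- cumsum(lens) - lens
  # Keep only the totals that belong to "I" tokens
  before[ops == "I"]
}

sums_before_each_I("5M10I200M20I")
# [1]   5 215
sums_before_each_I("100M2D3I105M1I10M")
# [1] 102 210
```

The first element corresponds to Output_1 and the second to Output_2 in the question. A string without any "I" gives a zero-length vector, so that case would need separate handling before the results are placed in data frame columns.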

CodePudding user response：

Another base R plus stringr approach in one line:

library(stringr)
df$Sum <- lapply(str_extract_all(sub("\\d+I.*$", "", df$String), "\\d+"), function(x) sum(as.numeric(x)))
       String Sum
1      220M1I 220
2     10I200M   0
3   5M2D1I20M   7
4 22M5D2M3I5M  29

This works in steps:

first, the sub() call removes the number directly before "I", the "I" itself, and the rest of the string
second, str_extract_all() extracts all remaining numbers into a list
third, lapply() converts each list element to numeric and sums it

CodePudding user response：

Another option is using dplyr, stringr, and purrr:

library(dplyr)
library(purrr)
library(stringr)
df %>%
  # Steps 1 (remove irrelevant part of string) and 2 (extract numbers): 
  mutate(String_new = str_extract_all(sub("\\d+I.*$", "", String), "\\d+")) %>%
  # Step 3: convert to numeric and perform calculation:
  mutate(String_new = map_dbl(String_new, function(x) sum(as.numeric(x))))
       String String_new
1      220M1I        220
2     10I200M          0
3   5M2D1I20M          7
4 22M5D2M3I5M         29