I have a number of CIGAR strings, and for each one I want to sum the numbers that occur before the number immediately preceding "I". The position of "I" is highly variable, but it always has a number before it.
Here is a sample df:
df <- data.frame(String = c("220M1I","10I200M","5M2D1I20M","22M5D2M3I5M"))
My desired output looks like:
String Sum_prior
1 220M1I 220
2 10I200M 0
3 5M2D1I20M 7
4 22M5D2M3I5M 29
I have a partial solution, but it can't handle numbers with more than one digit before "I", which is a problem.
sum_fun <- function(x) {
str_match_all(x, "\\d+(?!I)") %>%
unlist() %>%
as.numeric() %>%
sum()
}
Then, applying it to df:
df <- df %>% rowwise() %>% mutate(output = sum_fun(String))
df
String output
<chr> <dbl>
1 220M1I 220 #Good
2 10I200M 201 #The 1 in 10 is being included
3 5M2D1I20M 27 #Don't want last 20 included
4 22M5D2M3I5M 34 #Don't want last 5 included
But I can't figure out how to adapt the regex to ignore the number immediately prior to "I" and sum all other numbers before "I".
A more advanced example I need (but less important) is to calculate the cumulative sum when there is more than one "I". The first occurrence is handled as above (Output_1), but for the second and later occurrences (Output_2) the sum includes the number attached to the preceding "I".
df2 <- data.frame(String = c("5M10I200M20I","100M2D3I105M1I10M"))
String Output_1 Output_2
1 5M10I200M20I 5 215
2 100M2D3I105M1I10M 102 210
Any help is appreciated.
CodePudding user response:
Here is a base R approach:
df <- data.frame(String = c("220M1I","10I200M","5M2D1I20M","22M5D2M3I5M"))
x <- sub("\\d+I.*$", "", df$String)
df$Sum_prior <- sapply(strsplit(x, "\\D"), function(y) sum(as.numeric(y)))
df
String Sum_prior
1 220M1I 220
2 10I200M 0
3 5M2D1I20M 7
4 22M5D2M3I5M 29
The strategy here is to first strip off the number followed by I, together with everything after it up to the end of the string. Then we split the remaining string on non-digit characters to generate a vector of number strings. Finally, we convert those to numeric and sum them to get the final result.
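To make the three steps concrete, here is a small walk-through on one of the sample strings. It is only a sketch in base R; the variable names `cigar`, `prefix`, `parts`, and `total` are made up for illustration and are not part of the answer above.

```r
# Walk through the three steps on a single CIGAR string.
cigar <- "22M5D2M3I5M"

# Step 1: drop the number attached to the first "I" and everything after it.
prefix <- sub("\\d+I.*$", "", cigar)
prefix
# "22M5D2M"

# Step 2: split on non-digit characters to get the remaining numbers.
parts <- strsplit(prefix, "\\D")[[1]]
parts
# "22" "5" "2"

# Step 3: convert to numeric and sum.
total <- sum(as.numeric(parts))
total
# 29
```

For a string such as "10I200M", step 1 leaves an empty string, step 2 yields an empty character vector, and the sum of an empty vector is 0, which is why that row comes out as 0.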
CodePudding user response:
Another approach is to extract all the numbers followed by characters and sum the numbers before the occurrence of 'I'.
library(dplyr)
library(stringr)
sum_fun <- function(x) {
tmp <- str_match_all(x, "(\\d+)[A-Z]+")[[1]]
sum(as.numeric(tmp[, 2])[seq_len(grep('I', tmp[, 1]) - 1)])
}
df %>%
rowwise() %>%
mutate(output = sum_fun(String)) %>%
ungroup
# String output
# <chr> <dbl>
#1 220M1I 220
#2 10I200M 0
#3 5M2D1I20M 7
#4 22M5D2M3I5M 29
CodePudding user response:
Another base R plus stringr approach in one line:
library(stringr)
df$Sum <- lapply(str_extract_all(sub("\\d+I.*$", "", df$String), "\\d+"), function(x) sum(as.numeric(x)))
String Sum
1 220M1I 220
2 10I200M 0
3 5M2D1I20M 7
4 22M5D2M3I5M 29
This works in steps:
- the first is the sub operation, which gets rid of the number immediately before "I", the "I" itself, and the rest of the string
- the second is the str_extract_all part, which extracts all remaining numbers into a list
- the third is the lapply part, where we sum the extracted numbers for each string
CodePudding user response:
Another option is using dplyr, stringr, and purrr:
library(dplyr)
library(purrr)
library(stringr)
df %>%
# Steps 1 (remove irrelevant part of string) and 2 (extract numbers):
mutate(String_new = str_extract_all(sub("\\d+I.*$", "", String), "\\d+")) %>%
# Step 3: convert to numeric and perform calculation:
mutate(String_new = map_dbl(String_new, function(x) sum(as.numeric(x))))
String String_new
1 220M1I 220
2 10I200M 0
3 5M2D1I20M 7
4 22M5D2M3I5M 29
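For the question's second example, where a string contains more than one "I", one possible extension is to compute a running total and read it off at each "I". This is a rough base R sketch, not a tested solution: `sum_before_each_I` is a hypothetical helper name, and it assumes every operation is a single uppercase letter and that numbers and letters strictly alternate.

```r
# Hypothetical helper (not from the answers above): for each "I" in a CIGAR
# string, sum every number that precedes the number attached to that "I".
# Assumes each operation is one uppercase letter preceded by one number.
sum_before_each_I <- function(cigar) {
  # All numbers, in order of appearance.
  nums <- as.numeric(regmatches(cigar, gregexpr("[0-9]+", cigar))[[1]])
  # All operation letters, in the same order.
  ops <- regmatches(cigar, gregexpr("[A-Z]", cigar))[[1]]
  # Running total of everything strictly before each position.
  before <- cumsum(nums) - nums
  # Keep only the positions whose operation is "I".
  before[ops == "I"]
}

sum_before_each_I("5M10I200M20I")
# 5 215
sum_before_each_I("100M2D3I105M1I10M")
# 102 210
```

The first element corresponds to Output_1 and the second to Output_2 in the question. A string without any "I" returns an empty numeric vector, so the result would need padding before being spread into fixed columns.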
