Is there a better method of splitting character strings with different lengths and patterns in R?

I have 12 sample names consisting of a long character string, but most sample names are different lengths and all have different sample identifiers (e.g. some are labelled JKT-n and some are labelled sample_n or Sample_n). I want to extract only the sample identifier part which is in the middle of the string, labelled as either "JKT-n", "Sample_n" or "sample_n". I'm having difficulty as the identifiers aren't consistent. Here is an example for 3 of them:

data$sample
 [1] "Monocytes DF 2_E18_016e_20180411_JKT-6_01_normalized_Ungated_viSNE_FlowSOM.fcs"     
 [2] "Monocytes DHF 2_E19_014b_20190731_sample_32_01_normalized_Ungated_viSNE_FlowSOM.fcs"
 [3] "Monocytes DF 2_E19_014b_20190730_Sample_21_01_normalized_Ungated_viSNE_FlowSOM.fcs"

This is the method I've used to split the strings which got me what I wanted in the end. However, I'm wondering if there's a neater way to do this as it's a bit clunky.

data <- as.data.frame(janitor::clean_names(readxl::read_excel("Significant Citrus clusters.xlsx")))
data$tmp <- substr(data$sample, 34, nchar(data$sample) - 40)
data$tmp2 <- gsub(pattern = "Sample_", replacement = "JKT-", x = data$tmp)
data$tmp3 <- gsub(pattern = "_sample_", replacement = "JKT-", x = data$tmp2)
data$tmp4 <- gsub(pattern = "_", replacement = "", x = data$tmp3)
data$CohortID <- data$tmp4

data$CohortID
 [1] "JKT-6"  "JKT-8"  "JKT-12" "JKT-21" "JKT-26" "JKT-27" "JKT-4"  "JKT-9"  "JKT-22" "JKT-30" "JKT-32" "JKT-33"

Thanks

CodePudding user response：

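You can combine the two gsub() calls that replace "Sample_" and "_sample_" (your third and fourth lines) into a single call by using the regex alternation operator |, applied directly to data$tmp:

    gsub(pattern = "Sample_|_sample_", replacement = "JKT-", x = data$tmp)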
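
Here is a minimal sketch of that combined replacement, run on the three example names from the question. The names `sample_names`, `tmp` and `cohort_id` are placeholders I made up. The offsets 34 and 40 are copied from the question's substr() call, so this assumes every file name follows the same layout.

```r
# The three example file names from the question
sample_names <- c(
  "Monocytes DF 2_E18_016e_20180411_JKT-6_01_normalized_Ungated_viSNE_FlowSOM.fcs",
  "Monocytes DHF 2_E19_014b_20190731_sample_32_01_normalized_Ungated_viSNE_FlowSOM.fcs",
  "Monocytes DF 2_E19_014b_20190730_Sample_21_01_normalized_Ungated_viSNE_FlowSOM.fcs"
)

# Same positional cut as in the question: start at character 34,
# drop the last 40 characters
tmp <- substr(sample_names, 34, nchar(sample_names) - 40)

# One gsub() call: "|" matches either "Sample_" or "_sample_"
cohort_id <- gsub("Sample_|_sample_", "JKT-", tmp)

print(cohort_id)
# [1] "JKT-6"  "JKT-32" "JKT-21"
```

For these three names the later gsub() that strips "_" has nothing left to remove, but it may still be needed for the other nine.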

CodePudding user response：

Alternatively, the digits that follow _JKT-, _Sample_, or _sample_ can be extracted with a regular expression that uses a lookbehind, regardless of where they sit in the string (see https://www.regular-expressions.info/lookaround.html for a detailed explanation).

paste0("JKT-", stringr::str_extract(data$sample, "(?<=_(JKT-|[sS]ample_))\\d+"))

[1] "JKT-6"  "JKT-32" "JKT-21"

For string manipulation, I prefer the stringr package over the base R functions because of its consistent user interface and function naming.
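
If you would rather stay in base R, roughly the same extraction can be sketched with a capture group instead of a lookbehind. This is only a sketch: it assumes each name contains exactly one `_JKT-n_`, `_Sample_n_` or `_sample_n_` token, and `sample_names` and `cohort_id` are illustrative names rather than anything from the question.

```r
# The three example file names from the question
sample_names <- c(
  "Monocytes DF 2_E18_016e_20180411_JKT-6_01_normalized_Ungated_viSNE_FlowSOM.fcs",
  "Monocytes DHF 2_E19_014b_20190731_sample_32_01_normalized_Ungated_viSNE_FlowSOM.fcs",
  "Monocytes DF 2_E19_014b_20190730_Sample_21_01_normalized_Ungated_viSNE_FlowSOM.fcs"
)

# Match the whole string, capture the digits that follow "_JKT-",
# "_Sample_" or "_sample_", and rebuild the identifier as "JKT-<digits>".
# perl = TRUE is used so the non-capturing group (?:...) is available.
cohort_id <- sub(
  ".*_(?:JKT-|[Ss]ample_)(\\d+)_.*",
  "JKT-\\1",
  sample_names,
  perl = TRUE
)

print(cohort_id)
# [1] "JKT-6"  "JKT-32" "JKT-21"
```

One difference to keep in mind: sub() returns the input unchanged when the pattern does not match, whereas str_extract() returns NA, so a malformed name passes through silently here.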

Sample data

data <- data.frame(sample = c(
  "Monocytes DF 2_E18_016e_20180411_JKT-6_01_normalized_Ungated_viSNE_FlowSOM.fcs",
  "Monocytes DHF 2_E19_014b_20190731_sample_32_01_normalized_Ungated_viSNE_FlowSOM.fcs",
  "Monocytes DF 2_E19_014b_20190730_Sample_21_01_normalized_Ungated_viSNE_FlowSOM.fcs"
))