Split string keeping spaces in R

Time:01-19

I would like to build a table from raw text using readr::read_fwf. Its col_positions argument determines the column widths, which in my case can differ between inputs. The table always has 4 columns, based on the first 4 words of a header string like the one below: category variable description value sth

> text_for_column_width = "category    variable   description      value      sth"
> nchar("category    ")
[1] 12
> nchar("variable   ")
[1] 11
> nchar("description      ")
[1] 17
> nchar("value      ")
[1] 11

I want to extract the first 4 words while keeping their trailing spaces, so that category, for example, has 8 [a-z] + 4 [space] characters, and finally build a vector with the number of characters for each of the four names: c(12, 11, 17, 11). I tried strsplit with a space as the split argument and then counting the resulting empty strings, but I believe there is a faster way using a proper regular expression.

CodePudding user response:

A possible solution, using stringr:

library(tidyverse)

text_for_column_width = "category    variable   description      value      sth"

strings <- text_for_column_width %>% 
  str_remove("sth$") %>% 
  str_split("(?<=\\s)(?=\\S)") %>% 
  unlist

strings

#> [1] "category    "      "variable   "       "description      "
#> [4] "value      "

strings %>% str_count

#> [1] 12 11 17 11
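
For comparison, the same widths can be obtained in base R without hard-coding the trailing sth. This is only a sketch, not part of the answer above: it assumes the columns are separated by whitespace and that only the first four are wanted. The names tokens and widths are illustrative.

```r
text_for_column_width <- "category    variable   description      value      sth"

# Match each word together with the whitespace that follows it.
tokens <- regmatches(
  text_for_column_width,
  gregexpr("\\S+\\s*", text_for_column_width)
)[[1]]

# Keep only the first four columns and count their characters.
widths <- nchar(head(tokens, 4))
widths
#> [1] 12 11 17 11
```

Taking head(tokens, 4) instead of removing "sth" by name means the code does not depend on what the fifth word happens to be.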

CodePudding user response:

You can use utils::strcapture:

text_for_column_width = "category    variable   description      value      sth"
pattern <- "^(\\S+\\s+)(\\S+\\s+)(\\S+\\s+)(\\S+\\s*)"
result <- utils::strcapture(pattern, text_for_column_width, list(f1 = character(), f2 = character(), f3 = character(), f4 = character()))
nchar(as.character(as.vector(result[1,])))
## => [1] 12 11 17 11

The ^(\S+\s+)(\S+\s+)(\S+\s+)(\S+\s*) pattern matches:

  • ^ - start of string
  • (\S+\s+) - Group 1: one or more non-whitespace chars and then one or more whitespaces
  • (\S+\s+) - Group 2: one or more non-whitespace chars and then one or more whitespaces
  • (\S+\s+) - Group 3: one or more non-whitespace chars and then one or more whitespaces
  • (\S+\s*) - Group 4: one or more non-whitespace chars and then zero or more whitespaces
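
Once the widths are known, they can be handed to read_fwf through fwf_widths, which is what the question was ultimately after. The sketch below assumes readr 2.0 or later (for I() around literal text). The data row and the name tab are invented for illustration.

```r
library(readr)

# Widths as computed above for the four columns.
widths <- c(12, 11, 17, 11)

header <- "category    variable   description      value      sth"
# Hypothetical data row, padded to the same widths (illustration only).
row <- sprintf("%-12s%-11s%-17s%-11s%s", "fruit", "apple", "red fruit", "1.5", "x")

tab <- read_fwf(
  I(paste(header, row, sep = "\n")),
  col_positions = fwf_widths(
    widths,
    col_names = c("category", "variable", "description", "value")
  ),
  col_types = "cccd",  # three character columns, one double
  skip = 1             # skip the header line
)
tab
```

Only the four declared columns are read. Text beyond the last declared width (here the fifth field) is not part of the result.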