How can I remove strings that are not succeeded by numbers?
For example, I am working with string data like the one below:
String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
"ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
"ab 3 16 31 41 1134 1206 abuht",
"ab 479 862 984 1626 asc")
df <- data.frame(String)
I would like the output to look like the following:
Output <- c("NA; ab 1917; ajr 69; sb 700; sb 703;; ab 1672",
"ab 18 sb 5 ab 1433 ab 1129 ab 184 ab 473",
"ab 3 16 31 41 1134 1206",
"ab 479 862 984 1626")
df <- data.frame(String, Output)
Thank you so much for your help!
CodePudding user response:
Sorry, I couldn't post this as a comment, so I am putting my incomplete code here.
I agree with Chris's opinion.
I focused on the first line of "Output" and tried using ";" as the separator.
If you also want to split on " " (white space), just modify the code.
String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
"ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
"ab 3 16 31 41 1134 1206 abuht",
"ab 479 862 984 1626 asc")
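res <- c()
for (str in String) {
  # split each string into its ";"-separated pieces
  hoge <- strsplit(str, ";")[[1]]
  # keep only pieces that contain a digit or "NA", then join them again
  res <- c(res, paste(hoge[grep("\\d|NA", hoge)], collapse = ";"))
}
# ** this result is insufficient **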
data.frame(res)
res
1 NA; ab 1917; ajr 69; sb 700; sb 703; ab 1672 a
2 ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a
3 ab 3 16 31 41 1134 1206 abuht
4 ab 479 862 984 1626 asc
If you edit your question, I think kind contributors will help you.
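The loop above still leaves trailing letter-only tokens such as "a" and "hdge". One possible way to close that gap, as a minimal sketch rather than a definitive answer, is to cut everything after the last digit of each kept piece. The helper name trim_after_number is hypothetical, and the rule "drop whatever follows the last digit" is an assumption about what the question wants.

```r
String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
            "ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
            "ab 3 16 31 41 1134 1206 abuht",
            "ab 479 862 984 1626 asc")

# Hypothetical helper (not from the answer above): remove any non-digit
# text that follows the last digit of a piece. Pieces without a digit,
# such as "NA", are returned unchanged.
trim_after_number <- function(x) sub("(\\d)\\D*$", "\\1", x)

res <- vapply(String, function(str) {
  pieces <- strsplit(str, ";")[[1]]
  # same filter as the loop above: keep pieces with a digit or "NA"
  kept <- pieces[grepl("\\d|NA", pieces)]
  paste(trim_after_number(kept), collapse = ";")
}, character(1), USE.NAMES = FALSE)

res
```

With the sample data, the second element keeps its semicolon after 1433, and the first element has no empty placeholder where "scarl m" used to be, so this still differs from the desired Output in those two places.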
CodePudding user response:
First let's determine the regex:
succeed_num_regex = "(( )?.+ [0-9]+)+"
The meaning:
( )? : we allow (but don't require) a space at the beginning.
.+ : some amount of free text (this is the "string" that is to be succeeded by a number).
a single space : there must be a space after the string.
[0-9]+ : this is the number.
The whole thing is enclosed in ()+, meaning that we are looking for this pattern to repeat one or more times.
Now we can put this in code:
library(tidyverse)
String %>%
  str_split("; ") %>%
  map(map_chr, str_extract, pattern = succeed_num_regex) %>%
  # Strings that did not have this pattern at all will be NA
  # We replace them here with ""
  map(map_chr, function(x) ifelse(is.na(x), "", x)) %>%
  # Put it all back together
  map_chr(paste, collapse = "; ")
[1] "; ab 1917; ajr 69; sb 700; sb 703; ; ab 1672"
[2] "ab 18 sb 5 ab 1433; ab 1129 ab 184 ab 473"
[3] "ab 3 16 31 41 1134 1206"
[4] "ab 479 862 984 1626"
Some notes:
In your output, you kept "NA" instead of it getting replaced with "", which is what later happened to "scarl m". This can be added as a rule to the solution, but for now I did not add it because it is not consistent with your requirements.
In your output, the second result "ab 18 sb 5 ab 1433 ab 1129 ab 184 ab 473" is missing a semi-colon after 1433. If that was not a mistake, then please explain why.
In your output, we have sb 703;; whereas my output has sb 703; ;. This is because the results are pasted back together with "; ". Let me know if this is problematic (I left it as is since that isn't a clear requirement either).
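The first note says that keeping "NA" could be added as a rule. A minimal base-R sketch of that rule follows, assuming the pattern reads "(( )?.+ [0-9]+)+" as the explanation describes. It uses regmatches() with perl = TRUE instead of the tidyverse pipeline so that it runs without extra packages, and extract_piece is a hypothetical helper name, not part of the answer.

```r
String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
            "ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
            "ab 3 16 31 41 1134 1206 abuht",
            "ab 479 862 984 1626 asc")

# Assumed pattern: text followed by a space and a number, repeated
succeed_num_regex <- "(( )?.+ [0-9]+)+"

# Hypothetical helper: keep a literal "NA" piece as is, otherwise return
# the part of the piece that matches the pattern, or "" if nothing matches.
extract_piece <- function(piece) {
  if (piece == "NA") return("NA")
  m <- regmatches(piece, regexpr(succeed_num_regex, piece, perl = TRUE))
  if (length(m) == 0) "" else m
}

result <- vapply(strsplit(String, "; ", fixed = TRUE), function(pieces) {
  paste(vapply(pieces, extract_piece, character(1)), collapse = "; ")
}, character(1), USE.NAMES = FALSE)

result
```

Only the first element changes compared with the output above: it now starts with "NA; ab 1917" instead of "; ab 1917". The other two notes (the semicolon after 1433 and "; ;" versus ";;") are left untouched here.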
CodePudding user response:
Using an editor like VS Code or Notepad++, I can match (\s)([a-z][a-z][a-z]+) and replace it with $1.
Your requirement is confusing: according to your output, 'ajr' is not supposed to be matched, while 'asc' is. My hack above matches both 'ajr' and 'asc'.
A breakdown of my hack:
(\s) matches the space before the group of letters. I noticed that you want to match only groups of letters found after a space.
([a-z][a-z][a-z]+) matches groups of three or more letters (as I noticed that you do not want to match 2-letter groups).
$1 replaces the whole match with the captured space only, which removes the letters.
I hope it helps. You can take this and translate it into the programming language you are using.
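As a rough illustration of such a translation into R, the sketch below assumes the editor pattern is (\s)([a-z][a-z][a-z]+) with $1 as the replacement. In R the backreference is written "\\1". The trimws() call is an addition of this sketch, because the replacement leaves the captured space behind at the end of a string.

```r
String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
            "ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
            "ab 3 16 31 41 1134 1206 abuht",
            "ab 479 862 984 1626 asc")

# Replace "whitespace + three or more lowercase letters" with just the
# captured whitespace, then trim the space left at the end of a string.
cleaned <- trimws(gsub("(\\s)([a-z][a-z][a-z]+)", "\\1", String, perl = TRUE))

cleaned
```

As the answer warns, this also removes 'ajr' from the first string, and it does not touch one-letter tokens such as the trailing "a", so it only approximates the desired Output.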
