I am working with the R programming language.
Suppose I have the following data:
a = c("12234.2434", "gg.546", "45657herg.6657767")
b = a
my_data = data.frame(a,b)
I am trying to create a new variable "c" that contains the first two "entries" before the decimal point and the first two "entries" after the decimal point. If done correctly, this would look something like this: 34.243, gg.546, rg.665
                  a                 b      c
1        12234.2434        12234.2434 34.243
2            gg.546            gg.546 gg.546
3 45657herg.6657767 45657herg.6657767 rg.665
Normally, I would have done this using the substr() function in R - but since the numbers are of different length, the decimal point can be in different positions, thus making the substr() function not very useful in this case.
I know how to solve this problem in Microsoft Excel by using the "text to columns" and "delimited" option by specifying "fixed width delimitation" with the "decimal point" - but I am trying to do this in R using the "dplyr" library.
Can someone please show me how to do this?
Thanks!
CodePudding user response:
From the expected output, it seems you want two characters before the decimal point and three characters after it.
You can use sub() with a capture group to extract those values.
sub('.*(.{2}\\..{3}).*', '\\1', my_data$a)
#[1] "34.243" "gg.546" "rg.665"
With dplyr:
library(dplyr)
my_data %>% mutate(c = sub('.*(.{2}\\..{3}).*', '\\1', a))
#                   a                 b      c
# 1        12234.2434        12234.2434 34.243
# 2            gg.546            gg.546 gg.546
# 3 45657herg.6657767 45657herg.6657767 rg.665
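One thing to be aware of with the sub() approach: when the pattern does not match, sub() returns the input unchanged rather than NA. A minimal base R sketch of the difference is below. The inputs "nodot" and "a.5" are made up for illustration and are not part of the original data.

```r
# Hypothetical inputs: the last two cannot match "two characters, a dot,
# three characters", so they show what happens on a failed match.
x <- c("12234.2434", "gg.546", "nodot", "a.5")

pattern <- ".{2}\\..{3}"

# sub() leaves non-matching strings untouched.
with_sub <- sub(paste0(".*(", pattern, ").*"), "\\1", x)

# regexpr() returns -1 where nothing matches; use that to fill in NA instead.
m <- regexpr(pattern, x)
extracted <- rep(NA_character_, length(x))
extracted[m != -1] <- regmatches(x, m)

print(with_sub)
print(extracted)
```

If silently passing through malformed values is a problem for your data, the regmatches() variant makes them visible as NA.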
CodePudding user response:
Use str_remove_all with regex lookarounds. The pattern removes either the characters (.*) that precede two characters and a dot ((?=..\\.)), or (|) the characters (.*) that follow a dot and three characters ((?<=\\....)).
library(dplyr)
library(stringr)
my_data %>%
  mutate(c = str_remove_all(a, ".*(?=..\\.)|(?<=\\....).*"))
                  a                 b      c
1        12234.2434        12234.2434 34.243
2            gg.546            gg.546 gg.546
3 45657herg.6657767 45657herg.6657767 rg.665
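To see what each half of the alternation does, the two lookarounds can be applied one at a time. This is only a sketch in base R (perl = TRUE is needed for lookarounds there), not a replacement for the stringr call above. The names step1 and step2 are just for illustration.

```r
a <- c("12234.2434", "gg.546", "45657herg.6657767")

# Step 1: drop everything before the two characters that precede the dot.
step1 <- sub(".*(?=..\\.)", "", a, perl = TRUE)

# Step 2: drop everything after the three characters that follow the dot.
step2 <- sub("(?<=\\....).*", "", step1, perl = TRUE)

print(step1)
print(step2)
```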
Regarding substr (or str_sub from stringr): it works fine once we know the position of the dot (.). The code below finds that position with str_locate and uses it as the index for the substring.
my_data %>%
  mutate(i1 = str_locate(a, fixed("."))[, "start"],
         c = str_sub(a, i1 - 2, i1 + 3), i1 = NULL)
                  a                 b      c
1        12234.2434        12234.2434 34.243
2            gg.546            gg.546 gg.546
3 45657herg.6657767 45657herg.6657767 rg.665
The equivalent in base R would be
> i1 <- regexpr(".", my_data$a, fixed = TRUE)
> substr(my_data$a, i1 - 2, i1 + 3)
[1] "34.243" "gg.546" "rg.665"
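If the number of characters to keep might change, the base R version can be wrapped in a small function. This is a rough sketch: extract_around_dot and its arguments n_before and n_after are hypothetical names, and returning NA when there is no dot is my own assumption about what you would want.

```r
# Hypothetical helper: keep n_before characters before the first dot and
# n_after characters after it; return NA when the string has no dot.
extract_around_dot <- function(x, n_before = 2, n_after = 3) {
  pos <- as.integer(regexpr(".", x, fixed = TRUE))  # -1 when no dot is found
  out <- substr(x, pos - n_before, pos + n_after)
  out[pos == -1] <- NA_character_
  out
}

a <- c("12234.2434", "gg.546", "45657herg.6657767")
print(extract_around_dot(a))
print(extract_around_dot(a, n_before = 2, n_after = 2))
```

It can then be used inside mutate() like any other vectorised function, e.g. mutate(c = extract_around_dot(a)).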
