R: "Scanning" Numbers for Decimal Points

I am working with the R programming language.

Suppose I have the following data:

a = c("12234.2434", "gg.546", "45657herg.6657767")

b = a

my_data = data.frame(a,b)

I am trying to create a new variable "c" that contains the first two "entries" before the decimal point and the first two "entries" after the decimal point. If done correctly, this would look something like this: 34.243, gg.546, rg.665

                  a                 b     c
1        12234.2434        12234.2434 34.243
2            gg.546            gg.546 gg.546
3 45657herg.6657767 45657herg.6657767 rg.665

Normally, I would have done this using the substr() function in R - but since the numbers are of different length, the decimal point can be in different positions, thus making the substr() function not very useful in this case.

I know how to solve this problem in Microsoft Excel by using the "text to columns" and "delimited" option by specifying "fixed width delimitation" with the "decimal point" - but I am trying to do this in R using the "dplyr" library.

Can someone please show me how to do this?

Thanks!

CodePudding user response：

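From the expected output, it seems you want two entries before the decimal point and three entries after it.

You can use sub() with a capture group to extract those values. The pattern captures two characters, a literal dot, and three characters, and replaces the whole string with that capture.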

sub('.*(.{2}\\..{3}).*', '\\1', my_data$a)
#[1] "34.243" "gg.546" "rg.665"
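
If the number of characters to keep on either side of the dot may change, the pattern could be built from parameters. The following is only a sketch: extract_around_dot, n_before and n_after are hypothetical names introduced here and are not part of the original answer.

```r
# Sample vector from the question
a <- c("12234.2434", "gg.546", "45657herg.6657767")

# Hypothetical helper: keep n_before characters before the dot
# and n_after characters after it
extract_around_dot <- function(x, n_before = 2, n_after = 3) {
  # Build e.g. ".*(.{2}\\..{3}).*"; the capture group holds the part to keep
  pattern <- sprintf(".*(.{%d}\\..{%d}).*", n_before, n_after)
  # Replace the whole string with the captured group;
  # strings that do not match the pattern are returned unchanged
  sub(pattern, "\\1", x)
}

res_default <- extract_around_dot(a)
res_two     <- extract_around_dot(a, n_after = 2)
print(res_default)
print(res_two)
```

Note that sub() leaves non-matching strings untouched, so an input without a dot comes back as is rather than as NA.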

The same with dplyr:

library(dplyr)
my_data %>% mutate(c = sub('.*(.{2}\\..{3}).*', '\\1', a))

#                  a                 b      c
#1        12234.2434        12234.2434 34.243
#2            gg.546            gg.546 gg.546
#3 45657herg.6657767 45657herg.6657767 rg.665

CodePudding user response：

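Use str_remove_all() with regex lookarounds to delete everything except the wanted part. The pattern has two alternatives separated by |: .*(?=..\\.) matches all characters that precede the two characters in front of the dot, and (?<=\\....).* matches all characters that follow the dot and the three characters after it.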

library(dplyr)
library(stringr)
my_data %>% 
  mutate(c = str_remove_all(a, ".*(?=..\\.)|(?<=\\....).*"))
                  a                 b      c
1        12234.2434        12234.2434 34.243
2            gg.546            gg.546 gg.546
3 45657herg.6657767 45657herg.6657767 rg.665
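
Instead of removing everything around the target, the target can also be extracted directly. A minimal base R sketch of that idea, assuming the same two-before and three-after layout (the extra "nodot" element is added here only to show what happens without a match):

```r
a <- c("12234.2434", "gg.546", "45657herg.6657767", "nodot")

# Two characters, a literal dot, three characters
pattern <- ".{2}\\..{3}"

# regexpr() locates the first match in each string;
# regmatches() returns only the matched pieces
res <- regmatches(a, regexpr(pattern, a))
print(res)
length(res)  # shorter than length(a) because "nodot" has no match
```

Because regmatches() drops elements without a match, the result can be shorter than the input, so this form is safe inside mutate() only when every row is known to match.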

Regarding substr() (or str_sub() from stringr): it works once the position of the dot (.) is known. The code below finds that position with str_locate() and uses it as the index for the substring.

my_data %>% 
  mutate(i1 = str_locate(a, fixed("."))[, "start"],
         c = str_sub(a, i1 - 2, i1 + 3), i1 = NULL)
                  a                 b      c
1        12234.2434        12234.2434 34.243
2            gg.546            gg.546 gg.546
3 45657herg.6657767 45657herg.6657767 rg.665

The equivalent in base R would be

> i1 <- regexpr(".", my_data$a, fixed = TRUE)
> substr(my_data$a, i1 - 2, i1 + 3)
[1] "34.243" "gg.546" "rg.665"
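
The position-based approach assumes every string contains a dot. For a string without one, regexpr() returns -1, and the unguarded substr() call would not signal the miss. A hedged sketch of one possible guard (the "nodot" element and the choice of NA as the fallback are assumptions made for illustration):

```r
a <- c("12234.2434", "gg.546", "45657herg.6657767", "nodot")

# Position of the first dot; -1 where a string contains none
pos <- as.integer(regexpr(".", a, fixed = TRUE))

# Only keep the substring where a dot was actually found
res <- ifelse(pos > 0, substr(a, pos - 2, pos + 3), NA_character_)
print(res)
```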