I have something like this (the real one has 1,292,500 entries and 79 columns in total):
| Code |
|---|
| A005 |
| A200 |
| B300 |
| C001 |
| C999 |
| D000 |
| D352 |
| D480 |
| D501 |
| D999 |
| E480 |
I need to create a new column that groups some of the codes. I was using str_extract to extract codes for a single letter; for A000-A999, for example, I used:
dados$CODE_A <- str_extract(dados$CODE, "(?i)\\b(?:A)\\W*\\d+")
But now I need to extract codes in the ranges C000-C999 and D000-D499, like this:
| Code | CODE_X | CODE_Y |
|---|---|---|
| A005 | | |
| A200 | | |
| B300 | | |
| C001 | C001 | |
| C999 | C999 | |
| D000 | D000 | |
| D352 | D352 | |
| D480 | D480 | |
| D501 | D501 | |
| D999 | D999 | |
| E480 | | |
How do I do this?
CodePudding user response:
library(stringr)
library(dplyr)
library(tidyr)
tibble(x = c("A005", "A200", "B300", "C001", "C999", "D000", "D501")) %>%
  mutate(letter = str_extract(x, "[A-Z]"),
         numbers = as.numeric(str_extract(x, "\\d{3}")),
         answer = case_when(letter == "C" ~ x,
                            letter == "D" & numbers < 500 ~ x))
x letter numbers answer
<chr> <chr> <dbl> <chr>
1 A005 A 5 NA
2 A200 A 200 NA
3 B300 B 300 NA
4 C001 C 1 C001
5 C999 C 999 C999
6 D000 D 0 D000
7 D501 D 501 NA
You could then filter on !is.na(answer), for example.
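As a minimal sketch of that filtering step, assuming a small hypothetical stand-in for the real `dados` data frame with a `CODE` column (the sample values below are made up for illustration):

```r
library(dplyr)
library(stringr)

# Hypothetical sample standing in for the real `dados` data frame
dados <- tibble(CODE = c("A005", "C001", "C999", "D000", "D480", "D501", "E480"))

result <- dados %>%
  mutate(
    letter  = str_extract(CODE, "[A-Z]"),                 # leading letter
    numbers = as.numeric(str_extract(CODE, "\\d{3}")),    # three-digit part as a number
    answer  = case_when(
      letter == "C" ~ CODE,                               # C000-C999
      letter == "D" & numbers < 500 ~ CODE                # D000-D499
    )                                                     # everything else stays NA
  ) %>%
  filter(!is.na(answer))                                  # keep only matched rows

result$answer
```

Rows that match neither condition get NA from case_when, so the filter leaves only the codes in the two requested ranges.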
CodePudding user response:
You could also use regex directly like this instead:
# C000-C999
dados$CODE_C <- str_extract(dados$CODE, "C[0-9][0-9][0-9]")
# D000-D499
dados$CODE_D <- str_extract(dados$CODE, "D[0-4][0-9][0-9]")
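If both ranges should land in a single column (as CODE_X in the question seems to suggest), the two patterns can probably be combined with alternation. This is only a sketch: the `codes` vector is a hypothetical sample, and the `^`/`$` anchors are an added assumption that each cell holds exactly one code, so that longer strings such as "D1234" are not partially matched.

```r
library(stringr)

# Hypothetical sample values; in the real data this would be dados$CODE
codes <- c("A005", "C001", "C999", "D000", "D480", "D501", "D999", "E480")

# C followed by any three digits, or D followed by 0-4 and two more digits.
# Anchors require the whole string to be the code (assumption about the data).
pattern <- "^(C[0-9]{3}|D[0-4][0-9]{2})$"

code_x <- str_extract(codes, pattern)   # NA where the code is outside both ranges
code_x
```

With the real data, the same call could be assigned to dados$CODE_X. Dropping the anchors gives the same substring-matching behaviour as the two separate patterns above.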
