I'm sure there's a painfully easy solution to this, but given I'm new to R I'm a bit stumped.
I have a large dataset with the data structured accordingly.
v1
1 US2
2 L1_US24
3 US2_0
4 US24
5 US245
6 US245
7 US24 L
8 US3
What I'd like to do is create a categorisation column dependent upon the values in v1 like so:
v1 Cat
1 US2 1
2 L1_US24 2
3 US2_0 1
4 US24 2
5 US245 3
6 US245 3
7 US24 L 2
8 US3 4
Now if it was a binary choice it would be quite easy for I could use 'grepl' with 'ifelse' to assign the values accordingly. However I'm unsure whether that is an efficient way of doing it in a large dataset where the same values are contained in the columns.
Can anyone provide some advice on how to achieve the desired result?
CodePudding user response:
Please find a more general solution that should answer the different cases you encounter.
Reprex
Solution with Base R only
- Code
# Extract codes 'USXXX'
code <- regmatches(df$V1, regexpr("US\\d ", df$V1))
# Convert codes into numeric categories and add the in the 'Cat' column
df$Cat <- as.numeric(factor(code, levels = unique(code)))
- Output
df
#> V1 Cat
#> 1 US2 1
#> 2 L1_US24 2
#> 3 US2_0 1
#> 4 US24 2
#> 5 US245 3
#> 6 US245 3
#> 7 US24 L 2
#> 8 US3 4
Solution using stringr
- Code
# Extract codes 'USXXX'
code <- stringr::str_extract(df$V1, "US\\d ")
# Convert codes into numeric categories and add them in the 'Cat' column
df$Cat <- as.numeric(factor(code, levels = unique(code)))
- Output
df
#> V1 Cat
#> 1 US2 1
#> 2 L1_US24 2
#> 3 US2_0 1
#> 4 US24 2
#> 5 US245 3
#> 6 US245 3
#> 7 US24 L 2
#> 8 US3 4
- Data
df <- data.frame(V1 = c("US2", "L1_US24", "US2_0", "US24", "US245", "US245", "US24 L", "US3"))
Created on 2022-02-04 by the reprex package (v2.0.1)
CodePudding user response:
You can convert to factor and then to numeric:
df$Cat <- as.numeric(factor(df$v1, levels = unique(df$v1)))
df
v1 Cat
1 US2 1
2 US24 2
3 US2 1
4 US24 2
5 US245 3
6 US245 3
7 US243 4
8 US3 5
