I'm sure there's a painfully easy solution to this, but given I'm new to R I'm a bit stumped.

I have a large dataset structured as follows:

  v1 
1  US2  
2  L1_US24  
3  US2_0   
4  US24
5  US245
6  US245
7  US24 L
8  US3

What I'd like to do is create a categorisation column dependent upon the values in v1 like so:

  v1          Cat 
1  US2         1  
2  L1_US24     2  
3  US2_0       1  
4  US24        2  
5  US245       3  
6  US245       3
7  US24 L      2
8  US3         4

If it were a binary choice, this would be quite easy: I could use 'grepl' with 'ifelse' to assign the values accordingly. However, I'm unsure whether that is an efficient way of doing it in a large dataset where the same values appear repeatedly in the column.

Can anyone provide some advice on how to achieve the desired result?

CodePudding user response：

Please find below a more general solution that should handle the different cases you encounter.

Reprex

Solution with `Base R` only

Code

# Extract codes 'USXXX'
code <- regmatches(df$V1, regexpr("US\\d+", df$V1))

# Convert codes into numeric categories and add them in the 'Cat' column
df$Cat <- as.numeric(factor(code, levels = unique(code)))

Output

df
#>        V1 Cat
#> 1     US2   1
#> 2 L1_US24   2
#> 3   US2_0   1
#> 4    US24   2
#> 5   US245   3
#> 6   US245   3
#> 7  US24 L   2
#> 8     US3   4
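
One caveat worth sketching: `regmatches()` combined with `regexpr()` returns only the strings that matched, so if any row lacked a `US<digits>` code the result would be shorter than the data frame and the assignment to `df$Cat` would fail. The snippet below is a minimal sketch of one way to guard against that, assuming unmatched rows should get `NA`. The data frame `df2` and its `"none"` row are hypothetical and only there for illustration.

```r
# Hypothetical data: the last row contains no 'US<digits>' code
df2 <- data.frame(V1 = c("US2", "L1_US24", "US2_0", "none"))

# regexpr() returns -1 for elements without a match
m <- regexpr("US\\d+", df2$V1)

# Start with NA for every row, then fill in only the rows that matched
code <- rep(NA_character_, nrow(df2))
code[m != -1] <- regmatches(df2$V1, m)

# Number the codes in order of first appearance; unmatched rows stay NA
df2$Cat <- as.numeric(factor(code, levels = unique(code[!is.na(code)])))
```

With this approach the `Cat` column always has one entry per row, and rows without a code are easy to spot afterwards with `is.na(df2$Cat)`.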

Solution using `stringr`

Code

# Extract codes 'USXXX'
code <- stringr::str_extract(df$V1, "US\\d+")

# Convert codes into numeric categories and add them in the 'Cat' column
df$Cat <- as.numeric(factor(code, levels = unique(code)))

Output

df
#>        V1 Cat
#> 1     US2   1
#> 2 L1_US24   2
#> 3   US2_0   1
#> 4    US24   2
#> 5   US245   3
#> 6   US245   3
#> 7  US24 L   2
#> 8     US3   4

Data

df <- data.frame(V1 = c("US2", "L1_US24", "US2_0", "US24", "US245", "US245", "US24 L", "US3"))

Created on 2022-02-04 by the reprex package (v2.0.1)

CodePudding user response：

You can convert to factor and then to numeric:

df$Cat <- as.numeric(factor(df$v1, levels = unique(df$v1)))

df
     v1 Cat
1   US2   1
2  US24   2
3   US2   1
4  US24   2
5 US245   3
6 US245   3
7 US243   4
8   US3   5
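
To make the role of `levels = unique(...)` clearer, here is a small sketch using the same values as in the output above, held in a standalone vector (the names `by_appearance`, `alphabetical` and `by_match` are just for illustration). It assumes you want categories numbered in order of first appearance; without the `levels` argument, `factor()` sorts the levels, so the numbering can differ. `match()` against the unique values is an equivalent alternative that skips the factor step.

```r
# Same values as in the output above, as a plain character vector
v1 <- c("US2", "US24", "US2", "US24", "US245", "US245", "US243", "US3")

# Levels in order of first appearance, so IDs follow the data order
by_appearance <- as.numeric(factor(v1, levels = unique(v1)))

# Default levels are sorted, so the IDs follow sort order instead
alphabetical <- as.numeric(factor(v1))

# match() against the unique values gives the same IDs as by_appearance
by_match <- match(v1, unique(v1))
```

Either `by_appearance` or `by_match` can be assigned to `df$Cat`; which one reads better is mostly a matter of taste.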
