If I have a df and I want to separate player by sep="_". Is it a smart way that I can do this without manually check how many col I will need? For my example, it is easy to tell I need to separate player into 3 new cols. What is I have 100 rows. Is it a way to tell how many cols I need to separate player into? or is it a way to split player without knowing this info?
df <- data.frame(player=c('John_Wall', 'Dirk_Nowitzki', 'Steve_Nash_try'),
points=c(22, 29, 18),
assists=c(8, 4, 15))
## current method:
df %>% separate(player, c('v1', 'v2', 'v3'), sep ="_")
CodePudding user response:
We may use str_count to find the count of _, get the max count, convert to sequence, paste (str_c) with 'v' to create the column names. Add remove = FALSE, if the original column needs to kept
library(tidyr)
library(dplyr)
library(stringr)
df %>%
separate(player, str_c('v',
seq_len(max(str_count(.$player, '_') 1))), sep = "_", fill = "right")
-output
v1 v2 v3 points assists
1 John Wall <NA> 22 8
2 Dirk Nowitzki <NA> 29 4
3 Steve Nash try 18 15
Or instead of using separate, try with cSplit, which does automatically detect and create those columns
library(splitstackshape)
cSplit(df, 'player', sep = '_')
points assists player_1 player_2 player_3
<num> <num> <char> <char> <char>
1: 22 8 John Wall <NA>
2: 29 4 Dirk Nowitzki <NA>
3: 18 15 Steve Nash try
If we want to use str_extract_all, it returns a list column, which can be converted to columns with unnest_wider
library(tidyr)
df %>%
mutate(v = str_extract_all(player, "[^_] ")) %>%
unnest_wider(v, names_sep = "_")
# A tibble: 3 × 6
player points assists v_1 v_2 v_3
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 John_Wall 22 8 John Wall <NA>
2 Dirk_Nowitzki 29 4 Dirk Nowitzki <NA>
3 Steve_Nash_try 18 15 Steve Nash try
Or using base R
cbind(df, read.table(text = df$player, sep = "_", fill = TRUE,
header = FALSE, na.strings = ""))
player points assists V1 V2 V3
1 John_Wall 22 8 John Wall <NA>
2 Dirk_Nowitzki 29 4 Dirk Nowitzki <NA>
3 Steve_Nash_try 18 15 Steve Nash try
CodePudding user response:
Without packages you could adapt the `length<-` by maximum lengths.
strsplit(df$player, '_') |>
{\(.) t(sapply(., `length<-`, max(lengths(.))))} () |>
cbind(df[setdiff(names(df), 'player')])
# 1 2 3 points assists
# 1 John Wall <NA> 22 8
# 2 Dirk Nowitzki <NA> 29 4
# 3 Steve Nash try 18 15
