Home > Enterprise >  Issue while splitting a column of a data frame into different columns
Issue while splitting a column of a data frame into different columns

Time:01-10

This the sample of the dataframe I'm playing with.

structure(list(Company.Name = c("Ample Softech System", "Ziff Davis LLC", 
"IIM Kozhikkode", "Perennial", "Irupar Sociedad Cooperativa", 
"md", ""), Job.Title = c("Data Analyst", "Data Analyst", "Data Analyst", 
"Data Analyst", "Data Analyst", "Data Analyst", "Data Analyst"
), Salaries.Reported = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), Location = c("Pune", 
"Pune", "Pune", "Pune", "Pune", "Pune", "Pune"), Salary = c("₹35,563/mo", 
"₹5,21,474/yr", "₹7,64,702/yr", "₹16,123/mo", "₹6,04,401/yr", 
"AFN 1,56,179/yr", "₹23,500/mo")), row.names = 2274:2280, class = "data.frame")

The column Salary contains figures in a pattern of (Currency_symbol Figure periodicity) eg:₹35,563/mo

I have been trying to separate the pattern into different columns. I used the following code.

smpl = separate(sample, col = Salary, into = c( "Currency_symbol", "Salary_copy"), sep = 1, remove = TRUE, convert =  TRUE) #separates currency_symbol into separate column
smpl
smpl2 = separate(smpl, col = Salary_copy, into = c('Salary_copy', 'Periodicity'), sep = -3, remove = TRUE, convert = TRUE) # separates periodicity to separate column
smpl2

The issue I'm facing is that one row contains 3 characters as Currency-symbol (AFN), while others are single characters.

So these particular lines of code mentioned above are not able to separate the pattern into respective columns for that particular row.

If I change the index of sep argument of the code, all other rows will be affected. How can I solve this particular issue?

CodePudding user response:

A possible solution:

library(tidyverse)

df %>% 
  separate(Salary, sep="((?<=^\\D)(?=\\d))|((?<=\\D)\\s)", into=str_c("col", 1:2)) %>% 
  separate(col2, sep = "/", into = str_c("col",2:3))

#>                     Company.Name    Job.Title Salaries.Reported Location col1
#> 2274        Ample Softech System Data Analyst                 1     Pune    ₹
#> 2275              Ziff Davis LLC Data Analyst                 1     Pune    ₹
#> 2276              IIM Kozhikkode Data Analyst                 1     Pune    ₹
#> 2277                   Perennial Data Analyst                 1     Pune    ₹
#> 2278 Irupar Sociedad Cooperativa Data Analyst                 1     Pune    ₹
#> 2279                          md Data Analyst                 1     Pune  AFN
#> 2280                             Data Analyst                 1     Pune    ₹
#>          col2 col3
#> 2274   35,563   mo
#> 2275 5,21,474   yr
#> 2276 7,64,702   yr
#> 2277   16,123   mo
#> 2278 6,04,401   yr
#> 2279 1,56,179   yr
#> 2280   23,500   mo

CodePudding user response:

Another solution using extract and a simpler regex. An additional step trims whitespace and removes the commas from the salary amounts.

df2 <- df %>% 
  extract(Salary, c('currency', 'amount', 'period'), '^(\\D )([0-9,] )/(.*)') %>% 
  mutate(
    currency = gsub(' ', '', currency),
    amount = as.numeric(gsub(',', '', amount))
  )

                    Company.Name    Job.Title Salaries.Reported Location currency amount period
2274        Ample Softech System Data Analyst                 1     Pune        ₹  35563     mo
2275              Ziff Davis LLC Data Analyst                 1     Pune        ₹ 521474     yr
2276              IIM Kozhikkode Data Analyst                 1     Pune        ₹ 764702     yr
2277                   Perennial Data Analyst                 1     Pune        ₹  16123     mo
2278 Irupar Sociedad Cooperativa Data Analyst                 1     Pune        ₹ 604401     yr
2279                          md Data Analyst                 1     Pune      AFN 156179     yr
2280                             Data Analyst                 1     Pune        ₹  23500     mo
  •  Tags:  
  • Related