R loops for basic data cleaning

Time:01-16

I'm a bit new to R and programming in general. I have to clean a lot of data, and often it's a similar issue in multiple columns. So, I would like to use a loop, rather than writing out each line of code. I have data similar to this:

black <- c("1.33%", "9.22%", "10.71%")
white <- c("5.23%", "8.12%", "11.72%")
day <- c("Wednesday", "Thursday", "Friday")
blue <- c("2.21%", "1.12%", "8.79%")
df <- data.frame(black, white, day, blue)

This gets me a dataframe like this:

   black  white       day  blue
1  1.33%  5.23% Wednesday 2.21%
2  9.22%  8.12%  Thursday 1.12%
3 10.71% 11.72%    Friday 8.79%

I have read that there are 'for' loops, and also that the apply() family work like loops in R too... How would I loop through the variables black, white and blue (but not day) so that I can:

  • remove the % sign
  • change type from char to numeric
  • round to 1 decimal place?

Like I say, I would like to know how to write this as both a for loop and apply. To remove the % sign I have used mutate and gsub before...

Thanks for your suggestions, particularly helping me to write legible code! Best, Roger

CodePudding user response:

Here is one tidy way using dplyr:

library(dplyr)

clean_my_data<-function(input){
   gsub("%", "", input) %>% as.numeric() %>% round(1)
}

df_new<-df %>%
  mutate(across(c(black,white,blue), clean_my_data))

df_new
#>   black white       day blue
#> 1   1.3   5.2 Wednesday  2.2
#> 2   9.2   8.1  Thursday  1.1
#> 3  10.7  11.7    Friday  8.8

Created on 2022-01-15 by the reprex package (v2.0.1)
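
Since the question also asks for a 'for' loop, here is a minimal base R sketch of the same three steps. It rebuilds the example data from the question; the name pct_cols is only an illustrative choice, and the use of fixed = TRUE assumes the "%" should be matched as a literal character.

```r
# Rebuild the example data from the question
df <- data.frame(
  black = c("1.33%", "9.22%", "10.71%"),
  white = c("5.23%", "8.12%", "11.72%"),
  day   = c("Wednesday", "Thursday", "Friday"),
  blue  = c("2.21%", "1.12%", "8.79%")
)

# Columns to clean ("day" is deliberately left out); the name is illustrative
pct_cols <- c("black", "white", "blue")

for (col in pct_cols) {
  # Remove the literal "%", convert to numeric, round to 1 decimal place
  df[[col]] <- round(as.numeric(gsub("%", "", df[[col]], fixed = TRUE)), 1)
}

df
```

Looping over column names rather than positions keeps the loop valid if the column order changes later.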

CodePudding user response:

This is a quick and dirty way of doing it, and it can be improved!

First you need a function that does the job; then you apply that function (or use a loop, it is up to you).

clean_color <- function(x) {
    # Drop the last character (the "%"). This breaks on values with
    # trailing whitespace, such as "1.38% ".
    without_percent <- substr(x,
                              start = 1,
                              stop = nchar(x) - 1)
    # Convert to numeric and round to 1 decimal place
    round(as.numeric(without_percent), 1)
}

Then you apply this function:

sapply(df[,c(1:2,4)], clean_color)
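
Note that the sapply() call returns a numeric matrix and leaves df itself unchanged. A possible way to write the cleaned values back into the data frame is to use lapply() and assign the result to the same columns. The sketch below is one option, not the only one: clean_pct and pct_cols are illustrative names, and the function uses trimws() plus a regular expression so that values with stray spaces such as "1.38% " are handled too.

```r
# Rebuild the example data from the question
df <- data.frame(
  black = c("1.33%", "9.22%", "10.71%"),
  white = c("5.23%", "8.12%", "11.72%"),
  day   = c("Wednesday", "Thursday", "Friday"),
  blue  = c("2.21%", "1.12%", "8.79%")
)

# Illustrative helper: trim whitespace, drop one trailing "%",
# convert to numeric, round to 1 decimal place
clean_pct <- function(x) {
  round(as.numeric(sub("%$", "", trimws(x))), 1)
}

# Columns to clean, selected by name instead of position
pct_cols <- c("black", "white", "blue")

# lapply() returns a list of columns, which can be assigned back in place
df[pct_cols] <- lapply(df[pct_cols], clean_pct)

str(df)
```

Selecting columns by name is a little more verbose than c(1:2, 4), but it does not silently pick the wrong columns if the data frame is reordered.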