I have a large data frame in R with over 200 mostly character variables that I would like to convert to factors. I have prepared all levels and labels in a separate data frame. For a given variable Var1, the corresponding levels and labels are stored in Var1_v and Var1_b; for example, for the variable Gender they are named Gender_v and Gender_b.
Here is an example of my data:
df <- data.frame (Gender = c("2","2","1","2"),
AgeG = c("3","1","4","2"))
fct <- data.frame (Gender_v = c("1", "2"),
Gender_b = c("Male", "Female"),
AgeG_v = c("1","2","3","4"),
AgeG_b = c("<25","25-60","65-80",">80"))
df$Gender <- factor(df$Gender, levels = fct$Gender_v, labels = fct$Gender_b, exclude = NULL)
df$AgeG <- factor(df$AgeG, levels = fct$AgeG_v, labels = fct$AgeG_b, exclude = NULL)
Is there a way to automate the process, so that the factors (levels and labels) are applied to the corresponding variables without my having to do every single one individually?
I think it could be done with a function, probably using pmap.
My goal is to minimize the effort needed for this process. Is there a better way to prepare the labels and levels as well?
Help is much appreciated.
CodePudding user response:
I solved it with a simple refactoring of your code, automating it through a loop. The more variables you have, the more time this saves. I believe the expression fct[[paste0(names(df[i]),"_v")]] could be moved into a small function to make it look even better.
df <- data.frame(Gender = c("2","2","1","2"),
                 AgeG = c("3","1","4","2"))

fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v = c("1","2","3","4"),
                  AgeG_b = c("<25","25-60","65-80",">80"))

# For each column, look up its levels (<name>_v) and labels (<name>_b) in fct
for (i in seq_along(df)) {
  le <- fct[[paste0(names(df)[i], "_v")]]
  la <- fct[[paste0(names(df)[i], "_b")]]
  df[, i] <- factor(df[, i], levels = le, labels = la, exclude = NULL)
}

df
#>   Gender  AgeG
#> 1 Female 65-80
#> 2 Female   <25
#> 3   Male   >80
#> 4 Female 25-60
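The "small function" refactoring mentioned above could look roughly like this. It is only a sketch: the helper name get_fct and its arguments are made up for illustration, and it assumes the _v / _b naming scheme from the question.

```r
df <- data.frame(Gender = c("2", "2", "1", "2"),
                 AgeG = c("3", "1", "4", "2"))

fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v = c("1", "2", "3", "4"),
                  AgeG_b = c("<25", "25-60", "65-80", ">80"))

# Hypothetical helper: fetch the lookup column "<var><suffix>" from `lookup`
get_fct <- function(lookup, var, suffix) {
  lookup[[paste0(var, suffix)]]
}

# Same loop as above, iterating over column names instead of indices
for (var in names(df)) {
  df[[var]] <- factor(df[[var]],
                      levels = get_fct(fct, var, "_v"),
                      labels = get_fct(fct, var, "_b"),
                      exclude = NULL)
}
```

Looping over names(df) rather than 1:ncol(df) also avoids repeating names(df[i]) in every lookup.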
Edit: Here is a version with an if condition added, so that only columns whose names end in _f are converted:
library(stringr)  # provides str_remove()

df <- data.frame(Gender_f = c("2","2","1","2"),
                 AgeG_f = c("3","1","4","2"),
                 AgeN = c(70,15,96,30))

fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v = c("1","2","3","4"),
                  AgeG_b = c("<25","25-60","65-80",">80"))

for (i in seq_along(df)) {
  # Only touch columns flagged with the "_f" suffix; AgeN stays numeric
  if (endsWith(names(df)[i], "_f")) {
    # Strip only the trailing "_f" to get the base name used in fct
    name <- str_remove(names(df)[i], "_f$")
    le <- fct[[paste0(name, "_v")]]
    la <- fct[[paste0(name, "_b")]]
    df[, i] <- factor(df[, i], levels = le, labels = la, exclude = NULL)
  }
}

df
#>   Gender_f AgeG_f AgeN
#> 1   Female  65-80   70
#> 2   Female    <25   15
#> 3     Male    >80   96
#> 4   Female  25-60   30
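If you would rather avoid the explicit loop and the stringr dependency, the same idea can be sketched in base R with lapply(). This is just one possible variant, assuming the same _f, _v and _b suffixes as above; the variable name f_cols is only illustrative.

```r
df <- data.frame(Gender_f = c("2", "2", "1", "2"),
                 AgeG_f = c("3", "1", "4", "2"),
                 AgeN = c(70, 15, 96, 30))

fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v = c("1", "2", "3", "4"),
                  AgeG_b = c("<25", "25-60", "65-80", ">80"))

# Names of the columns to convert (those ending in "_f")
f_cols <- grep("_f$", names(df), value = TRUE)

# Build one factor per selected column and assign them all back at once
df[f_cols] <- lapply(f_cols, function(col) {
  base <- sub("_f$", "", col)  # "Gender_f" becomes "Gender"
  factor(df[[col]],
         levels = fct[[paste0(base, "_v")]],
         labels = fct[[paste0(base, "_b")]],
         exclude = NULL)
})
```

Columns that do not match the suffix, such as AgeN, are left untouched.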
CodePudding user response:
A data frame is not really an appropriate data structure for storing the factor level definitions: there’s no reason to expect all factors to have the same number of levels. Rather, I’d just use a plain list, and store the level information more compactly as named vectors, along these lines:
df <- data.frame(
Gender = c("2", "2", "1", "2"),
AgeG = c("3", "1", "4", "2")
)
value_labels <- list(
Gender = c("Male" = 1, "Female" = 2),
AgeG = c("<25" = 1, "25-60" = 2, "65-80" = 3, ">80" = 4)
)
Then you can make a function that uses that data structure to make factors in a data frame:
make_factors <- function(data, value_labels) {
for (var in names(value_labels)) {
if (var %in% colnames(data)) {
vl <- value_labels[[var]]
data[[var]] <- factor(
data[[var]],
levels = unname(vl),
labels = names(vl)
)
}
}
data
}
make_factors(df, value_labels)
#> Gender AgeG
#> 1 Female 65-80
#> 2 Female <25
#> 3 Male >80
#> 4 Female 25-60
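If the definitions already live in a data frame like fct from the question, they can be converted into such a list rather than retyped. The following is a rough sketch: the function name fct_to_value_labels and its suffix arguments are hypothetical, and it assumes every <name>_v column has a matching <name>_b column. Note that data.frame() recycles shorter columns when the lengths are whole multiples, so in the example fct the Gender pairs appear twice; unique() drops the repeats.

```r
fct <- data.frame(Gender_v = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v = c("1", "2", "3", "4"),
                  AgeG_b = c("<25", "25-60", "65-80", ">80"))

# Hypothetical converter: data frame of <name>_v / <name>_b columns
# -> named list of named vectors (names = labels, values = levels)
fct_to_value_labels <- function(fct, level_suffix = "_v", label_suffix = "_b") {
  level_cols <- grep(paste0(level_suffix, "$"), names(fct), value = TRUE)
  vars <- sub(paste0(level_suffix, "$"), "", level_cols)
  out <- lapply(vars, function(v) {
    # Pair levels with labels and drop rows repeated by recycling
    pairs <- unique(data.frame(
      level = fct[[paste0(v, level_suffix)]],
      label = fct[[paste0(v, label_suffix)]],
      stringsAsFactors = FALSE
    ))
    setNames(pairs$level, pairs$label)
  })
  setNames(out, vars)
}

value_labels <- fct_to_value_labels(fct)
```

The levels come out as character strings here rather than numbers, which matches the character columns in df; the resulting list can be passed to make_factors() as is.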
