Home > Software design >  Programmatically rename dataframe columns using abbreviations derived from labels
Programmatically rename dataframe columns using abbreviations derived from labels

Time:01-14

I have a labelled dataframe with arbitrary column names, and I would like to rename the columns in a non-arbitrary way using the labels.

Here is a simplified version of the dataframe:

library(labelled)
library(tidyverse)

df <- tibble(id = "a", B101 = 1, B102 = 2, B103 = 3, B104 = .1)

For the columns to be renamed, each label has two to three components (delimited by a colon single space):

var_label(df) <-
  list(
    id = "ID",
    B101 = "Estimates: Less than $10,000: Less than 20.0 percent", 
    B102 = "Estimates: $10,000 to $19,999: 20.0 to 24.9 percent",
    B103 = "Estimates: $10,000 to $19,999",
    B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
  )

So, the label for each column may have two components (e.g., B103) or three components (e.g., B102). If the label has none of the three components (e.g., id), then the column need not be renamed.

I would like to abbreviate the components of the labels as follows:

  • Component 1
    • "Estimates: " -> e
    • "Margins of error: " -> m
  • Component 2
    • "Less than $10,000: " or "Less than $10,000" -> i0to9
    • "$10,000 to $19,999: " or "$10,000 to $19,999" -> i10to19
  • Component 3
    • "Less than 20.0 percent" -> p0to19
    • "20.0 to 24.9 percent" -> p20to24

Then, I would like rename each variable by concatenating the components, which would be separated by underscores. Obviously, it's trivial to do this by hand in an ad hoc way:

df %>% 
  rename(e_i0to9_p0to19 = B101,
         e_i10to19_p20to24 = B102,
         e_i10to19 = B103, 
         m_i0to9_p0to19 = B104)

But how can I accomplish this programmatically using tidyverse principles and packages?

CodePudding user response:

Here is a slightly lengthy solution, with the goal of being highly flexible for any changes to the structure or value of your mappings. I recommend the other great answers already given here if your problem is a one-time-only sort of thing. I go over the benefits of this solution at the end.

First define your mappings in a table - this allows you to easily change them in the future or add new ones if necessary:

library(tidyverse)
labels = list(
  B101 = "Estimates: Less than $10,000: Less than 20.0 percent", 
  B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
  B103 = "Estimates: $10,000 to $19,999",
  B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
)

components = tribble(
  ~ id, ~ name, ~ new_name,
  1, "Estimates", "e",
  1, "Margins of error", "m",
  2, "Less than $10,000", "i0to9",
  2, "$10,000 to $19,999", "i10to19",
  3, "Less than 20.0 percent", "p0to19",
  3, "20.0 to 24.9 percent", "p20to24"
)

From this we can generate a regex:

component_regex = components %>%
  split(.$id) %>%
  # Fix dollar signs
  map(~ str_replace_all(.x$name, "\\$", "\\\\$")) %>%
  # Include a regex condition for the possibly of there being a colon
  map(~ map_chr(.x, paste0, "[\\:]?")) %>%
  map_chr(paste, collapse = "|") %>%
  # Some components may not be present
  paste0("(", ., ")?") %>%
  # Spaces in between each component
  paste(collapse = "[ ]?")

Here is the regex:

component_regex
#> [1] "(Estimates[\\:]?|Margins of error[\\:]?)?[ ]?(Less than \\$10,000[\\:]?|\\$10,000 to \\$19,999[\\:]?)?[ ]?(Less than 20.0 percent[\\:]?|20.0 to 24.9 percent[\\:]?)?"

Now we extract the components from each label to create a data frame:

data_labels = labels %>% 
  map(str_match, pattern = component_regex) %>%
  map(as.data.frame) %>% 
  reduce(bind_rows) %>%
  select(-V1) %>%
  map_df(str_replace, pattern = ":$", replacement = "") %>%
  mutate(col_name = names(labels))

# A tibble: 4 x 4
  V2               V3                 V4                     col_name
  <chr>            <chr>              <chr>                  <chr>   
1 Estimates        Less than $10,000  Less than 20.0 percent B101    
2 Estimates        Less than $10,000  20.0 to 24.9 percent   B102    
3 Estimates        $10,000 to $19,999 NA                     B103    
4 Margins of error Less than $10,000  Less than 20.0 percent B104    

Now we transform this table so that we could join the components table from before and extract the new name. I'll first show the partial result so you can see what's going on:

data_labels %>%
  pivot_longer(-col_name, names_to = "id") %>%
  # Generate the component id
  mutate(id = as.numeric(str_extract_all(id, "[0-9] ")) - 1) %>%
  inner_join(components, by = c("id", "value" = "name"))

# A tibble: 11 x 4
   col_name    id value                  new_name
   <chr>    <dbl> <chr>                  <chr>   
 1 B101         1 Estimates              e       
 2 B101         2 Less than $10,000      i0to9   
 3 B101         3 Less than 20.0 percent p0to19  
 4 B102         1 Estimates              e       
 5 B102         2 Less than $10,000      i0to9   
 6 B102         3 20.0 to 24.9 percent   p20to24 
 7 B103         1 Estimates              e       
 8 B103         2 $10,000 to $19,999     i10to19 
 9 B104         1 Margins of error       m       
10 B104         2 Less than $10,000      i0to9   
11 B104         3 Less than 20.0 percent p0to19  

Note that the inner_join() makes it so that cases where there was no third component are omitted from the data. Here's how to finish it:

new_names = data_labels %>%
  pivot_longer(-col_name, names_to = "id") %>%
  # Generate the component id
  mutate(id = as.numeric(str_extract_all(id, "[0-9] ")) - 1) %>%
  inner_join(components, by = c("id", "value" = "name")) %>%
  group_by(col_name) %>%
  summarise(final_name = paste(new_name[sort(id)], collapse = "_"))

# A tibble: 4 x 2
  col_name final_name     
  <chr>    <chr>          
1 B101     e_i0to9_p0to19 
2 B102     e_i0to9_p20to24
3 B103     e_i10to19      
4 B104     m_i0to9_p0to19 

We now just replace the names with the new ones:

old_names = intersect(names(df), new_names$col_name)
df %>% 
  rename_with(
    ~ new_names$final_name[which(old_names == .x)], 
    .cols = all_of(old_names)
  )
# A tibble: 1 x 5
  id    e_i0to9_p0to19 e_i0to9_p20to24 e_i10to19 m_i0to9_p0to19
  <chr>          <dbl>           <dbl>     <dbl>          <dbl>
1 a                  1               2         3            0.1

This solution may seem lengthy, but here are some benefits of it:

  • The mapping can be stored in a CSV file and modified outside of the code. That is, the code does not actually depend on your mapping.
  • You can add or remove parts of each component.
  • It works whether any of the components are missing.
  • It works with more than three components.

CodePudding user response:

We could modify the attributes of the dataframe directly with var_label(df) from labelled package, as it is already used.

You will get a list. Then you can iterate over the list with map. I am not very happy about repeating map(., ... but at this time I have no idea how to apply the DRY (do not repeat yourself) idea:

library(tidyverse)
library(labelled)

colnames(df) <- var_label(df) %>% 
  map(., ~str_replace(., "Estimates:", "e")[[1]]) %>% 
  map(., ~str_replace(., "Margins of error:", "m")[[1]]) %>% 
  map(., ~str_replace(., "Less than \\$10,000\\:", "i0to9")[[1]]) %>% 
  map(., ~str_replace(., "\\$10,000 to \\$19,999", "i10to19")[[1]]) %>% 
  map(., ~str_replace(., "Less than 20.0 percent", "p0to19")[[1]]) %>% 
  map(., ~str_replace(., "20.0 to 24.9 percent", "p20to24")[[1]]) %>% 
  map(., ~str_replace_all(., " ", "_")[[1]]) %>% 
  map(., ~str_replace_all(., ":", "")[[1]])

  e_i0to9_p0to19 e_i0to9_p20to24 e_i10to19 m_i0to9_p0to19
           <dbl>           <dbl>     <dbl>          <dbl>
1              1               2         3            0.1

CodePudding user response:

Here is one way of doing it:

list(
    id = "ID",
    B101 = "Estimates: Less than $10,000: Less than 20.0 percent", 
    B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
    B103 = "Estimates: $10,000 to $19,999",
    B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
) %>% 
    stringr::str_replace_all(
        c('Estimates: ' = 'e_', "Margins of error: " = 'm', "Less than \\$10,000: " = 'i0to9',
          "\\$10,000 to \\$19,999[:]?[ ]?" = 'i10to19', "Less than 20.0 percent" = 'p0to19', 
          "20.0 to 24.9 percent" = 'p20to24')
    ) %>% 
    setNames(names(df), .) %>% 
    rename(df, .)

Output:

# A tibble: 1 x 5
  ID    e_i0to9p0to19 e_i0to9p20to24 e_i10to19 mi0to9p0to19
  <chr>         <dbl>          <dbl>     <dbl>        <dbl>
1 a                 1              2         3          0.1

CodePudding user response:

df %>%
  set_names(var_label(.) %>%
  unlist() %>%
  str_replace_all(c("Estimates: " = 'e',
      "Margins of error:" = "m",
      "Less than \\$10,000:?" = "i0to9",
      "\\$10,000 to \\$19,999" ="i10to19",
      "Less than 20.0 percent" = "p0to19",
      "20.0 to 24.9 percent" = "p20to24",
      ' ' = '_')))
# A tibble: 1 x 5
  ID    ei0to9_p0to19 ei0to9_p20to24 ei10to19 m_i0to9_p0to19
  <chr>         <dbl>          <dbl>    <dbl>          <dbl>
1 a                 1              2        3            0.1
  •  Tags:  
  • Related