Home > Software design >  Clarification on Reshaping DataFrame from LONG to WIDE in R with Gather/Spread
Clarification on Reshaping DataFrame from LONG to WIDE in R with Gather/Spread

Time:02-01

I am sorry to come back on a topic that has several threads on Stack, but I am trying to Reshape a dataset from LONG to WIDE, using Tidyverse, and either the Gather/Spread function as well as the pivot_wider function, and I am lost. Here it is a sample of the subset I am using for test

    structure(list(pid = structure(c(1L, 1L, 1L, 1L, 1L, 2L), .Label = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", 
"25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", 
"36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", 
"47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", 
"58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", 
"69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", 
"80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", 
"91", "92", "93", "94", "95", "96", "97", "98", "99", "100", 
"101", "102", "103", "104", "105", "106", "107", "108", "109", 
"110", "111", "112", "113", "114", "115", "116", "117", "118", 
"119", "120", "121", "122", "123", "124", "125", "126", "127", 
"128", "129", "130", "131", "132", "133", "134", "135", "136", 
"137", "138", "139", "140", "141", "142", "143", "144", "145", 
"146", "147", "148", "149", "150", "151", "152", "153", "154", 
"155", "156", "157", "158", "159", "160", "161", "162", "163", 
"164", "165", "166", "167", "168", "169", "170", "171", "172", 
"173", "174", "175", "176", "177", "178", "179", "180", "181", 
"182", "183", "184", "185", "186", "187", "188", "189", "190", 
"191", "192", "193", "194", "195", "196", "197", "198", "199", 
"200", "201", "202", "203", "204", "205", "206", "207", "208", 
"209", "210", "211", "212", "213", "214", "215", "216"), class = "factor"), 
    timewave = structure(c(1L, 2L, 3L, 4L, 5L, 1L), .Label = c("1", 
    "2", "3", "4", "5", "6", "7", "8"), class = "factor"), dev_icd = structure(c(1L, 
    1L, 1L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"), 
    lab_bnp = c(388, 199, 387.5, 318, 154, 949.4)), row.names = c(NA, 
6L), class = "data.frame")

And here there the 2 commands I came up to

test.wide2 <- test.long2 %>%
          pivot_wider(id_cols = pid, 
                      names_from = timewave, 
                      values_from = c(dev_icd, lab_bnp), 
                      names_sep = "")

Or also

test.wide <- test.long2 %>%
         group_by(pid) %>%
         gather("dev_icd", "lab_bnp", 
                key = variable, value = number ) %>%
         unite(combi, variable, timewave) %>%
         spread(combi, number)

Neither is working as I would expect, I get a lot of NA or NULL values and do not understand what is my mistake and the correct procedure. Any help not only in fixing the issue but mainly in understanding the Reshaping logic/philosophy would be greatly appreciated

CodePudding user response:

Thanks to help from Merjin van Tiborg, I have finally nailed down the issue. The correct command to have PID on rows and columns for dev-icd and lab_bnp repeated by timewave number is as follows:

    test.wide <- hf.longsmall %>%
  pivot_wider(id_cols = c("pid", "timewave"), 
              names_from = timewave, 
              values_from = c(dev_icd, lab_bnp), 
              names_sep = "_t")

That is equivalent to the following

    test.wide1  <- hf.longsmall %>% 
  group_by(pid, timewave) %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_wider(names_from = timewave, 
                     values_from = c(dev_icd, lab_bnp), 
                     names_sep = "_t") %>%
  select(-row)

I have got the following warning --> "Values in values_from are not uniquely identified; output will contain list-cols", and that was due to a real (and dangerous) mistake of duplicated PID during data entry. Anyhow, I could understand the issue only using the group_by option reported above.

Thank you all for the patience

CodePudding user response:

You need to be more explicit on what you want, we can only assume. You cannot expect any values that are not present in your widest format. I take a guess here in that you want something like this.

test.long2 %>%
    pivot_wider(id_cols = c("pid", "timewave"), 
    names_from = pid, 
    values_from = c(dev_icd, lab_bnp), 
    names_sep = "_pid")

# A tibble: 5 x 5
  timewave dev_icd_pid1 dev_icd_pid2 lab_bnp_pid1 lab_bnp_pid2
  <fct>    <fct>        <fct>               <dbl>        <dbl>
1 1        No           Yes                  388          949.
2 2        No           NA                   199           NA 
3 3        No           NA                   388.          NA 
4 4        No           NA                   318           NA 
5 5        No           NA                   154           NA 
  •  Tags:  
  • Related