I need to reshape a dataframe csf_antibio (sample below) from long to wide format, separating all values per patient by row.
study_id,proanbx_dt,proanbx_tm,name,othername,route,dosage,units,doses,freq
CHLA_0001,2021-07-22,20:01:00,ceftriaxone,,IV,1250,mg,4,13
CHLA_0001,2021-07-22,20:19:00,metronidazole,,IV,250,mg,5,9
CHLA_0001,2021-07-22,23:17:00,vancomycin,,IV,350,mg,3,6
CHLA_0001,2021-08-09,19:34:00,cefazolin,,IV,738,mg,1,8
CHLA_0002,2020-12-18,0:30:00,cefepime,,IV,75,mg,5,8
CHLA_0002,2020-12-18,1:03:00,vancomycin,,IV,23,mg,4,13
CHLA_0002,2020-12-19,18:15:00,cefepime,,IV,60,mg,6,8
CHLA_0002,2020-12-20,4:18:00,vancomycin,,IV,24,mg,4,12
CHLA_0003,2021-04-20,15:17:00,meropenem,,IV,200,mg,2,1
CHLA_0003,2021-04-21,2:20:00,meropenem,,IV,400,mg,17,8
CHLA_0003,2021-04-22,14:16:00,Other,sulfamethoxazole-trimethoprim,IV,50,mg,9,12
I tried the following without success:
csfmelt <- melt(csf_antibio, id.vars=1:1)
csf <- dcast(csfmelt, study_id ~ variable, value.var = "value", fun.aggregate = sum)
I want the final dataframe to have each study id per row with variables
study_id,proanbx_dt1,proanbx_tm1,name1,othername1,route1,dosage1,units1,doses1,freq1,proanbx_dt2,proanbx_tm2,name2,othername2,route2,dosage2,units2,doses2,freq2,proanbx_dt3,proanbx_tm3,name3,othername3,route3,dosage3,units3,doses3,freq3,proanbx_dt4,proanbx_tm4,name4,othername4,route4,dosage4,units4,doses4,freq4
CHLA_0001,2021-07-22,20:01:00,ceftriaxone,,IV,1250,mg,4,13, 2021-07-22,20:19:00,metronidazole,,IV,250,mg,5,9, 2021-07-22,23:17:00,vancomycin,,IV,350,mg,3,6,2021-08-09,19:34:00,cefazolin,,IV,738,mg,1,8
CHLA_0002,2020-12-18,0:30:00,cefepime,,IV,75,mg,5,8,2020-12-18,1:03:00,vancomycin,,IV,23,mg,4,13,2020-12-19,18:15:00,cefepime,,IV,60,mg,6,8,2020-12-20,4:18:00,vancomycin,,IV,24,mg,4,12,2021-04-20,15:17:00,meropenem,,IV,200,mg,2,1,2021-04-21,2:20:00,meropenem,,IV,400,mg,17,8,2021-04-22,14:16:00,Other,sulfamethoxazole-trimethoprim,IV,50,mg,9,12
Thanks in advance!
CodePudding user response:
Your desired output has a "number" component that is not naturally inferred by dcast. We can add it relatively easily with ave (base R, certainly this can be done just as easily in data.table or dplyr groupings).
reshape2 and base R
csfmelt$num <- ave(seq(nrow(csfmelt)), csfmelt[c("study_id","variable")], FUN = seq_along)
head(csfmelt)
# study_id variable value num
# 1 CHLA_0001 proanbx_dt 2021-07-22 1
# 2 CHLA_0001 proanbx_dt 2021-07-22 2
# 3 CHLA_0001 proanbx_dt 2021-07-22 3
# 4 CHLA_0001 proanbx_dt 2021-08-09 4
# 5 CHLA_0002 proanbx_dt 2020-12-18 1
# 6 CHLA_0002 proanbx_dt 2020-12-18 2
csfwide <- reshape2::dcast(csfmelt, study_id ~ variable num, value.var = "value")
csfwide
# study_id proanbx_dt_1 proanbx_dt_2 proanbx_dt_3 proanbx_dt_4 proanbx_tm_1 proanbx_tm_2 proanbx_tm_3 proanbx_tm_4 name_1 name_2 name_3 name_4 othername_1 othername_2 othername_3 othername_4 route_1 route_2 route_3 route_4 dosage_1 dosage_2 dosage_3 dosage_4 units_1 units_2 units_3 units_4 doses_1 doses_2 doses_3 doses_4 freq_1 freq_2 freq_3 freq_4
# 1 CHLA_0001 2021-07-22 2021-07-22 2021-07-22 2021-08-09 20:01:00 20:19:00 23:17:00 19:34:00 ceftriaxone metronidazole vancomycin cefazolin IV IV IV IV 1250 250 350 738 mg mg mg mg 4 5 3 1 13 9 6 8
# 2 CHLA_0002 2020-12-18 2020-12-18 2020-12-19 2020-12-20 0:30:00 1:03:00 18:15:00 4:18:00 cefepime vancomycin cefepime vancomycin IV IV IV IV 75 23 60 24 mg mg mg mg 5 4 6 4 8 13 8 12
# 3 CHLA_0003 2021-04-20 2021-04-21 2021-04-22 <NA> 15:17:00 2:20:00 14:16:00 <NA> meropenem meropenem Other <NA> sulfamethoxazole-trimethoprim <NA> IV IV IV <NA> 200 400 50 <NA> mg mg mg <NA> 2 17 9 <NA> 1 8 12 <NA>
The column order is not what you requested, but it can be conformed a bit with this:
variables <- as.character(unique(csfmelt$variable))
sub(".*_", "", names(csfwide)[-(1:2)])
# [1] "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4"
sub("_[^_]*$", "", names(csfwide)[-(1:2)])
# [1] "proanbx_dt" "proanbx_dt" "proanbx_dt" "proanbx_tm" "proanbx_tm" "proanbx_tm" "proanbx_tm" "name" "name" "name" "name" "othername" "othername" "othername" "othername" "route" "route" "route" "route" "dosage"
# [21] "dosage" "dosage" "dosage" "units" "units" "units" "units" "doses" "doses" "doses" "doses" "freq" "freq" "freq" "freq"
nms <- names(csfwide)[-(1:2)]
newnms <- nms[order(sub(".*_", "", nms), match(nms, variables))]
csfwide2 <- subset(csfwide, select = c(names(csfwide)[1:2], newnms))
csfwide2
# study_id proanbx_dt_1 proanbx_tm_1 name_1 othername_1 route_1 dosage_1 units_1 doses_1 freq_1 proanbx_dt_2 proanbx_tm_2 name_2 othername_2 route_2 dosage_2 units_2 doses_2 freq_2 proanbx_dt_3 proanbx_tm_3 name_3 othername_3 route_3 dosage_3 units_3 doses_3 freq_3 proanbx_dt_4 proanbx_tm_4 name_4 othername_4 route_4 dosage_4 units_4 doses_4 freq_4
# 1 CHLA_0001 2021-07-22 20:01:00 ceftriaxone IV 1250 mg 4 13 2021-07-22 20:19:00 metronidazole IV 250 mg 5 9 2021-07-22 23:17:00 vancomycin IV 350 mg 3 6 2021-08-09 19:34:00 cefazolin IV 738 mg 1 8
# 2 CHLA_0002 2020-12-18 0:30:00 cefepime IV 75 mg 5 8 2020-12-18 1:03:00 vancomycin IV 23 mg 4 13 2020-12-19 18:15:00 cefepime IV 60 mg 6 8 2020-12-20 4:18:00 vancomycin IV 24 mg 4 12
# 3 CHLA_0003 2021-04-20 15:17:00 meropenem IV 200 mg 2 1 2021-04-21 2:20:00 meropenem IV 400 mg 17 8 2021-04-22 14:16:00 Other sulfamethoxazole-trimethoprim IV 50 mg 9 12 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
CodePudding user response:
@r2evans gave you a great answer, but I was thinking about your comments regarding dates and time. You didn't provide how you collected this data, so I can't tell you how to import it this way. However, I did convert these variables in the following code. That being said, adding dates isn't meaningful. I was thinking that the number of days and the amount of time passed might be more along the lines of what you were looking for for those particular variables. Unfortunately, I wasn't able to figure out how to make it work with reshape2. This uses dplyr, tidyselect and hms. Although, you would only have to call dplyr, because I've appended the packages for the applicable functions. (You need the packages installed, though.)
I didn't keep the name and othername because it's not multiple entries.
library(dplyr)
csf_antibio = read.table(header = T, sep = ",", text = "study_id,proanbx_dt,proanbx_tm,name,othername,route,dosage,units,doses,freq
CHLA_0001,2021-07-22,20:01:00,ceftriaxone,,IV,1250,mg,4,13
CHLA_0001,2021-07-22,20:19:00,metronidazole,,IV,250,mg,5,9
CHLA_0001,2021-07-22,23:17:00,vancomycin,,IV,350,mg,3,6
CHLA_0001,2021-08-09,19:34:00,cefazolin,,IV,738,mg,1,8
CHLA_0002,2020-12-18,0:30:00,cefepime,,IV,75,mg,5,8
CHLA_0002,2020-12-18,1:03:00,vancomycin,,IV,23,mg,4,13
CHLA_0002,2020-12-19,18:15:00,cefepime,,IV,60,mg,6,8
CHLA_0002,2020-12-20,4:18:00,vancomycin,,IV,24,mg,4,12
CHLA_0003,2021-04-20,15:17:00,meropenem,,IV,200,mg,2,1
CHLA_0003,2021-04-21,2:20:00,meropenem,,IV,400,mg,17,8
CHLA_0003,2021-04-22,14:16:00,Other,sulfamethoxazole-trimethoprim,IV,50,mg,9,12")
Because the time is truly linked to the date, I wrote a function to process the time difference.
timer <- function(df1){
maxtm = max(df1$proanbx_tm[df1$proanbx_dt == max(df1$proanbx_dt)]) %>% hms::as_hms()
mintm = min(df1$proanbx_tm[df1$proanbx_dt == min(df1$proanbx_dt)]) %>% hms::as_hms()
if(maxtm > mintm){
tmr = (maxtm - mintm) %>% hms::as_hms() # captures mult entries in the same day
} else if(mintm > maxtm) {
tmr = (maxtm - mintm) hms::as_hms('24:00:00') # add a full day
} else { # only one entry or the time is identical in max/min
tmr = hms::as_hms('0')
}
return(tmr)
}
I collected the column names to return the columns to the original order.
ordNames = names(csf_antibio) # collect names to return order to columns
# [1] "study_id" "proanbx_dt" "proanbx_tm" "name" "othername" "route"
# [7] "dosage" "units" "doses" "freq"
# names kept = ordNames[,c(1:3,6:10)]
Find the sums and differences in time
csf2 <- csf_antibio %>%
mutate(proanbx_dt = as.Date(proanbx_dt), # convert to date
proanbx_tm = hms::as_hms(proanbx_tm)) %>% # convert to time
group_by(study_id) %>% # group by study
summarise(proanbx_tm = timer(.data), # difference in time
proanbx_dt = max(proanbx_dt) - min(proanbx_dt), # difference in days
across(tidyselect:::where(is.integer), sum),
units = "mg",
route = "IV") %>%
select(ordNames[c(1:3,6:10)])
head(csf2)
# # A tibble: 3 × 8
# study_id proanbx_dt proanbx_tm route dosage units doses freq
# <chr> <drtn> <time> <chr> <int> <chr> <int> <int>
# 1 CHLA_0001 18 days 23:33 IV 2588 mg 13 36
# 2 CHLA_0002 2 days 03:48 IV 182 mg 19 41
# 3 CHLA_0003 2 days 22:59 IV 650 mg 28 21
