I have a dataset with 1M rows. It has a start_date and an end_date using the format "aaaa-mm-dd hh-mm-ss".
I want to add a new column to the dataset with the duration of the time between the end and start date for each row.
So far, I'm able to get the time difference using the difftime function:
difftime("2020-11-01 13:45:40", "2020-11-01 13:36:00", units = "mins")
This gets me the following output: Time difference of 9.666667 mins, which I would like to replicate for the entire 1M-rows dataset.
For my test I'm working with a small tibble. I tried using the mutate function with rowwise and list. My code goes as follows:
rowwise() %>%
mutate(trip_duration = list((difftime(as.Date(df$`end time`), as.Date(df$`start time`), units = "mins"))))
This provides the following output:
# A tibble: 3 × 3
# Rowwise:
`start time` `end time` trip_duration
<chr> <chr> <list>
1 2020-11-01 13:36:00 2020-11-01 13:45:40 <drtn [3]>
2 2020-11-01 13:36:00 2020-11-01 13:45:40 <drtn [3]>
3 2020-11-01 13:36:00 2020-11-01 13:45:40 <drtn [3]>
The new column doesn't show what I'm looking for, it just shows the number 3 for each row no matter if I ask for the result in mins, secs, or even hours, and now I'm stuck trying to figure out the way to do the calculation.
Thanks in advance to anyone able to help, cheers!
CodePudding user response:
You need to make some changes in your code.
First and foremost, don't use $ in dplyr pipes. Pipes (%>%) were created to avoid using df$column_name everytime you want to use variable from the dataframe. Using $ can have unintended consequences when grouping the data or using rowwise as you can see in your case.
Secondly, difftime is vectorised so no need of rowwise here.
Finally, if you want time difference in minutes you should change the values to POSIXct type and not dates. Try the following -
library(dplyr)
df <- df %>%
mutate(trip_duration = difftime(as.POSIXct(`end time`),
as.POSIXct(`start time`), units = "mins"))
CodePudding user response:
You are wrapping it in a list.
This code:
rowwise() %>%
mutate(trip_duration = list((difftime(as.Date(df$`end time`), as.Date(df$`start time`), units = "mins"))))
Should be:
rowwise %>%
mutate(
trip_duration = as.Double( # in docs for ?difftime
difftime(
as.Date(df$`end time`),
as.Date(df$`start time`),
units = "mins")
)
)
)
If you are going to be working with dates/times I strongly recommend using lubridate: https://lubridate.tidyverse.org/index.html it simplifies a lot of common problems, this one included.
