How to convert column into date format in R that has extra information-CodePudding

I would like to use some sort of regex function for the rest of the data (~2 million rows) to extract relevant date information (ideally have Day, Month, and time in date format since Year is only 2009 for ~2 million rows).

I have a column that looks like so:

ID | created_at
1    Mon Apr 06 22:19:45 PDT 2009
2    Mon Apr 06 22:19:49 PDT 2009

I applied these two functions to extract "day" and delete 'PDT 2009' from the end but now I would like to make the rest of the columns be date format for plotting purposes.

df$Day <- sub("([A-Za-z] ).*", "\\1", df$created_at) ## Extract first word

df$delete <- gsub("\\PDT.*","", df$created_at) ## Delete everything after PDT starts

Desired outcome:

ID | created_at                    | Month  | Day  | Time
1    Mon Apr 06 22:19:45 PDT 2009    Apr      Mon    22:19:45
2    Mon Apr 06 22:19:49 PDT 2009    Apr      Mon    22:19:49

CodePudding user response：

You do not need to use any regexes, just regular date formatting is sufficient. you can find a nice overview here or in ?strptime(). You just have to adjust for the separators. This should be easier and more efficient than using regexes, splitting, etc...

Once you have this in native R DateTime formats POSIXlt and POSIXt, you can easily extract all date-related information.

strptime(x      = "Mon Apr 06 22:19:45 PDT 2009",
         format = "%a %b %d %H:%M:%S PDT %Y")
#> [1] "2009-04-06 22:19:45 CEST"

CodePudding user response：

Here is an approach using str_split in companion with map_char:

library(tidyverse)
df %>% 
    mutate(elements = str_split(created_at, fixed(" "), n=6)) %>% 
    mutate(Month = map_chr(elements, 2),
           Day = map_chr(elements, 1),
           Time = map_chr(elements, 4), .keep="unused"
           )

output:

  ID                   created_at Month Day     Time
1  1 Mon Apr 06 22:19:45 PDT 2009   Apr Mon 22:19:45
2  2 Mon Apr 06 22:19:49 PDT 2009   Apr Mon 22:19:49

CodePudding user response：

You can use the following solution too:

library(dplyr)

df %>%
  mutate(ID = row_number(), 
         Month = gsub("(?:[A-Za-z] )\\s([A-Za-z] ).*", "\\1", created_at, perl = TRUE),
         Day = gsub("([A-Za-z] ).*", "\\1", created_at, perl = TRUE), 
         Time = gsub(".*(\\d{2}:\\d{2}:\\d{2}).*", "\\1", created_at, perl = TRUE)) %>%
  relocate(ID)

# A tibble: 2 x 5
     ID created_at                   Month Day   Time    
  <int> <chr>                        <chr> <chr> <chr>   
1     1 Mon Apr 06 22:19:45 PDT 2009 Apr   Mon   22:19:45
2     2 Mon Apr 06 22:19:49 PDT 2009 Apr   Mon   22:19:49

CodePudding user response：

If you're just after Month, Day, and Time, why not use extract from the tidyverse:

library(tidyr)
df %>%
  extract(col = created_at, 
          into = c('Month', 'Day', 'Time'), 
          regex = "([A-Za-z] )\\s([A-Za-z] )\\s\\d{2}\\s([\\d:] )")
  Month Day     Time
1   Mon Apr 22:19:45
2   Mon Apr 22:19:49

Here, we define three capturing groups using the round brackets syntax (...) to identify the substrings we want to extract into the three columns.

If you also need created_atin its original form, just store the results as, say, df1 and use cbind:

cbind(df, df1)
                    created_at Month Day     Time
1 Mon Apr 06 22:19:45 PDT 2009   Mon Apr 22:19:45
2 Mon Apr 06 22:19:49 PDT 2009   Mon Apr 22:19:49

Data:

df <- 
  data.frame(
    created_at = c("Mon Apr 06 22:19:45 PDT 2009","Mon Apr 06 22:19:49 PDT 2009") 
  )

CodePudding user response：

I think this might help you

Libraries

library(tidyverse)
library(lubridate)

Data

df <- 
  tibble(
    created_at = c("Mon Apr 06 22:19:45 PDT 2009","Mon Apr 06 22:19:49 PDT 2009") 
  )

Code

df %>% 
  separate(
    col = created_at,
    into = c("wday","month","day","time","type","year"),
    sep = " ",
    remove = FALSE
    ) %>% 
  mutate(
    day = as.numeric(day),
    year = as.numeric(year),
    month_num = which(month.abb == month),
    time = hms(time),
    date = lubridate::make_date(year = year,month = month_num,day = day)
  )

Results

# A tibble: 2 x 9
  created_at                   wday  month   day time        type   year month_num date      
  <chr>                        <chr> <chr> <dbl> <Period>    <chr> <dbl>     <int> <date>    
1 Mon Apr 06 22:19:45 PDT 2009 Mon   Apr       6 22H 19M 45S PDT    2009         4 2009-04-06
2 Mon Apr 06 22:19:49 PDT 2009 Mon   Apr       6 22H 19M 49S PDT    2009         4 2009-04-06