I would like to use some sort of regex function for the rest of the data (~2 million rows) to extract relevant date information (ideally have Day, Month, and time in date format since Year is only 2009 for ~2 million rows).
I have a column that looks like so:
ID | created_at
1 Mon Apr 06 22:19:45 PDT 2009
2 Mon Apr 06 22:19:49 PDT 2009
I applied these two functions to extract "day" and delete 'PDT 2009' from the end but now I would like to make the rest of the columns be date format for plotting purposes.
df$Day <- sub("([A-Za-z] ).*", "\\1", df$created_at) ## Extract first word
df$delete <- gsub("\\PDT.*","", df$created_at) ## Delete everything after PDT starts
Desired outcome:
ID | created_at | Month | Day | Time
1 Mon Apr 06 22:19:45 PDT 2009 Apr Mon 22:19:45
2 Mon Apr 06 22:19:49 PDT 2009 Apr Mon 22:19:49
CodePudding user response:
You do not need to use any regexes, just regular date formatting is sufficient. you can find a nice overview here or in ?strptime(). You just have to adjust for the separators. This should be easier and more efficient than using regexes, splitting, etc...
Once you have this in native R DateTime formats POSIXlt and POSIXt, you can easily extract all date-related information.
strptime(x = "Mon Apr 06 22:19:45 PDT 2009",
format = "%a %b %d %H:%M:%S PDT %Y")
#> [1] "2009-04-06 22:19:45 CEST"
CodePudding user response:
Here is an approach using str_split in companion with map_char:
library(tidyverse)
df %>%
mutate(elements = str_split(created_at, fixed(" "), n=6)) %>%
mutate(Month = map_chr(elements, 2),
Day = map_chr(elements, 1),
Time = map_chr(elements, 4), .keep="unused"
)
output:
ID created_at Month Day Time
1 1 Mon Apr 06 22:19:45 PDT 2009 Apr Mon 22:19:45
2 2 Mon Apr 06 22:19:49 PDT 2009 Apr Mon 22:19:49
CodePudding user response:
You can use the following solution too:
library(dplyr)
df %>%
mutate(ID = row_number(),
Month = gsub("(?:[A-Za-z] )\\s([A-Za-z] ).*", "\\1", created_at, perl = TRUE),
Day = gsub("([A-Za-z] ).*", "\\1", created_at, perl = TRUE),
Time = gsub(".*(\\d{2}:\\d{2}:\\d{2}).*", "\\1", created_at, perl = TRUE)) %>%
relocate(ID)
# A tibble: 2 x 5
ID created_at Month Day Time
<int> <chr> <chr> <chr> <chr>
1 1 Mon Apr 06 22:19:45 PDT 2009 Apr Mon 22:19:45
2 2 Mon Apr 06 22:19:49 PDT 2009 Apr Mon 22:19:49
CodePudding user response:
If you're just after Month, Day, and Time, why not use extract from the tidyverse:
library(tidyr)
df %>%
extract(col = created_at,
into = c('Month', 'Day', 'Time'),
regex = "([A-Za-z] )\\s([A-Za-z] )\\s\\d{2}\\s([\\d:] )")
Month Day Time
1 Mon Apr 22:19:45
2 Mon Apr 22:19:49
Here, we define three capturing groups using the round brackets syntax (...) to identify the substrings we want to extract into the three columns.
If you also need created_atin its original form, just store the results as, say, df1 and use cbind:
cbind(df, df1)
created_at Month Day Time
1 Mon Apr 06 22:19:45 PDT 2009 Mon Apr 22:19:45
2 Mon Apr 06 22:19:49 PDT 2009 Mon Apr 22:19:49
Data:
df <-
data.frame(
created_at = c("Mon Apr 06 22:19:45 PDT 2009","Mon Apr 06 22:19:49 PDT 2009")
)
CodePudding user response:
I think this might help you
Libraries
library(tidyverse)
library(lubridate)
Data
df <-
tibble(
created_at = c("Mon Apr 06 22:19:45 PDT 2009","Mon Apr 06 22:19:49 PDT 2009")
)
Code
df %>%
separate(
col = created_at,
into = c("wday","month","day","time","type","year"),
sep = " ",
remove = FALSE
) %>%
mutate(
day = as.numeric(day),
year = as.numeric(year),
month_num = which(month.abb == month),
time = hms(time),
date = lubridate::make_date(year = year,month = month_num,day = day)
)
Results
# A tibble: 2 x 9
created_at wday month day time type year month_num date
<chr> <chr> <chr> <dbl> <Period> <chr> <dbl> <int> <date>
1 Mon Apr 06 22:19:45 PDT 2009 Mon Apr 6 22H 19M 45S PDT 2009 4 2009-04-06
2 Mon Apr 06 22:19:49 PDT 2009 Mon Apr 6 22H 19M 49S PDT 2009 4 2009-04-06
