I have some time-series data resembling the following:
df <- data.frame(id = c("001","001","001","002", "003","003","003","003",
"004","004","004","005"),
year_mon = c(2021-01, 2021-02, 2021-03, 2021-01, 2021-01, 2021-05, 2021-06,
2021-08, 2021-03, 2021-04, 2021-07, 2021-08),
value = c(15, 20, 25, 30, 19, 17, 18, 21, 39, 45, 21, 10))
id year_mon value
001 2021-01 15
001 2021-02 20
001 2021-03 25
002 2021-01 30
003 2021-01 19
003 2021-05 17
003 2021-06 18
003 2021-08 21
004 2021-03 39
004 2021-04 45
004 2021-07 21
005 2021-08 10
I want to lag 'value' by 1 month for each id. However, in the year_mon column there are some missing dates.
I tried doing:
new_df <- df %>%
arrange(id, year_mon) %>%
group_by(id) %>%
mutate(lag_value = lag(value, 1L))
But this is returning values from the previous row, rather than the previous month. How do I do it so that the data output is:
id year_mon value lag_value
001 2021-01 15 NA
001 2021-02 20 15
001 2021-03 25 20
002 2021-01 30 NA
003 2021-01 19 NA
003 2021-05 17 NA
003 2021-06 18 17
003 2021-08 21 NA
004 2021-03 39 NA
004 2021-04 45 39
004 2021-07 21 NA
005 2021-08 10 NA
Whereby an NA is returned if the month in year_mon is not -1 month.
CodePudding user response:
Temporarily fill in all the missing months by converting them to dates and using complete
df %>%
mutate(
Date1 = as.Date(paste(year_mon,"-01",sep=""))
) %>%
arrange(id, year_mon) %>%
group_by(id) %>%
complete(Date1 = seq.Date(min(Date1), max(Date1), by='month')) %>%
mutate(lag_value = lag(value, 1L)) %>%
filter(!is.na(year_mon)) %>%
select(-Date1)
id year_mon value lag_value
<chr> <chr> <dbl> <dbl>
1 001 2021-01 15 NA
2 001 2021-02 20 15
3 001 2021-03 25 20
4 002 2021-01 30 NA
5 003 2021-01 19 NA
6 003 2021-05 17 NA
7 003 2021-06 18 17
8 003 2021-08 21 NA
9 004 2021-03 39 NA
10 004 2021-04 45 39
11 004 2021-07 21 NA
12 005 2021-08 10 NA
CodePudding user response:
Define a function diff_mon which returns the difference in months between the current year_mon and prior one. It converts the argument to yearmon class and then uses diff to get the difference in years between the current and prior entry. Multiplying that by 12 and rounding gives the number of months between the current and prior item. If that is 1 then use the lag and otherwise use NA.
library(dplyr)
library(zoo)
# difference in months.
# x is character vec with format yyyy-mm (or has class with as.yearmon method)
diff_mon <- function(x) round(12 * c(NA, diff(as.yearmon(x))))
df %>%
arrange(id, year_mon) %>%
group_by(id) %>%
mutate(lag = ifelse(diff_mon(year_mon) == 1, lag(value), NA)) %>%
ungroup
giving:
# A tibble: 12 x 4
id year_mon value lag
<chr> <chr> <dbl> <dbl>
1 001 2021-01 15 NA
2 001 2021-02 20 15
3 001 2021-03 25 20
4 002 2021-01 30 NA
5 003 2021-01 19 NA
6 003 2021-05 17 NA
7 003 2021-06 18 17
8 003 2021-08 21 NA
9 004 2021-03 39 NA
10 004 2021-04 45 39
11 004 2021-07 21 NA
12 005 2021-08 10 NA
Note
The quotes were missing around the year_mon values in the question so we used this.
df <- structure(list(id = c("001", "001", "001", "002", "003", "003",
"003", "003", "004", "004", "004", "005"), year_mon = c("2021-01",
"2021-02", "2021-03", "2021-01", "2021-01", "2021-05", "2021-06",
"2021-08", "2021-03", "2021-04", "2021-07", "2021-08"), value = c(15,
20, 25, 30, 19, 17, 18, 21, 39, 45, 21, 10)), class = "data.frame", row.names = c(NA,
-12L))
