Home > Enterprise >  Using dplyr:lag() with patchy time series data
Using dplyr:lag() with patchy time series data

Time:01-08

I have some time-series data resembling the following:

df <- data.frame(id = c("001","001","001","002", "003","003","003","003", 
                        "004","004","004","005"), 
                 year_mon = c(2021-01, 2021-02, 2021-03, 2021-01, 2021-01, 2021-05, 2021-06, 
                              2021-08, 2021-03, 2021-04, 2021-07, 2021-08),
                 value = c(15, 20, 25, 30, 19, 17, 18, 21, 39, 45, 21, 10))
id   year_mon  value  
001  2021-01      15  
001  2021-02      20    
001  2021-03      25  
002  2021-01      30  
003  2021-01      19  
003  2021-05      17  
003  2021-06      18  
003  2021-08      21  
004  2021-03      39  
004  2021-04      45  
004  2021-07      21  
005  2021-08      10  

I want to lag 'value' by 1 month for each id. However, in the year_mon column there are some missing dates.

I tried doing:

new_df <- df %>%
  arrange(id, year_mon) %>% 
  group_by(id) %>% 
  mutate(lag_value = lag(value, 1L))

But this is returning values from the previous row, rather than the previous month. How do I do it so that the data output is:

id   year_mon  value  lag_value  
001  2021-01      15  NA  
001  2021-02      20  15  
001  2021-03      25  20  
002  2021-01      30  NA  
003  2021-01      19  NA  
003  2021-05      17  NA  
003  2021-06      18  17  
003  2021-08      21  NA  
004  2021-03      39  NA  
004  2021-04      45  39  
004  2021-07      21  NA  
005  2021-08      10  NA  

Whereby an NA is returned if the month in year_mon is not -1 month.

CodePudding user response:

Temporarily fill in all the missing months by converting them to dates and using complete

df %>%
  mutate(
    Date1 = as.Date(paste(year_mon,"-01",sep=""))
  ) %>%
  arrange(id, year_mon) %>% 
  group_by(id) %>%
  complete(Date1 = seq.Date(min(Date1), max(Date1), by='month')) %>%
  mutate(lag_value = lag(value, 1L)) %>%
  filter(!is.na(year_mon)) %>%
  select(-Date1)

   id    year_mon value lag_value
   <chr> <chr>    <dbl>     <dbl>
 1 001   2021-01     15        NA
 2 001   2021-02     20        15
 3 001   2021-03     25        20
 4 002   2021-01     30        NA
 5 003   2021-01     19        NA
 6 003   2021-05     17        NA
 7 003   2021-06     18        17
 8 003   2021-08     21        NA
 9 004   2021-03     39        NA
10 004   2021-04     45        39
11 004   2021-07     21        NA
12 005   2021-08     10        NA

CodePudding user response:

Define a function diff_mon which returns the difference in months between the current year_mon and prior one. It converts the argument to yearmon class and then uses diff to get the difference in years between the current and prior entry. Multiplying that by 12 and rounding gives the number of months between the current and prior item. If that is 1 then use the lag and otherwise use NA.

library(dplyr)
library(zoo)

# difference in months.
# x is character vec with format yyyy-mm (or has class with as.yearmon method)
diff_mon <- function(x) round(12 * c(NA, diff(as.yearmon(x))))

df %>% 
  arrange(id, year_mon) %>%
  group_by(id) %>%
  mutate(lag = ifelse(diff_mon(year_mon) == 1, lag(value), NA)) %>%
  ungroup

giving:

# A tibble: 12 x 4
   id    year_mon value   lag
   <chr> <chr>    <dbl> <dbl>
 1 001   2021-01     15    NA
 2 001   2021-02     20    15
 3 001   2021-03     25    20
 4 002   2021-01     30    NA
 5 003   2021-01     19    NA
 6 003   2021-05     17    NA
 7 003   2021-06     18    17
 8 003   2021-08     21    NA
 9 004   2021-03     39    NA
10 004   2021-04     45    39
11 004   2021-07     21    NA
12 005   2021-08     10    NA

Note

The quotes were missing around the year_mon values in the question so we used this.

df <- structure(list(id = c("001", "001", "001", "002", "003", "003", 
"003", "003", "004", "004", "004", "005"), year_mon = c("2021-01", 
"2021-02", "2021-03", "2021-01", "2021-01", "2021-05", "2021-06", 
"2021-08", "2021-03", "2021-04", "2021-07", "2021-08"), value = c(15, 
20, 25, 30, 19, 17, 18, 21, 39, 45, 21, 10)), class = "data.frame", row.names = c(NA, 
-12L))
  •  Tags:  
  • Related