How to take measures of 4 rows sequentially for multiples variables-CodePudding

I`m struggling on how can I take the measure of 4 rows sequentially for multiples variables in my dataframe sorting by the date. Here is a toy example

test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C",  "A", "A","A","A","A","A" , "C"),
                  measure1 = c(10:24),
                  measure2 = c(1:15),
                  time= c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-06-2021", 
                          "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021",
                          "20-03-2021", "20-10-2021", "29-06-2021", "20-07-2021", "13-06-2021"))
#    my_groups measure1 measure2       time
# 1          A       18        9 15-01-2021
# 2          A       19       10 15-01-2021
# 3          A       12        3 19-09-2020
# 4          A       20       11 20-03-2021
# 5          A       23       14 20-07-2021
# 6          A       10        1 20-09-2020
# 7          A       21       12 20-10-2021
# 8          A       11        2 25-09-2020
# 9          A       22       13 29-06-2021
# 10         B       13        4 20-05-2020
# 11         B       14        5 20-06-2021
# 12         C       15        6 11-01-2021
# 13         C       16        7 13-01-2021
# 14         C       17        8 13-01-2021
# 15         C       24       15 13-06-2021

As result I'd like something like:

#    my_groups measure1    measure2    time
# 1          A       17.25     8.25      20-03-2021  #mean for the first 4 elements of A and maximun date
# 2          A       16.25     7.25      25-09-2020  #mean for the others 4 elements of A and maximun date
# 3          C       18        9         13-06-2021  #mean for the first 4 elements of B and maximun date

To get this result I thought of using something like this:

test %>% 
  arrange( my_groups,time) %>% 
  group_by(my_groups) %>% 
  summarise(measure1 = mean(measure1),
            measure2 = mean(measure2), 
            time = max(time))

But I'm having problem to find a solution to take these measures for 4 values consecutively.

Any hint on how can I do that?

CodePudding user response：

We can add another grouping variable g4 using integer division to ensure that we get groups of 4, and if desired, drop groups with fewer members. Then just run your summarize.

library(dplyr)

test %>% 
  arrange(my_groups, time) %>% 
  group_by(my_groups) %>% 
  mutate(g4 = (row_number() - 1) %/% 4) %>%
  group_by(my_groups, g4) %>%
  filter(n() == 4) %>%
  summarise(measure1 = mean(measure1),
            measure2 = mean(measure2), 
            time = max(time),
            .groups = "drop")
#> # A tibble: 3 × 5
#>   my_groups    g4 measure1 measure2 time      
#>   <chr>     <dbl>    <dbl>    <dbl> <chr>     
#> 1 A             0     17.2     8.25 20-03-2021
#> 2 A             1     16.2     7.25 25-09-2020
#> 3 C             0     18       9    13-06-2021

CodePudding user response：

Function "f" creates a sliding window (if you do windowing function much, I really recommend the slider package). Function "g" simply applies the function to all numeric windows. The final group_modify step applies the functions to the data and filters the good data.

library(slider)
library(lubridate)
library(dplyr)
library(tidyr)
library(purrr)

test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C",  "A", "A","A","A","A","A" , "C"),
                  measure1 = c(10:24),
                  measure2 = c(1:15),
                  time= dmy(c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-06-2021", 
                          "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021",
                          "20-03-2021", "20-10-2021", "29-06-2021", "20-07-2021", "13-06-2021"))) %>% 
       arrange(my_groups, time)  %>%
       group_by(my_groups)

f <- function(x) {t <- slide_mean(x, before=3, complete=T, step = 4)}
g <- function(y) {s <- y %>% map_if(is.numeric,  ~ f(.x)) %>% bind_cols()}
x <- test %>% group_modify(~ g(.x)) %>% filter(!(is.na(measure1) | is.na(measure2)))

  my_groups measure1 measure2 time      
  <chr>        <dbl>    <dbl> <date>    
1 A             12.8     3.75 2021-01-15
2 A             21      12    2021-07-20
3 C             18       9    2021-06-13

The differences between my and the answers above are because I translated the dates from character, so they sorted differently than as text.