I`m struggling on how can I take the measure of 4 rows sequentially for multiples variables in my dataframe sorting by the date. Here is a toy example
test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A","A","A","A","A" , "C"),
measure1 = c(10:24),
measure2 = c(1:15),
time= c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-06-2021",
"11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021",
"20-03-2021", "20-10-2021", "29-06-2021", "20-07-2021", "13-06-2021"))
# my_groups measure1 measure2 time
# 1 A 18 9 15-01-2021
# 2 A 19 10 15-01-2021
# 3 A 12 3 19-09-2020
# 4 A 20 11 20-03-2021
# 5 A 23 14 20-07-2021
# 6 A 10 1 20-09-2020
# 7 A 21 12 20-10-2021
# 8 A 11 2 25-09-2020
# 9 A 22 13 29-06-2021
# 10 B 13 4 20-05-2020
# 11 B 14 5 20-06-2021
# 12 C 15 6 11-01-2021
# 13 C 16 7 13-01-2021
# 14 C 17 8 13-01-2021
# 15 C 24 15 13-06-2021
As result I'd like something like:
# my_groups measure1 measure2 time
# 1 A 17.25 8.25 20-03-2021 #mean for the first 4 elements of A and maximun date
# 2 A 16.25 7.25 25-09-2020 #mean for the others 4 elements of A and maximun date
# 3 C 18 9 13-06-2021 #mean for the first 4 elements of B and maximun date
To get this result I thought of using something like this:
test %>%
arrange( my_groups,time) %>%
group_by(my_groups) %>%
summarise(measure1 = mean(measure1),
measure2 = mean(measure2),
time = max(time))
But I'm having problem to find a solution to take these measures for 4 values consecutively.
Any hint on how can I do that?
CodePudding user response:
We can add another grouping variable g4 using integer division to ensure that we get groups of 4, and if desired, drop groups with fewer members. Then just run your summarize.
library(dplyr)
test %>%
arrange(my_groups, time) %>%
group_by(my_groups) %>%
mutate(g4 = (row_number() - 1) %/% 4) %>%
group_by(my_groups, g4) %>%
filter(n() == 4) %>%
summarise(measure1 = mean(measure1),
measure2 = mean(measure2),
time = max(time),
.groups = "drop")
#> # A tibble: 3 × 5
#> my_groups g4 measure1 measure2 time
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 0 17.2 8.25 20-03-2021
#> 2 A 1 16.2 7.25 25-09-2020
#> 3 C 0 18 9 13-06-2021
CodePudding user response:
Function "f" creates a sliding window (if you do windowing function much, I really recommend the slider package). Function "g" simply applies the function to all numeric windows. The final group_modify step applies the functions to the data and filters the good data.
library(slider)
library(lubridate)
library(dplyr)
library(tidyr)
library(purrr)
test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A","A","A","A","A" , "C"),
measure1 = c(10:24),
measure2 = c(1:15),
time= dmy(c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-06-2021",
"11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021",
"20-03-2021", "20-10-2021", "29-06-2021", "20-07-2021", "13-06-2021"))) %>%
arrange(my_groups, time) %>%
group_by(my_groups)
f <- function(x) {t <- slide_mean(x, before=3, complete=T, step = 4)}
g <- function(y) {s <- y %>% map_if(is.numeric, ~ f(.x)) %>% bind_cols()}
x <- test %>% group_modify(~ g(.x)) %>% filter(!(is.na(measure1) | is.na(measure2)))
my_groups measure1 measure2 time
<chr> <dbl> <dbl> <date>
1 A 12.8 3.75 2021-01-15
2 A 21 12 2021-07-20
3 C 18 9 2021-06-13
The differences between my and the answers above are because I translated the dates from character, so they sorted differently than as text.
