I'd like some help to compute lag mean of a column just if a value is NA
Follow the example
library(tidyverse)
iname <- rep('a', 10)
itime <- 1:10
ival <- c(1, 2, 3, NA, 5, 6, NA, 8, 9, 10)
ds <- tibble(iname, itime, ival)
ds
# A tibble: 10 × 3
iname itime ival
<chr> <int> <dbl>
1 a 1 1
2 a 2 2
3 a 3 3
4 a 4 NA
5 a 5 5
6 a 6 6
7 a 7 NA
8 a 8 8
9 a 9 9
10 a 10 10
ds %>%
group_by(iname) %>%
mutate(...) -> result
result
# A tibble: 10 × 3
iname itime ival
<chr> <int> <dbl>
1 a 1 1
2 a 2 2
3 a 3 3
4 a 4 2
5 a 5 5
6 a 6 6
7 a 7 3.16
8 a 8 8
9 a 9 9
10 a 10 10
The itime==4 has an empty value in ival, so in the result, this value were calculated using the previous values mean(1,2,3).
In the itime==7 I don't really care if the computation includes or not the values imputed by the mean. But both solutions are welcome, cos it has some difference.
> mean(c(1, 2, 3, 2, 5, 6))
[1] 3.166667
> mean(c(1, 2, 3, 5, 6))
[1] 3.4
I found an equivalent question here Use tidyverse to replace NA with mean of data, by group
Thanks in advance
CodePudding user response:
A solution for the result where each subsequent mean doesn't rely on the previously updated one:
nacummean <- function(x) {
nas <- is.na(x)
cs <- cumsum(replace(x, nas, 0))
x[nas] <- cs[nas] / (which(nas) - cumsum(nas)[nas])
x
}
library(dplyr)
ds %>%
group_by(iname) %>%
mutate(ival = nacummean(ival))
## A tibble: 10 x 3
## Groups: iname [1]
# iname itime ival
# <chr> <int> <dbl>
# 1 a 1 1
# 2 a 2 2
# 3 a 3 3
# 4 a 4 2
# 5 a 5 5
# 6 a 6 6
# 7 a 7 3.4
# 8 a 8 8
# 9 a 9 9
#10 a 10 10
The logic is: take a cumulative sum along the variable, after replacing NA values with 0, then replace all the NAs with the cumsum value at that point, divided by its position in the vector minus the cumulative number of NA values.
The other calculation is a bit trickier as it will probably require multiple passes. Here's one attempt using a while loop:
nacummean_upd <- function(x) {
while(sum(is.na(x)) > 0) {
fna <- match(NA,x)
xs <- head(x, fna)
x[seq(1,fna)] <- nacummean(xs)
}
x
}
library(dplyr)
ds %>%
group_by(iname) %>%
mutate(ival = nacummean_upd(ival))
## A tibble: 10 x 3
## Groups: iname [1]
# iname itime ival
# <chr> <int> <dbl>
# 1 a 1 1
# 2 a 2 2
# 3 a 3 3
# 4 a 4 2
# 5 a 5 5
# 6 a 6 6
# 7 a 7 3.17
# 8 a 8 8
# 9 a 9 9
#10 a 10 10
CodePudding user response:
scenario 1: doesn't depend on replaced NA values
not_depend <- function(x) {
na_inds <- which(is.na(x))
x[na_inds] <- sapply(na_inds, function(y) {
mean(x[seq_len(y)], na.rm = TRUE)
})
x
}
ds %>% mutate(ival = not_depend(ival))
#> # A tibble: 10 × 3
#> iname itime ival
#> <chr> <int> <dbl>
#> 1 a 1 1
#> 2 a 2 2
#> 3 a 3 3
#> 4 a 4 2
#> 5 a 5 5
#> 6 a 6 6
#> 7 a 7 3.4
#> 8 a 8 8
#> 9 a 9 9
#> 10 a 10 10
scenario 2: Depends on replaced NA values:
depends <- function(x) {
na_inds <- which(is.na(x))
sapply(na_inds, function(y){
x[y] <<- mean(x[seq_len(y)], na.rm = TRUE)
})
x
}
ds %>% mutate(ival = depends(ival))
#> # A tibble: 10 × 3
#> iname itime ival
#> <chr> <int> <dbl>
#> 1 a 1 1
#> 2 a 2 2
#> 3 a 3 3
#> 4 a 4 2
#> 5 a 5 5
#> 6 a 6 6
#> 7 a 7 3.17
#> 8 a 8 8
#> 9 a 9 9
#> 10 a 10 10
