I'm trying to run some statistics on air quality data but the sensors have a couple of days with errors: where the sensor was either offline, or readings were too high. I'd like to replace the data on those days with the average for the week for that location (sensor).
airDataWork <- data.frame(
Location= c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
Date = 2021-03-19, 2021-03-20, 2021-03-21, 2021-03-22, 2021-03-19, 2021-03-20, 2021-03-21),
maxParticulates=c(10,0,12,1900,2,2,1000)
)
#set bounds
lower_bound <- quantile(airDataWork$maxParticulates, 0.10)
upper_bound <- quantile(airDataWork$maxParticulates, 0.90)
#get the outliers
outlier_ind <- which(airDataWork$maxParticulates < lower_bound | airDataWork$maxParticulates > upper_bound)
#exclude the outliers and store the average for each location\week
airDataClean <- airDataWork[-outlier_ind, ] %>%
group_by(Location, year(Date), week(Date)) %>%
summarise(mean(maxParticulates))
#????
How can I now update the maxParticulates figure in my outlier data with average figure from airDataClean for the Location/year/week using Location/Date in the source?
Many thanks
CodePudding user response:
A dplyr solution could be to summarize the grouped values (as you were are doing) and then using a join-function and a simple ifelse condition to change the outliers:
airDataWork %>%
mutate(Date = as.Date(Date, format="%Y-%m-%d")) %>%
group_by(Location, lubridate::year(Date), lubridate::week(Date)) %>%
summarise(newVal=mean(maxParticulates)) %>%
right_join(airDataWork, .) %>%
mutate(maxParticulates = ifelse(((maxParticulates <= lower_bound) | (maxParticulates >= upper_bound)), newVal, maxParticulates)) %>%
select(., c(1,2,3)) %>%
as.data.frame()
Location Date maxParticulates
1 a 2021-03-19 10.0
2 a 2021-03-20 480.5
3 a 2021-03-21 12.0
4 a 2021-03-22 480.5
5 b 2021-03-19 2.0
6 b 2021-03-20 2.0
7 b 2021-03-21 1000.0
CodePudding user response:
Using base R
We may exploit split to split the airDataWork by Location. Then use lapply to get the mean maxParticulates per Location in case the value of maxParticulates is an classified as an outlier (according to your definition). If it is not an outlier, keep the original value of maxParticulates.
# your definition of the outliers
lower_bound <- quantile(airDataWork$maxParticulates, 0.10)
upper_bound <- quantile(airDataWork$maxParticulates, 0.90)
# create the desired output values
maxParticulates_update <-
do.call(c,lapply(split(airDataWork, airDataWork$Location), \(x) {
with(x, ifelse(maxParticulates < lower_bound | maxParticulates > upper_bound,
mean(maxParticulates), maxParticulates))
}))
# copy original data.frame
airDataClean <- airDataWork
# assign the desired output values to this new data.frame
airDataClean$maxParticulates <- unname(maxParticulates_update)
Output
> airDataClean
Location Date maxParticulates
1 a 2021-03-19 10.0
2 a 2021-03-20 480.5
3 a 2021-03-21 12.0
4 a 2021-03-22 480.5
5 b 2021-03-19 2.0
6 b 2021-03-20 2.0
7 b 2021-03-21 1000.0
Data
airDataWork <- structure(list(Location = c("a", "a", "a", "a", "b", "b", "b"
), Date = structure(c(18705, 18706, 18707, 18708, 18705, 18706,
18707), class = "Date"), maxParticulates = c(10, 0, 12, 1900,
2, 2, 1000)), class = "data.frame", row.names = c(NA, -7L))
