I have some temperature data. I want to write a simple QA/QC script that will look through it and flag (in the QA/QC sense) data requiring verification/manual checking. I want it to essentially append flags to the existing column without creating a whole new column for each individual flag. I have a way to do it but it is inelegant. Is there a cleaner way to be doing this?
d<-data.frame(time=1:20, temp=c(1:5,-60,7:10,NA,12:15,160,17:20))
time is merely sequential observations and temp is some fictitious temperature data.
d$Flag[is.na(d$temp)]<-"MISSING" #flag the missing data
d$Flag[d$temp>120&!is.na(d$temp)]<-paste(d$Flag[d$temp>120&!is.na(d$temp)],"High",sep="_") #flag data beyond a threshold
d$Flag[d$temp<(-40)&!is.na(d$temp)]<-paste(d$Flag[d$temp<(-40)&!is.na(d$temp)],"Low",sep="_") #flag data below a threshold
dtIdx<-which(abs(diff(d$temp,lag=1))>10) #set an index vector of changes >10 based on first derivative
d$Flag[dtIdx]<-paste(d$Flag[dtIdx],"D10",sep="_") #select data and paste in new codes
d$Flag<-gsub("NA_","",d$Flag) #strip NA that is introduced to flags
This creates the variable Flag and then sequentially overwrites it with itself new information from each new condition. It works, but it feels messy. I also don't like having to clean the NAs that are introduced - can I ignore them from the outset somehow?
CodePudding user response:
Here is one option using tidyverse. For dtIdx, I temporarily create a new column with that information, then I create the Flag column with the other designations (i.e., MISSING, High, and Low) using case_when. Then, I unite the two columns ignoring NA and also drop dtIdx.
library(tidyverse)
df %>%
mutate(
dtIdx = ifelse(c(abs(diff(temp, lag = 1)) > 10, FALSE), "D10", NA),
Flag = case_when(is.na(temp) ~ "MISSING",
temp > 120 ~ "High",
temp < -40 ~ "Low")) %>%
unite(
"Flag",
c(dtIdx, Flag),
sep = "_",
remove = TRUE,
na.rm = TRUE
)
Output
time temp Flag
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5 D10
6 6 -60 D10_Low
7 7 7
8 8 8
9 9 9
10 10 10
11 11 NA MISSING
12 12 12
13 13 13
14 14 14
15 15 15 D10
16 16 160 D10_High
17 17 17
18 18 18
19 19 19
20 20 20
Data
df <- structure(list(
time = 1:20,
temp = c(1, 2, 3, 4, 5,-60, 7, 8,
9, 10, NA, 12, 13, 14, 15, 160, 17, 18, 19, 20)
),
class = "data.frame",
row.names = c(NA,-20L))
CodePudding user response:
You can abstract out a function from the procedure you use. Something like this
flag <- function(..., init, sep = "_") {
trimws(Reduce(
\(x, y) replace(x, y[[1L]], paste(x[y[[1L]]], y[[2L]], sep = sep)),
list(...), init = init
), "left", sep)
}
Then apply it like this
d$Flag <- flag(
list(is.na(d$temp), "MISSING"),
list(which(d$temp > 120), "High"),
list(which(d$temp < -40), "Low"),
list(which(abs(diff(d$temp, lag = 1)) > 10), "D10"),
init = character(nrow(d))
)
Output
time temp Flag
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5 D10
6 6 -60 Low_D10
7 7 7
8 8 8
9 9 9
10 10 10
11 11 NA MISSING
12 12 12
13 13 13
14 14 14
15 15 15 D10
16 16 160 High_D10
17 17 17
18 18 18
19 19 19
20 20 20
Or use factor with interaction.
na_as <- forcats::fct_explicit_na
DEFAULT <- ""
d$Flag <- trimws(whitespace = "_", interaction(sep = "_",
factor(is.na(d$temp), labels = c(DEFAULT, "MISSING")),
na_as(factor(findInterval(d$temp, c(-40, 120)), labels = c("Low", DEFAULT, "High")), DEFAULT),
na_as(factor(abs(c(diff(d$temp, lag = 1), NA)) > 10, labels = c(DEFAULT, "D10")), DEFAULT)
))
You get the same output as above.
