Home > Software design >  Assigning multiple text flags to an observation
Assigning multiple text flags to an observation

Time:01-18

I have some temperature data. I want to write a simple QA/QC script that will look through it and flag (in the QA/QC sense) data requiring verification/manual checking. I want it to essentially append flags to the existing column without creating a whole new column for each individual flag. I have a way to do it but it is inelegant. Is there a cleaner way to be doing this?

d<-data.frame(time=1:20, temp=c(1:5,-60,7:10,NA,12:15,160,17:20)) 

time is merely sequential observations and temp is some fictitious temperature data.

d$Flag[is.na(d$temp)]<-"MISSING" #flag the missing data
d$Flag[d$temp>120&!is.na(d$temp)]<-paste(d$Flag[d$temp>120&!is.na(d$temp)],"High",sep="_") #flag data beyond a threshold
d$Flag[d$temp<(-40)&!is.na(d$temp)]<-paste(d$Flag[d$temp<(-40)&!is.na(d$temp)],"Low",sep="_") #flag data below a threshold
dtIdx<-which(abs(diff(d$temp,lag=1))>10) #set an index vector of changes >10 based on first derivative
d$Flag[dtIdx]<-paste(d$Flag[dtIdx],"D10",sep="_") #select data and paste in new codes 
d$Flag<-gsub("NA_","",d$Flag) #strip NA that is introduced to flags

This creates the variable Flag and then sequentially overwrites it with itself new information from each new condition. It works, but it feels messy. I also don't like having to clean the NAs that are introduced - can I ignore them from the outset somehow?

CodePudding user response:

Here is one option using tidyverse. For dtIdx, I temporarily create a new column with that information, then I create the Flag column with the other designations (i.e., MISSING, High, and Low) using case_when. Then, I unite the two columns ignoring NA and also drop dtIdx.

library(tidyverse)

df %>%
  mutate(
    dtIdx = ifelse(c(abs(diff(temp, lag = 1)) > 10, FALSE), "D10", NA),
    Flag = case_when(is.na(temp) ~ "MISSING",
                     temp > 120 ~ "High",
                     temp < -40 ~ "Low")) %>%
  unite(
    "Flag",
    c(dtIdx, Flag),
    sep = "_",
    remove = TRUE,
    na.rm = TRUE
  )

Output

   time temp     Flag
1     1    1         
2     2    2         
3     3    3         
4     4    4         
5     5    5      D10
6     6  -60  D10_Low
7     7    7         
8     8    8         
9     9    9         
10   10   10         
11   11   NA  MISSING
12   12   12         
13   13   13         
14   14   14         
15   15   15      D10
16   16  160 D10_High
17   17   17         
18   18   18         
19   19   19         
20   20   20    

Data

df <- structure(list(
  time = 1:20,
  temp = c(1, 2, 3, 4, 5,-60, 7, 8,
           9, 10, NA, 12, 13, 14, 15, 160, 17, 18, 19, 20)
),
class = "data.frame",
row.names = c(NA,-20L))

CodePudding user response:

You can abstract out a function from the procedure you use. Something like this

flag <- function(..., init, sep = "_") {
  trimws(Reduce(
    \(x, y) replace(x, y[[1L]], paste(x[y[[1L]]], y[[2L]], sep = sep)), 
    list(...), init = init
  ), "left", sep)
}

Then apply it like this

d$Flag <- flag(
  list(is.na(d$temp), "MISSING"), 
  list(which(d$temp > 120), "High"), 
  list(which(d$temp < -40), "Low"), 
  list(which(abs(diff(d$temp, lag = 1)) > 10), "D10"), 
  init = character(nrow(d))
)

Output

   time temp     Flag
1     1    1         
2     2    2         
3     3    3         
4     4    4         
5     5    5      D10
6     6  -60  Low_D10
7     7    7         
8     8    8         
9     9    9         
10   10   10         
11   11   NA  MISSING
12   12   12         
13   13   13         
14   14   14         
15   15   15      D10
16   16  160 High_D10
17   17   17         
18   18   18         
19   19   19         
20   20   20         

Or use factor with interaction.

na_as <- forcats::fct_explicit_na
DEFAULT <- ""
d$Flag <- trimws(whitespace = "_", interaction(sep = "_", 
  factor(is.na(d$temp), labels = c(DEFAULT, "MISSING")), 
  na_as(factor(findInterval(d$temp, c(-40, 120)), labels = c("Low", DEFAULT, "High")), DEFAULT), 
  na_as(factor(abs(c(diff(d$temp, lag = 1), NA)) > 10, labels = c(DEFAULT, "D10")), DEFAULT)
))

You get the same output as above.

  •  Tags:  
  • Related