Okay, this question might be a little complicated, but here we go.
I recently learned how to use rowSums to count the number of times a certain number appears across a row. Here is my dataframe I'm working with:
Subjects LayCCD Day.1 Day.2 Day.3 Day.4 Day.5 Day.6 Day.7 Day.8
Subject 1 Day 2 1.0 0.0 0 1 1.0 1.5 200.0 1
Subject 2 Day 5 160.0 1.5 0 0 1.0 1.0 160.0 1
Subject 3 Day 4 0.0 0.0 0 1 1.0 1.0 1.0 0
Subject 4 Day 7 1.5 1.0 1 1 1.5 1.5 1.5 1
The first columnn has the Subject name/ID, the second has the LayCCD (the variable I want to reference), and the rest are Days. Here is the code I used so far to count the number of times a 1 appears in each row, a 0 appears in each row, and any number other than a 0 appears in each row:
df$PresentAlone <- rowSums(df[,3:10] == "1")
df$Absent <- rowSums(df[,3:10] == "0")
df$Present <- rowSums(df[,3:10] != "0")
I'm basically just using rowSums to count then number of times a 1, 0, or not a 0 appears from the 3rd to the 10th column, which are the columns of days 1-8. That code gives me this dataframe:
Subjects LayCCD Day.1 Day.2 Day.3 Day.4 Day.5 Day.6 Day.7 Day.8 PresentAlone Absent Present
Subject 1 Day 2 1.0 0.0 0 1 1.0 1.5 200.0 1 4 2 6
Subject 2 Day 5 160.0 1.5 0 0 1.0 1.0 160.0 1 3 2 6
Subject 3 Day 4 0.0 0.0 0 1 1.0 1.0 1.0 0 4 4 4
Subject 4 Day 7 1.5 1.0 1 1 1.5 1.5 1.5 1 4 0 8
This is great so far, but one thing I will need to do is count specific sections of the rows that will vary for each subject. For example, I might only need to count days 5-8 for Subject 1, but days 3-7 for Subject 2, and so on. I am working with a large dataset, so this is not something I want to do manually.
This is where the "Lay CCD" column comes in. As you can see, the Lay CCD column contains a specific day for each subject, ranging from 1-8. What I want to do is reference that value in LayCCD in a rowSums formula so that I can count the same variables as above (1, 0, not a 0) based off of that LayCCD value. For example, if I wanted to count from the LayCCD to 2 days after the LayCCD for Subject 1, I would want the rowSums formula to ONLY count the number of 1s, 0s, and "not 0s" from Day 2 to Day 4 (Day 2 is the LayCCD, and day 4 is 2 days after the LayCCD). If I did that same formula for Subject 2, where I want to count the LayCCD to two days after the Lay CCD, the formula would count from Day 5 to Day 7 (Day 5 is the LayCCD for subject 2, and Day 7 is two days after the lay CCD). I would also probably want to count before the LayCCD as well, maybe something like the 5 days leading up to the LayCCD and including the LayCCD.
Basically, how can I use the rowSums formula to reference the LayCCD value for each row so that I can automatically count specific values from specific sections of each row based on the LayCCD for each Subject?
Sorry if this makes no sense, and please let me know if I need to clarify anything else. I appreciate the help a lot!!!
CodePudding user response:
Nice brain teaser. Even though I have not used rowSums, I believe the following will help (although there must be more efficient ways to do this).
The function ZeroAndOnes_fun() takes the following arguments:
DF: a character string indicating thedata.frameto be evaluated.number_of_days: an integer indicating how many days from LayCCD (including the day indicated by LayCCD itself) to evaluate. For example, if LayCCD equals 'Day 2' for subject i, the function will count the number of 0s and 1s at Day 2, 3, and 4 for subject i.before_LayCCD: a logical indicating whethernumber_of_daysfrom LayCCD should be counted backward or foward from LayCCD. For instance, if LayCCD equals 'Day 3' for subject i andbefore_LayCCDis set toTRUE, the function will count the number of 0s and 1s at Day 3, 2, and 1 for subject i.
ZeroAndOnes_fun <- function(DF, number_of_days, before_LayCCD = FALSE) {
days <- as.numeric(gsub('[^0-9]', '', DF$LayCCD))
id <- rep(seq_len(length(DF$Subjects)), each = number_of_days)
if (before_LayCCD){
days_to_count <- paste0(
'Day.', rep(days, each = number_of_days) - 0:(number_of_days-1L)
)
} else {
days_to_count <- paste0(
'Day.', rep(days, each = number_of_days) 0:(number_of_days-1L)
)
}
out <- data.frame(id, days_to_count)
out_list <- split(out, out$id)
mat <- matrix(numeric(length(out_list) * number_of_days),
ncol = number_of_days)
for (j in seq_along(out_list)) {
for (i in seq_len(number_of_days)) {
col_index <- lapply(out_list, '[', 'days_to_count')[[j]][[1L]][i]
mat[j, i] <- DF[j, col_index]
}
}
fun <- function(x) ifelse(length(x) == 0L, 0, x)
zeros <- ones <- numeric(length(out_list))
for (i in seq_len(length(out_list))) {
my_rle <- rle(mat[i, ])
zeros[i] <- fun(sum(my_rle$lengths[my_rle$values == 0L]))
ones[i] <- fun(sum(my_rle$lengths[my_rle$values == 1L]))
}
return(cbind(DF, zeros, ones))
}
Output
> ZeroAndOnes_fun(DF = df, number_of_days = 2L, before_LayCCD = TRUE)
Subjects LayCCD Day.1 Day.2 Day.3 Day.4 Day.5 Day.6 Day.7 Day.8 Day.9 zeros ones
1 Subject1 Day 3 1.0 0.0 0 1 1.0 1.5 200.0 1 1 2 0
2 Subject2 Day 5 160.0 1.5 0 0 1.0 1.0 160.0 1 1 1 1
3 Subject3 Day 4 0.0 0.0 0 1 1.0 1.0 1.0 0 0 1 1
4 Subject4 Day 7 1.5 1.0 1 1 1.5 1.5 1.5 1 1 0 0
Small benchmark.
# 1000 rows
df <- do.call(rbind, replicate(250, df, simplify = FALSE))
> system.time(ZeroAndOnes_fun(DF = df, number_of_days = 2L, before_LayCCD = TRUE))
user system elapsed
22.85 0.00 22.85
Data
df <- data.frame(
Subjects = paste0('Subject', 1:4),
LayCCD = paste0('Day ', c(3,5,4,7)),
Day.1 = c(1,160,0,1.5),
Day.2 = c(0,1.5,0,1),
Day.3 = c(0,0,0,1),
Day.4 = c(1,0,1,1),
Day.5 = c(1,1,1,1.5),
Day.6 = c(1.5,1,1,1.5),
Day.7 = c(200,160,1,1.5),
Day.8 = c(1,1,0,1),
Day.9 = c(1,1,0,1)
)
