Select the optimal number based on conditions-CodePudding

This is my minimal dataset:

df=structure(list(ID = c(3942504L, 3199413L, 1864266L, 4037617L, 
2030477L, 1342330L, 5434070L, 3200378L, 4810153L, 4886225L), 
    MI_TIME = c(1101L, 396L, 1140L, 417L, 642L, 1226L, 1189L, 
    484L, 766L, 527L), MI_Status = c(0L, 0L, 1L, 0L, 0L, 0L, 
    0L, 0L, 1L, 0L), Stroke_status = c(1L, 0L, 1L, 0L, 0L, 0L, 
    0L, 1L, 1L, 0L), Stroke_time = c(1101L, 396L, 1140L, 417L, 
    642L, 1226L, 1189L, 484L, 766L, 527L), Arrhythmia_status = c(NA, 
    NA, TRUE, NA, NA, TRUE, NA, NA, TRUE, NA), Arrythmia_time = c(1101L, 
    356L, 1122L, 7L, 644L, 126L, 118L, 84L, 76L, 5237L)), row.names = c(NA, 
10L), class = "data.frame")

As you can see, I have mainly 2 types of variables "_status" and "_time".

I am preparing my dataset for a survival analysis, and "time" is time to event in days.

But the problem arrives when I try to create a variable called "any cardiovascular outcome" (df$CV) That I have defined as the following:

df$CV = NA
df$CV <- with(df, ifelse(MI_Status=='1' | Stroke_status=='1' | Arrhythmia_status== 'TRUE'  ,'1', '0'))              
df$CV = as.factor(df$CV)

The problem I have is with selecting the optimal time to event. As now I have a new variable called df$CV, but 3 different "_time" variables. So I would like to create a new column, called df$CV_time where time, is the time for the event that happened first. There is a slight difficulty in this problem though, and I put an example:

If we have a subject with MI_status==1, Arrythmia_status==NA, stroke_status==1 and MI_time==200, Arrythmia_time==100, stroke_time==220 --> the correct time for df$CV would be 200, as it is the time for the earliest event.

However, in a case where MI_status==0, Arrythmia_status==NA, stroke_status==0 and MI_time==200, Arrythmia_time==100, stroke_time==220 --> the correct time for df$CV would be 220, as it is the time for latest follow up is 220 days.

How could I select the optimal number for df$CV based on these conditions?

CodePudding user response：

This might be one approach using tidyverse.

First, you may want to make sure your column names are consistent with spelling and case (here using rename).

Then, you can explicitly define your "Arrhythmia" outcome as TRUE or FALSE (instead of using NA).

You can put your data into long form with pivot_longer, and then group_by your ID. You can include the specific columns related to MI, stroke, and arrhythmia here (where there are "time" and "status" columns available). Note that in your actual dataset (where you use glimpse - it is unclear what you want for arrhythmia - there's a pif column name, but nothing specific for time or status).

Your cardiovascular outcome will include status for MI or Stroke that is 1, or Arrhythmia that is TRUE.

The time to event would be the min time if there was a cardiovascular outcome, otherwise use the censored time of latest follow up or max time.

Let me know if this gives you the desired output.

library(tidyverse)

df %>%
  rename(MI_time = MI_TIME, MI_status = MI_Status, Arrhythmia_time = Arrythmia_time) %>%
  replace_na(list(Arrhythmia_status = F)) %>%
  pivot_longer(cols = c(starts_with("MI_"), starts_with("Stroke_"), starts_with("Arrhythmia_")), 
               names_to = c("event", ".value"), 
               names_sep = "_") %>%
  group_by(ID) %>%
  summarise(
    any_cv_outcome = any(status[event %in% c("MI", "Stroke")] == 1 | status[event == "Arrhythmia"]),
    cv_time_to_event = ifelse(any_cv_outcome, min(time), max(time))
  )

Output

        ID any_cv_outcome cv_time_to_event
     <int> <lgl>                     <int>
 1 1342330 TRUE                        126
 2 1864266 TRUE                       1122
 3 2030477 FALSE                       644
 4 3199413 FALSE                       396
 5 3200378 TRUE                         84
 6 3942504 TRUE                       1101
 7 4037617 FALSE                       417
 8 4810153 TRUE                         76
 9 4886225 FALSE                      5237
10 5434070 FALSE                      1189