I am trying to use a t.test() to compare means for multiple columns in a dataframe where the values to compare are within each column. I have several columns of metadata (Date,Assay,Timing) for each row. My data looks like df below where the data collected is not paired and meas1 and meas2 are different measured outcomes that are not related. The comparison I am trying to make is between meas1[Timing=="Start"] and meas1[Timing == "End"] on each date, for each assay, and each test. My actual data has ~10 columns of measurement data which affects my syntax for some of the subsetting.
library(tidyverse)
df <- data.frame(Date=rep(c("2022-01-01","2022-01-02"), each = 18),
Assay = rep(c("Gly", "Asp", "Con"), each = 3, times = 4),
Timing = c(rep("Start",9),rep("End",9)),
meas1=round(rnorm(36,5,3),0),
meas2=round(rnorm(36,8,9),0))
I have tried a couple of different approaches. One is to attempt to use inner_join() and pivot_longer() to join the data together using a separate dataframe of the metadata, but I am not getting the desired outcome.
comp <- list(Assay = c("Gly","Asp","Con"),
first = "Start",
last = "End",
test = names(df %>% select(-Date,-Assay,-Timing))) %>%
cross_df()
df_pivot <- df %>%
pivot_longer(c(-Date,-Assay,-Timing), names_to = "test")
t_tests <- comp %>%
inner_join(df_pivot, by = c("Assay", "test", "first"="Timing")) %>%
rename(initial = value) %>%
inner_join(df_pivot, by = c("Date", "Assay", "test", "last"="Timing")) %>%
rename(final = value)
t_tests
# A tibble: 108 × 7
Assay first last test Date initial final
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Gly Start End meas1 2022-01-01 8 8
2 Gly Start End meas1 2022-01-01 8 9
3 Gly Start End meas1 2022-01-01 8 4
4 Gly Start End meas1 2022-01-01 4 8
5 Gly Start End meas1 2022-01-01 4 9
6 Gly Start End meas1 2022-01-01 4 4
7 Gly Start End meas1 2022-01-01 -1 8
8 Gly Start End meas1 2022-01-01 -1 9
9 Gly Start End meas1 2022-01-01 -1 4
10 Gly Start End meas1 2022-01-02 6 1
# … with 98 more rows
# ℹ Use `print(n = ...)` to see more rows
The initial values are repeated for each distinct final value, which is not what I want since the data aren't paired. I am trying to get only 36 rows: 2 dates, 3 assays, 2 tests, 6 values for each test (3 values by 2 columns). In other words, rows 1:9 should be condensed to 3 rows (row 1, 5, and 9) containing only the unique initial and final values. This is where I need help. The 1,5,9 pattern should repeat but I was hoping to avoid slicing the data after the fact.
Assuming that part was done correctly, I'd proceed as follows, which gives me the summary tibble of t.test() outcomes that I want:
t_tests <- t_tests %>%
mutate(first = NULL, last = NULL) %>%
group_by(Date,Assay,test) %>%
group_modify(~broom::tidy(t.test(.x$initial,.x$final))) %>% ungroup()
t_tests
# A tibble: 12 × 13
Date Assay test estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 2022-01-01 Asp values1 -2.33 5.67 8 -1.79 0.0989 11.7 -5.18 0.511 Welch Two Sample t-test two.sided
2 2022-01-01 Asp values2 9.33 6.67 -2.67 2.14 0.0643 8.16 -0.698 19.4 Welch Two Sample t-test two.sided
3 2022-01-01 Con values1 3.67 6.67 3 2.17 0.0552 10.1 -0.0984 7.43 Welch Two Sample t-test two.sided
4 2022-01-01 Con values2 -8.33 2.67 11 -2.93 0.0110 14.1 -14.4 -2.23 Welch Two Sample t-test two.sided
5 2022-01-01 Gly values1 0.333 5 4.67 0.343 0.737 13.1 -1.76 2.43 Welch Two Sample t-test two.sided
6 2022-01-01 Gly values2 -0.333 5.67 6 -0.100 0.922 11.5 -7.63 6.96 Welch Two Sample t-test two.sided
7 2022-01-02 Asp values1 2 6 4 1.36 0.193 16 -1.12 5.12 Welch Two Sample t-test two.sided
8 2022-01-02 Asp values2 11 11.7 0.667 2.02 0.0731 9.27 -1.26 23.3 Welch Two Sample t-test two.sided
9 2022-01-02 Con values1 -2 4.33 6.33 -1.75 0.0999 15.4 -4.43 0.429 Welch Two Sample t-test two.sided
10 2022-01-02 Con values2 11 11.3 0.333 5.64 0.0000761 13.2 6.79 15.2 Welch Two Sample t-test two.sided
11 2022-01-02 Gly values1 -2.33 3 5.33 -4.43 0.000594 13.8 -3.47 -1.20 Welch Two Sample t-test two.sided
12 2022-01-02 Gly values2 1 6 5 0.267 0.793 14.5 -7.00 9.00 Welch Two Sample t-test two.sided
Thanks in advance!
CodePudding user response:
You need to add a run_id within each date/assay/timing group so that you can match use that as a join criteria to avoid the duplication.
There's clue, when you say
I am trying to get only 36 rows: 2 dates, 3 assays, 2 tests, 6 values for each test (3 values by 2 columns)
You have a date column with 2 unique dates, an assay column with 3 unique assays, a test column with 2 unique tests.... you also need a column with the 3 unique values for your "3 values by 2 columns". I'll call the column run_id.
I'll also skip the comp data frame, and essentially do a self-join:
pivot2 = df %>%
group_by(Date, Assay, Timing) %>%
mutate(run_id = row_number()) %>%
ungroup() %>%
pivot_longer(starts_with("meas"), names_to = "test")
t_tests =
full_join(
filter(pivot2, Timing == "Start") %>% select(-Timing, initial = value),
filter(pivot2, Timing == "End") %>% select(-Timing, final = value),
by = c("Date", "Assay", "run_id", "test")
)
# # A tibble: 36 × 6
# Date Assay run_id test initial final
# <chr> <chr> <int> <chr> <dbl> <dbl>
# 1 2022-01-01 Gly 1 meas1 1 -1
# 2 2022-01-01 Gly 1 meas2 4 7
# 3 2022-01-01 Gly 2 meas1 0 1
# 4 2022-01-01 Gly 2 meas2 10 8
# 5 2022-01-01 Gly 3 meas1 8 5
# 6 2022-01-01 Gly 3 meas2 -16 4
# 7 2022-01-01 Asp 1 meas1 6 7
# 8 2022-01-01 Asp 1 meas2 28 -5
# 9 2022-01-01 Asp 2 meas1 4 6
# 10 2022-01-01 Asp 2 meas2 9 9
# # … with 26 more rows
# # ℹ Use `print(n = ...)` to see more rows
I use a full_join so that even if one data/assay/timing combo has a different number of runs, everything will still be included.
