In a previous question (Creating a New Variable for Each Unique Group), I learned how to use the "pivot_longer()" function in R for formatting data:
v1 <- c("2010-2011","2011-2012", "2012-2013", "2013-2014", "2014-2015")
v2 <- c("A", "B", "C", "D", "E")
data_1 = data.frame(var_1 = rnorm(871, 10,10), var_2 = rnorm(871, 5,5))
data_1$dates <- as.factor(sample(v1, 871, replace=TRUE, prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
data_1$types <- as.factor(sample(v2, 871, replace=TRUE, prob=c(0.3, 0.2, 0.1, 0.1, 0.1)))
data_2 = data.frame(var_1 = rnorm(412, 10,10), var_2 = rnorm(412, 5,5))
data_2$dates <- as.factor(sample(v1, 412, replace=TRUE, prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
data_2$types <- as.factor(sample(v2, 412, replace=TRUE, prob=c(0.3, 0.2, 0.1, 0.1, 0.1)))
data_3 = data.frame(var_1 = rnorm(332, 10,10), var_2 = rnorm(332, 5,5))
data_3$dates <- as.factor(sample(v1, 332, replace=TRUE, prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
data_3$types <- as.factor(sample(v2, 332, replace=TRUE, prob=c(0.3, 0.2, 0.1, 0.1, 0.1)))
library(tidyverse)
list(data_1, data_2, data_3) %>%
set_names(paste0("data_", 1:length(.))) %>%
bind_rows(.id = "data_nr") %>%
count(data_nr, dates, name = "my_counts") %>%
pivot_wider(names_from = dates, values_from = my_counts, names_prefix = "counts_")
This produces the following data set:
# A tibble: 3 x 6
data_nr `counts_2010-2011` `counts_2011-2012` `counts_2012-2013` `counts_2013-2014` `counts_2014-2015`
<chr> <int> <int> <int> <int> <int>
1 data_1 443 171 83 93 81
2 data_2 200 78 44 47 43
3 data_3 172 61 32 33 34
I would like to add several more columns to the above table, such as:
mean_var1_2010-2011mean_var1_2011-2012etc.mean_var2_2010-2011mean_var2_2011-2012etc.30thquantile_var1_2010-201130thquantile_var1_2011-2012etc.30thquantile_var2_2010-201130thquantile_var2_2011-2012etc.
I tried to modify the above code to do this:
#put everything in one data frame to make it easier
my_data = rbind(data_1, data_2, data_3 )
final = my_data %>%
bind_rows(.id = "data_nr") %>%
count(data_nr, dates, name = "my_counts") %>%
mean(data_nr, dates, var_1, name = "my_mean_var_1") %>%
mean(data_nr, dates, var_2, name = "my_mean_var_2") %>%
quantile(data_nr, dates, var_1, probs = 0.3, name = "my_30_percentile_var_1") %>%
quantile(data_nr, dates, var_2, probs = 0.3, name = "my_30_percentile_var_2") %>%
pivot_wider(names_from = dates, values_from = c(my_counts, my_mean_var_1, my_mean_var_2, my_30_percentile_var_1, my_30_percentile_var_2), names_prefix = "counts_")
But I don't think this is the correct way to do this
Can someone please show me how to do this?
Thanks!
CodePudding user response:
Suppose you have a dataframe dt looking like this after bind_rows.
# A tibble: 1,615 x 5
data_nr var_1 var_2 dates types
<chr> <dbl> <dbl> <fct> <fct>
1 data_1 13.4 10.1 2010-2011 C
2 data_1 -0.0313 4.23 2010-2011 D
3 data_1 4.41 -3.36 2010-2011 B
4 data_1 7.73 9.03 2012-2013 A
5 data_1 2.18 5.73 2013-2014 A
6 data_1 7.55 -2.09 2010-2011 C
7 data_1 13.8 0.0569 2010-2011 D
8 data_1 18.4 1.97 2010-2011 A
9 data_1 9.22 6.62 2013-2014 C
10 data_1 -7.85 -1.92 2011-2012 E
# ... with 1,605 more rows
Then your tasks can be done more efficiently using summarize together with across. You can do
library(dplyr)
library(tidyr)
dt %>%
group_by(data_nr, dates) %>%
summarize(
counts = n(),
across(
.cols = c(var_1, var_2),
.fns = list(mean = mean, `30thquantile` = ~quantile(., probs = 0.3)),
.names = "{.fn}_{.col}"
),
.groups = "drop"
) %>%
pivot_wider(names_from = dates, values_from = -c(data_nr, dates))
Output
# A tibble: 3 x 26
data_nr `counts_2010-2011` `counts_2011-2012` `counts_2012-2013` `counts_2013-2014` `counts_2014-2015` `mean_var_1_2010-2011`
<chr> <int> <int> <int> <int> <int> <dbl>
1 data_1 439 172 81 83 96 10.9
2 data_2 203 70 43 57 39 10.6
3 data_3 180 64 29 31 28 9.34
# ... with 19 more variables: mean_var_1_2011-2012 <dbl>, mean_var_1_2012-2013 <dbl>, mean_var_1_2013-2014 <dbl>,
# mean_var_1_2014-2015 <dbl>, 30thquantile_var_1_2010-2011 <dbl>, 30thquantile_var_1_2011-2012 <dbl>,
# 30thquantile_var_1_2012-2013 <dbl>, 30thquantile_var_1_2013-2014 <dbl>, 30thquantile_var_1_2014-2015 <dbl>,
# mean_var_2_2010-2011 <dbl>, mean_var_2_2011-2012 <dbl>, mean_var_2_2012-2013 <dbl>, mean_var_2_2013-2014 <dbl>,
# mean_var_2_2014-2015 <dbl>, 30thquantile_var_2_2010-2011 <dbl>, 30thquantile_var_2_2011-2012 <dbl>,
# 30thquantile_var_2_2012-2013 <dbl>, 30thquantile_var_2_2013-2014 <dbl>, 30thquantile_var_2_2014-2015 <dbl>
