I have a dataframe like so
ID <- c('John', 'Bill', 'Alice','Paulina')
Type1 <- c(1,1,0,1)
Type2 <- c(0,1,1,0)
cluster <- c(1,2,3,1)
test <- data.frame(ID, Type1, Type2, cluster)
I want to group by cluster and sum the values in all the other columns apart from ID that should be dropped.
I achieved it through
test.sum <- test %>%
group_by(cluster)%>%
summarise(sum(Type1), sum(Type2))
However, I have thousands of types and I can't write out each column in summarise manually. Can you help me?
CodePudding user response:
This is whereacross() and contains comes in incredibly useful to select the columns you want to summarise across:
test %>%
group_by(cluster) %>%
summarise(across(contains("Type"), sum))
cluster Type1 Type2
<dbl> <dbl> <dbl>
1 1 2 0
2 2 1 1
3 3 0 1
Alternatively, pivoting the dataset into long and then back into wide means you can easily analyse all groups and clusters at once:
library(dplyr)
library(tidyr)
test %>%
pivot_longer(-c(ID, cluster)) %>%
group_by(cluster, name) %>%
summarise(sum_value = sum(value)) %>%
pivot_wider(names_from = "name", values_from = "sum_value")
cluster Type1 Type2
<dbl> <dbl> <dbl>
1 1 2 0
2 2 1 1
3 3 0 1
CodePudding user response:
What about summarise_if:
library(dplyr)
test %>%
group_by(cluster) %>%
summarise_if(is.numeric, sum, na.rm = TRUE)
# A tibble: 3 x 3
cluster Type1 Type2
<dbl> <dbl> <dbl>
1 1 2 0
2 2 1 1
3 3 0 1
CodePudding user response:
Base R
You can exploit split which is equivalent to group_by(). This should give you what you are looking for, regardless of how many Types you have.
my_split <- split(subset(test, select = grep('^Ty', names(test))), test[, -1]$cluster)
my_sums <- sapply(my_split, \(x) colSums(x))
my_sums <- data.frame( cluster = as.numeric(gsub("\\D", '', colnames(my_sums))),
t(my_sums) )
Output
> my_sums
cluster Type1 Type2
1 1 2 0
2 2 1 1
3 3 0 1
Note: use function(x) instead of \(x) if you use a version of R <4.1.0
