How to sum a set of columns grouped by one column-CodePudding

I have a dataframe like so

ID <- c('John', 'Bill', 'Alice','Paulina')
Type1 <- c(1,1,0,1)
Type2 <- c(0,1,1,0)     
cluster <- c(1,2,3,1)

test <- data.frame(ID, Type1, Type2, cluster)

I want to group by cluster and sum the values in all the other columns apart from ID that should be dropped.

I achieved it through

test.sum <- test %>%
  group_by(cluster)%>%
  summarise(sum(Type1), sum(Type2))

However, I have thousands of types and I can't write out each column in summarise manually. Can you help me?

CodePudding user response：

This is whereacross() and contains comes in incredibly useful to select the columns you want to summarise across:

test %>% 
    group_by(cluster) %>% 
    summarise(across(contains("Type"), sum))

  cluster Type1 Type2
    <dbl> <dbl> <dbl>
1       1     2     0
2       2     1     1
3       3     0     1

Alternatively, pivoting the dataset into long and then back into wide means you can easily analyse all groups and clusters at once:

library(dplyr)
library(tidyr)

test %>% 
    pivot_longer(-c(ID, cluster)) %>% 
    group_by(cluster, name) %>% 
    summarise(sum_value = sum(value)) %>% 
    pivot_wider(names_from = "name", values_from = "sum_value")

  cluster Type1 Type2
    <dbl> <dbl> <dbl>
1       1     2     0
2       2     1     1
3       3     0     1

CodePudding user response：

What about summarise_if:

library(dplyr)
test %>%
  group_by(cluster) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE)

# A tibble: 3 x 3
  cluster Type1 Type2
    <dbl> <dbl> <dbl>
1       1     2     0
2       2     1     1
3       3     0     1

CodePudding user response：

Base R

You can exploit split which is equivalent to group_by(). This should give you what you are looking for, regardless of how many Types you have.

my_split <- split(subset(test, select = grep('^Ty', names(test))), test[, -1]$cluster)
my_sums <- sapply(my_split, \(x) colSums(x))
my_sums <- data.frame( cluster = as.numeric(gsub("\\D", '', colnames(my_sums))),
                       t(my_sums) )

Output

> my_sums
  cluster Type1 Type2
1       1     2     0
2       2     1     1
3       3     0     1

Note: use function(x) instead of \(x) if you use a version of R <4.1.0