Home > Back-end >  How to sum a set of columns grouped by one column
How to sum a set of columns grouped by one column

Time:01-11

I have a dataframe like so

ID <- c('John', 'Bill', 'Alice','Paulina')
Type1 <- c(1,1,0,1)
Type2 <- c(0,1,1,0)     
cluster <- c(1,2,3,1)

test <- data.frame(ID, Type1, Type2, cluster)

I want to group by cluster and sum the values in all the other columns apart from ID that should be dropped.

I achieved it through

test.sum <- test %>%
  group_by(cluster)%>%
  summarise(sum(Type1), sum(Type2)) 

However, I have thousands of types and I can't write out each column in summarise manually. Can you help me?

CodePudding user response:

This is whereacross() and contains comes in incredibly useful to select the columns you want to summarise across:

test %>% 
    group_by(cluster) %>% 
    summarise(across(contains("Type"), sum))
  cluster Type1 Type2
    <dbl> <dbl> <dbl>
1       1     2     0
2       2     1     1
3       3     0     1

Alternatively, pivoting the dataset into long and then back into wide means you can easily analyse all groups and clusters at once:

library(dplyr)
library(tidyr)

test %>% 
    pivot_longer(-c(ID, cluster)) %>% 
    group_by(cluster, name) %>% 
    summarise(sum_value = sum(value)) %>% 
    pivot_wider(names_from = "name", values_from = "sum_value")
  cluster Type1 Type2
    <dbl> <dbl> <dbl>
1       1     2     0
2       2     1     1
3       3     0     1

CodePudding user response:

What about summarise_if:

library(dplyr)
test %>%
  group_by(cluster) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE)

# A tibble: 3 x 3
  cluster Type1 Type2
    <dbl> <dbl> <dbl>
1       1     2     0
2       2     1     1
3       3     0     1

CodePudding user response:

Base R

You can exploit split which is equivalent to group_by(). This should give you what you are looking for, regardless of how many Types you have.

my_split <- split(subset(test, select = grep('^Ty', names(test))), test[, -1]$cluster)
my_sums <- sapply(my_split, \(x) colSums(x))
my_sums <- data.frame( cluster = as.numeric(gsub("\\D", '', colnames(my_sums))),
                       t(my_sums) )

Output

> my_sums
  cluster Type1 Type2
1       1     2     0
2       2     1     1
3       3     0     1

Note: use function(x) instead of \(x) if you use a version of R <4.1.0

  •  Tags:  
  • Related