How can I find the percentage of different groups within a column?-CodePudding

I have a dataframe (sy2.1) with two columns (Vannstand2Positive and Vannstand2Negative) that both contain NA's as a result of mutating the columns from an existing column. I also have a third column (t) with time and date, and a fourth column (Kategori) with 4 categories. I want to find the percentage of each of the 4 categories with values from Vannstand2Positive and Vannstand2Negative.

This is the code I have used, in which I only use the column with positive values (Vannstand2Positive) in this code.

sy2.1 %>% 
  filter(year(t)==2005) %>% 
  group_by(Kategori, year(t)) %>% 
  dplyr::summarise(sum_vannstandPos = sum(Vannstand2Positive)) %>% 
  dplyr::mutate(percent_vannstandPos=sum_vannstandPos/sum(sum_vannstandPos)*100)


#    Kategori year(t) sum_VannstandPos percent_VannstandPos
# 1:  Liten    2005           NA                 NA
# 2: Moderat   2005          1107               100
# 1:  Stor     2005           609               100
# 2: Størst    2005          1107               100

There is clearly something wrong with the code, as the percentage gives 100% on all the categories, but I need the correct calculations of the percentage. I tried to also use it on the column with negative values (Vannstand2Negative), but I only got NA's on the table. But I know its fully possible to find percentages of negative values. Anyone with a code that can fix this?

By adding na.rm=T to the sum function, I got percentages of almost all groups except the first group:

#    Kategori year(t) sum_VannstandPos percent_VannstandPos
# 1:  Liten    2005           NA                 NA
# 2: Moderat   2005          1107               32.34005
# 1:  Stor     2005           609               17.79141
# 2: Størst    2005          1107               49.86854

CodePudding user response：

Use na.rm = TRUE to drop the NA values from the calculation in sum. You can use .groups = 'drop' in summarise to drop the groupings and since you are using year(t) more than once it is better to create a new column with the year information.

library(dplyr)
library(lubridate)

result <- sy2.1 %>% 
  mutate(year = year(t)) %>%
  filter(year==2005) %>% 
  group_by(Kategori, year) %>% 
  dplyr::summarise(sum_vannstandPos = sum(Vannstand2Positive, na.rm = TRUE), 
                  .groups = 'drop') %>% 
  dplyr::mutate(percent_vannstandPos= prop.table(sum_vannstandPos)*100)

result

prop.table(sum_vannstandPos) is another way to write sum_vannstandPos/sum(sum_vannstandPos).

CodePudding user response：

Adding ungroup() as in the reproducab le example below.

iris %>% 
  group_by(Species) %>% 
  summarize(sum_Petal.Length = sum(Petal.Length, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(
    precentage = sum_Petal.Length / sum(sum_Petal.Length, na.rm = TRUE)
  )