Writing a function across multiple subgroups

Time:01-21

I am trying to calculate a population parameter for multiple species within their respective sample sites. I have a sample of my df structured as:

Dataframe

df<- structure(list(waterbody = c("Homer", "Homer", "Homer", "Homer", 
"Homer", "Homer", "Homer", "Homer", "Homer", "Homer", "Homer", 
"Homer", "Homer", "Homer", "Homer", "Homer", "Homer", "Homer", 
"Homer", "Homer", "Homer", "Homer", "Homer", "Homer", "Homer", 
"Homer", "Homer", "Homer", "Homer", "Homer", "Homer", "Homer", 
"Homer", "Homer", "Homer", "Homer", "Homer", "Homer", "Homer", 
"Homer", "Homer", "Homer", "Homer", "Homer", "Homer", "Homer", 
"Homer", "Homer", "Homer", "Homer", "Homer"), sample_site = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L), species = c("LMB", "LMB", "BLG", "LMB", "BLG", "BLG", 
"BLG", "BLG", "BLG", "LMB", "LMB", "LMB", "LMB", "LMB", "BLG", 
"BLG", "LMB", "LMB", "BLG", "BLG", "LMB", "LMB", "LMB", "BLG", 
"BLG", "BLG", "BLG", "BLG", "BLG", "BLG", "BLG", "BLG", "LMB", 
"LMB", "LMB", "BLG", "LMB", "LMB", "LMB", "BLG", "LMB", "LMB", 
"LMB", "BLG", "LMB", "BLG", "LMB", "LMB", "BLG", "LMB", "BLG"
), length_mm = c(430L, 430L, 165L, 345L, 128L, 117L, 93L, 135L, 
161L, 402L, 347L, 450L, 477L, 255L, 115L, 91L, 445L, 335L, 119L, 
124L, 249L, 135L, 361L, 160L, 115L, 130L, 155L, 116L, 158L, 130L, 
126L, 158L, 500L, 330L, 150L, 90L, 333L, 404L, 343L, 150L, 285L, 
303L, 340L, 120L, 420L, 115L, 295L, 322L, 85L, 145L, 185L), stock = c(1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 0, 1), quality = c(1, 1, 1, 1, 0, 0, 0, 0, 
1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 
0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 
1)), row.names = c(NA, -51L), class = "data.frame")

This sample is filtered down to just two species at two sample sites; my full data frame has hundreds of sample sites and 20 species. I want to write a function that counts the quality individuals (marked by a '1' in the quality column) and divides that by the number of stock individuals (marked by a '1' in the stock column). Done manually for one species at one site, this looks like:

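library(dplyr)

# Quality LMB at site 1, and stock LMB at site 1
a <- filter(df, waterbody == "Homer", sample_site == 1, species == "LMB", quality == 1)
b <- filter(df, waterbody == "Homer", sample_site == 1, species == "LMB", stock == 1)

# Quality count as a percentage of stock count
nrow(a) / nrow(b) * 100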

This gives a value of 83.33 (10 quality / 12 stock * 100). However, I want to do this for each species within each sample site, so for sample sites 1 and 2 there would be a value between 0 and 100 for both LMB and BLG.

I'm hoping to have the end result be a data frame structured as:

results<- structure(list(waterbody = c("Homer", "Homer", "Homer", "Homer", 
"Homer", "Homer"), transect = c(1L, 1L, 1L, 2L, 2L, 2L), species = c("BLC", 
"BLG", "LMB", "BLC", "BLG", "GSF"), psd = c(50, 31.58, 83.33, 
100, 33.33, 0)), row.names = c(NA, 6L), class = "data.frame")

The math that goes into the function is simple. The issue I'm running into is how to apply it to filtered data so that I am not counting, for example, the number of quality individuals across multiple sample sites.

Any help or insight would be greatly appreciated.

User response:

Here is a dplyr solution:

library(dplyr)
df %>% 
  group_by(waterbody, sample_site, species) %>% 
  summarise(psd = (sum(quality==1)/sum(stock == 1))*100)

Output:

  waterbody sample_site species   psd
  <chr>           <int> <chr>   <dbl>
1 Homer               1 BLG      31.6
2 Homer               1 LMB      83.3
3 Homer               2 BLG      33.3
4 Homer               2 LMB      81.8
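Since the question asks for a function, the same grouping can be wrapped in one. The following is only a sketch: `calc_psd()` and the `toy` data frame are hypothetical names and data made up for illustration, and `.groups = "drop"` assumes dplyr 1.0.0 or later. It also returns NA instead of dividing by zero when a group has no stock individuals, which is an added design choice rather than something the answer above does.

```r
library(dplyr)

# Hypothetical toy data (not the asker's data set); GSF at site 2 has no stock fish
toy <- data.frame(
  waterbody   = "Homer",
  sample_site = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  species     = c("LMB", "LMB", "BLG", "BLG", "LMB", "LMB", "BLG", "GSF"),
  stock       = c(1, 1, 1, 1, 1, 0, 1, 0),
  quality     = c(1, 0, 1, 1, 1, 0, 0, 0)
)

# calc_psd() is a hypothetical helper: one row per waterbody/site/species
calc_psd <- function(data) {
  data %>%
    group_by(waterbody, sample_site, species) %>%
    summarise(
      n_stock   = sum(stock == 1),    # stock individuals in the group
      n_quality = sum(quality == 1),  # quality individuals in the group
      .groups   = "drop"              # return an ungrouped tibble
    ) %>%
    # Percentage of stock that is quality; NA when the group has no stock fish
    mutate(psd = if_else(n_stock > 0, n_quality / n_stock * 100, NA_real_))
}

psd_by_group <- calc_psd(toy)
psd_by_group
```

Keeping `n_stock` and `n_quality` in the result makes it easy to see which percentages rest on very few fish. On the toy data, the GSF row at site 2 gets an NA because its stock count is zero.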

User response:

Can you confirm that

  1. transect (in the expected output) is the same thing as sample_site (in the incoming dataset), and
  2. the expected dataset (which has values for "BLC" species) wasn't produced from the incoming dataset (which doesn't)?

If so, dplyr's group_by() and summarize() are all you need. Note that the code below returns a proportion between 0 and 1; multiply by 100 for a percentage.

df |> 
  dplyr::group_by(waterbody, sample_site, species) |> 
  dplyr::summarize(
    psd = sum(quality) / sum(stock)
  ) |> 
  dplyr::ungroup()

Produces

# A tibble: 4 x 4
  waterbody sample_site species   psd
  <chr>           <int> <chr>   <dbl>
1 Homer               1 BLG     0.316
2 Homer               1 LMB     0.833
3 Homer               2 BLG     0.333
4 Homer               2 LMB     0.818

Before you run that, I suggest verifying that all values of stock and quality are nonmissing and 0/1. checkmate::assert_integerish() is ideal for this.

checkmate::assert_integerish(df$stock  , any.missing = FALSE, lower = 0, upper = 1)
checkmate::assert_integerish(df$quality, any.missing = FALSE, lower = 0, upper = 1)
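If checkmate is not installed, a similar check can be sketched in base R. `is_binary_flag()` is a hypothetical helper name, and the `good` and `bad` vectors are made-up examples; this is a rough equivalent of the assertion above, not a replacement for checkmate's more detailed error messages.

```r
# Hypothetical helper: TRUE only when every value is a non-missing 0 or 1
is_binary_flag <- function(x) {
  !anyNA(x) && all(x %in% c(0, 1))
}

good <- c(1, 0, 1, 1)   # passes the check
bad  <- c(1, 0, NA, 2)  # fails: contains a missing value and a 2

is_binary_flag(good)
is_binary_flag(bad)

# Applied to the real data, this would stop with an error if either column fails:
# stopifnot(is_binary_flag(df$stock), is_binary_flag(df$quality))
```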