I am trying to calculate a population parameter for multiple species within their respective sample sites. I have a sample of my df structured as:
Dataframe
df<- structure(list(waterbody = c("Homer", "Homer", "Homer", "Homer",
"Homer", "Homer", "Homer", "Homer", "Homer", "Homer", "Homer",
"Homer", "Homer", "Homer", "Homer", "Homer", "Homer", "Homer",
"Homer", "Homer", "Homer", "Homer", "Homer", "Homer", "Homer",
"Homer", "Homer", "Homer", "Homer", "Homer", "Homer", "Homer",
"Homer", "Homer", "Homer", "Homer", "Homer", "Homer", "Homer",
"Homer", "Homer", "Homer", "Homer", "Homer", "Homer", "Homer",
"Homer", "Homer", "Homer", "Homer", "Homer"), sample_site = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), species = c("LMB", "LMB", "BLG", "LMB", "BLG", "BLG",
"BLG", "BLG", "BLG", "LMB", "LMB", "LMB", "LMB", "LMB", "BLG",
"BLG", "LMB", "LMB", "BLG", "BLG", "LMB", "LMB", "LMB", "BLG",
"BLG", "BLG", "BLG", "BLG", "BLG", "BLG", "BLG", "BLG", "LMB",
"LMB", "LMB", "BLG", "LMB", "LMB", "LMB", "BLG", "LMB", "LMB",
"LMB", "BLG", "LMB", "BLG", "LMB", "LMB", "BLG", "LMB", "BLG"
), length_mm = c(430L, 430L, 165L, 345L, 128L, 117L, 93L, 135L,
161L, 402L, 347L, 450L, 477L, 255L, 115L, 91L, 445L, 335L, 119L,
124L, 249L, 135L, 361L, 160L, 115L, 130L, 155L, 116L, 158L, 130L,
126L, 158L, 500L, 330L, 150L, 90L, 333L, 404L, 343L, 150L, 285L,
303L, 340L, 120L, 420L, 115L, 295L, 322L, 85L, 145L, 185L), stock = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 1), quality = c(1, 1, 1, 1, 0, 0, 0, 0,
1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,
0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0,
1)), row.names = c(NA, -51L), class = "data.frame")
This is filtered down to just two species at two sample sites; my full data frame has hundreds of sample sites and 20 species. I want to write a function that counts the quality individuals (represented by a '1' in the column) and divides that by the number of stock individuals (again, denoted by a '1' in the column). Manually, this looks like:
library(dplyr)
a <- filter(df, waterbody == "Homer", sample_site == 1, species == "LMB", quality == 1)
b <- filter(df, waterbody == "Homer", sample_site == 1, species == "LMB", stock == 1)
nrow(a) / nrow(b) * 100
This results in a value of 83.333 ((10 quality / 12 stock) * 100). However, I want to do this for each species within each sample site. So for sample sites 1 and 2, there would be a value ranging from 0 to 100 for LMB and for BLG.
I'm hoping to have the end result be a data frame structured as:
results<- structure(list(waterbody = c("Homer", "Homer", "Homer", "Homer",
"Homer", "Homer"), transect = c(1L, 1L, 1L, 2L, 2L, 2L), species = c("BLC",
"BLG", "LMB", "BLC", "BLG", "GSF"), psd = c(50, 31.58, 83.33,
100, 33.33, 0)), row.names = c(NA, 6L), class = "data.frame")
The math that goes into the function is obviously pretty simple; the issue I'm running into is how to apply it to filtered data so that I am not counting, for example, the number of quality individuals across multiple sample sites.
Any help/insight would be greatly appreciated.
CodePudding user response:
Here is a dplyr solution:
library(dplyr)
df %>%
  group_by(waterbody, sample_site, species) %>%
  summarise(psd = (sum(quality == 1) / sum(stock == 1)) * 100)
  waterbody sample_site species   psd
  <chr>           <int> <chr>   <dbl>
1 Homer               1 BLG      31.6
2 Homer               1 LMB      83.3
3 Homer               2 BLG      33.3
4 Homer               2 LMB      81.8
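Since the question asks for a function, here is a minimal sketch of how the pipeline above could be wrapped in one. The name calc_psd and the toy data frame are made up for illustration, and returning NA when a group has no stock-length fish is an assumption about the desired behaviour, not something stated in the question.

```r
library(dplyr)

# Hypothetical helper: percentage of stock-length fish that are quality-length,
# computed separately for every waterbody / sample_site / species combination.
calc_psd <- function(data) {
  data %>%
    group_by(waterbody, sample_site, species) %>%
    summarise(
      n_stock   = sum(stock == 1),
      n_quality = sum(quality == 1),
      # Assumption: report NA rather than NaN/Inf when a group has no stock fish
      psd       = if_else(n_stock > 0, 100 * n_quality / n_stock, NA_real_),
      .groups   = "drop"
    )
}

# Toy data (not the asker's data) with one group that has zero stock fish
toy <- data.frame(
  waterbody   = "Homer",
  sample_site = c(1L, 1L, 1L, 2L, 2L),
  species     = c("LMB", "LMB", "BLG", "LMB", "BLG"),
  stock       = c(1, 1, 1, 1, 0),
  quality     = c(1, 0, 1, 1, 0)
)

res <- calc_psd(toy)
print(res)
```

Keeping n_stock and n_quality in the result makes it easy to spot groups where the percentage rests on only a handful of fish.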
CodePudding user response:
Can you confirm that
- transect (in the expected output) is the same thing as sample_site (in the incoming dataset)
- The expected dataset (which has values for "BLC" species) wasn't produced from the incoming dataset (which doesn't).
If so, dplyr's group_by() and summarize() are all you need. Note that this returns a proportion (0 to 1); multiply by 100 if you want a percentage.
df |>
  dplyr::group_by(waterbody, sample_site, species) |>
  dplyr::summarize(
    psd = sum(quality) / sum(stock)
  ) |>
  dplyr::ungroup()
Produces
# A tibble: 4 x 4
  waterbody sample_site species   psd
  <chr>           <int> <chr>   <dbl>
1 Homer               1 BLG     0.316
2 Homer               1 LMB     0.833
3 Homer               2 BLG     0.333
4 Homer               2 LMB     0.818
Before you run that, I suggest verifying that all values of stock and quality are nonmissing and 0/1. checkmate::assert_integerish() is ideal for this.
checkmate::assert_integerish(df$stock , any.missing = FALSE, lower = 0, upper = 1)
checkmate::assert_integerish(df$quality, any.missing = FALSE, lower = 0, upper = 1)
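If you would rather not add a dependency, a rough base R equivalent of the same check might look like the sketch below. The helper name is_binary_flag and the example vectors are made up for illustration. It also shows why the check matters: a single NA turns the plain sum, and therefore the ratio, into NA.

```r
# Hypothetical helper: TRUE only if x has no missing values and contains only 0/1
is_binary_flag <- function(x) {
  !anyNA(x) && all(x %in% c(0, 1))
}

good <- c(1, 0, 1, 1)
bad  <- c(1, NA, 2, 0)

is_binary_flag(good)
is_binary_flag(bad)

# One NA is enough to make the unguarded sum missing
sum(bad)

# In a script you could stop early, for example:
# stopifnot(is_binary_flag(df$stock), is_binary_flag(df$quality))
```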
