I am trying to use RStudio to generate lists of species presence in several regions, where each region contains >10 separate survey plots in which presence/absence of >50 species was recorded. Lists will be character vectors of species names.
Here is a dummy data table in the format I'm using, where presence at a site is indicated by 1 and absence by 0:
library(data.table)
dummy_dt <- data.table(region = c("north", "north", "south", "south"),
                       site = c("a", "b", "a", "b"),
                       species_1 = c(1, 0, 0, 0),
                       species_2 = c(0, 1, 0, 0),
                       species_3 = c(0, 1, 1, 1),
                       species_4 = c(0, 0, 1, 1))
Species 1, 2, and 3 are present in at least one "north" region site and species 3 and 4 are present in at least one "south" region site. I am interested only in presence/absence data at the regional level and not number or fraction of occupied sites within a region (site codes "a" and "b" are included in dummy_dt to make it clear that each region contains >1 site).
I assume that I will need to subset dummy_dt by region as below before proceeding:
north_dt <- dummy_dt[region == "north"]
south_dt <- dummy_dt[region == "south"]
By hand, I can easily write out a species list for each region as a character vector, which is the form I need for calculating a Jaccard similarity coefficient:
north_list <- c("species_1","species_2","species_3")
south_list <- c("species_3","species_4")
Is it possible to automate the generation of character vectors like those above, where the elements of each vector are the names of the columns that contain at least one 1? The solution can use either the subsetted data tables north_dt and south_dt or the original data table dummy_dt.
CodePudding user response:
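library(data.table)
# Reshape to long format (one row per region/site/species) and keep only presences
tmp <- melt(dummy_dt, id.vars = c("region", "site"), variable.factor = FALSE)[value > 0]
tmp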
#    region   site  variable value
#    <char> <char>    <char> <num>
# 1:  north      a species_1     1
# 2:  north      b species_2     1
# 3:  north      b species_3     1
# 4:  south      a species_3     1
# 5:  south      b species_3     1
# 6:  south      a species_4     1
# 7:  south      b species_4     1
# Split the species names by region, then drop duplicates within each region
lapply(split(tmp$variable, tmp$region), unique)
# $north
# [1] "species_1" "species_2" "species_3"
# $south
# [1] "species_3" "species_4"
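Since the question mentions a Jaccard similarity coefficient as the end goal, here is a minimal sketch of how a named list like the one above could feed into that calculation. The helper name jaccard is hypothetical (it is not from any package), and the species lists are typed out again only so the snippet runs on its own.

```r
# Species lists in the same shape as the output above
species_by_region <- list(
  north = c("species_1", "species_2", "species_3"),
  south = c("species_3", "species_4")
)

# Hypothetical helper: size of the intersection divided by size of the union
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# One shared species out of four distinct species overall
jaccard(species_by_region$north, species_by_region$south)
# [1] 0.25
```

Note that this sketch returns NaN when both lists are empty, so that case may need separate handling depending on how you want to treat regions with no recorded species.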
CodePudding user response:
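library(data.table)
# Names of the species columns
cols <- grep('species', names(dummy_dt), value = TRUE)
# For each region, keep the names of the columns that contain at least one 1
tmp <-
  dummy_dt[, .(species = cols[sapply(.SD, \(x) any(x == 1))]),
           by = region, .SDcols = cols]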
tmp
#>    region   species
#>    <char>    <char>
#> 1:  north species_1
#> 2:  north species_2
#> 3:  north species_3
#> 4:  south species_3
#> 5:  south species_4
with(tmp, split(species, region))
#> $north
#> [1] "species_1" "species_2" "species_3"
#>
#> $south
#> [1] "species_3" "species_4"
Created on 2022-01-17 by the reprex package (v2.0.1)
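The question also asks whether the subsetted tables north_dt and south_dt could be used directly. One possible sketch, assuming that every column other than region and site is a 0/1 species column with no missing values, is to sum each species column within the subset and keep the names whose sum is above zero. The name species_cols is introduced here only for illustration.

```r
library(data.table)

dummy_dt <- data.table(region = c("north", "north", "south", "south"),
                       site = c("a", "b", "a", "b"),
                       species_1 = c(1, 0, 0, 0),
                       species_2 = c(0, 1, 0, 0),
                       species_3 = c(0, 1, 1, 1),
                       species_4 = c(0, 0, 1, 1))

# Assumption: everything except region and site is a species column
species_cols <- setdiff(names(dummy_dt), c("region", "site"))

north_dt <- dummy_dt[region == "north"]
south_dt <- dummy_dt[region == "south"]

# A column sum above 0 means the species was recorded at one or more sites
north_list <- species_cols[colSums(north_dt[, ..species_cols]) > 0]
south_list <- species_cols[colSums(south_dt[, ..species_cols]) > 0]

north_list
# [1] "species_1" "species_2" "species_3"
south_list
# [1] "species_3" "species_4"
```

This needs one line per region, so with many regions the grouped approaches above are likely less repetitive.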
