How do I subset data table columns based on column contents rather than name in R?

I am trying to use RStudio to generate lists of species presence in several regions, where each region contains >10 separate survey plots in which presence/absence of >50 species was recorded. Lists will be character vectors of species names.

Here is a dummy data table in the format I'm using, where presence at a site is indicated by 1 and absence by 0:

library(data.table)

dummy_dt <- data.table(region = c("north", "north", "south", "south"),
                       site = c("a", "b", "a", "b"),
                       species_1 = c(1, 0, 0, 0),
                       species_2 = c(0, 1, 0, 0),
                       species_3 = c(0, 1, 1, 1),
                       species_4 = c(0, 0, 1, 1))

Species 1, 2, and 3 are present in at least one "north" region site and species 3 and 4 are present in at least one "south" region site. I am interested only in presence/absence data at the regional level and not number or fraction of occupied sites within a region (site codes "a" and "b" are included in dummy_dt to make it clear that each region contains >1 site).

I assume that I will need to subset dummy_dt by region as below before proceeding:

north_dt <- dummy_dt[region == "north"]
south_dt <- dummy_dt[region == "south"]

By hand I can easily generate a species list for each region as a character vector conducive to calculation of a Jaccard similarity coefficient:

north_list <- c("species_1","species_2","species_3")
south_list <- c("species_3","species_4")

Is it possible to automate the generation of character vectors like those above, where elements of the vector are names of columns which contain one or more 1 (either using the subsetted data tables north_dt and south_dt or the original data table dummy_dt)?

CodePudding user response:

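library(data.table)

# Reshape to long format (one row per region/site/species), then keep only presences
tmp <- melt(dummy_dt, id.vars = c("region", "site"), variable.factor = FALSE)[value > 0]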
tmp
#    region   site  variable value
#    <char> <char>    <char> <num>
# 1:  north      a species_1     1
# 2:  north      b species_2     1
# 3:  north      b species_3     1
# 4:  south      a species_3     1
# 5:  south      b species_3     1
# 6:  south      a species_4     1
# 7:  south      b species_4     1

# Group the species names by region and drop duplicates from multiple sites
lapply(split(tmp$variable, tmp$region), unique)
# $north
# [1] "species_1" "species_2" "species_3"
# $south
# [1] "species_3" "species_4"
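
Since the question mentions a Jaccard similarity coefficient, here is a minimal sketch of how the resulting list could feed into one. It assumes the usual definition (size of the intersection divided by size of the union); `species_by_region` and `jaccard` are hypothetical names introduced only for this illustration, and the list is typed in by hand to match the output above so the snippet runs on its own.

```r
# Species lists per region, copied from the output shown above
species_by_region <- list(
  north = c("species_1", "species_2", "species_3"),
  south = c("species_3", "species_4")
)

# Jaccard similarity: |A intersect B| / |A union B|
# (hypothetical helper, not part of any package)
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

jaccard(species_by_region$north, species_by_region$south)
# [1] 0.25
```

Only species_3 is shared, out of four species overall, hence 1/4. Note that this sketch returns NaN if both vectors are empty, so that case may need separate handling.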

CodePudding user response:

library(data.table)

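# Species columns, identified by name
cols <- grep('species', names(dummy_dt), value = TRUE)

# Within each region, keep the names of columns that contain at least one 1
# (the \(x) lambda shorthand requires R >= 4.1)
tmp <-
  dummy_dt[, .(species = cols[sapply(.SD, \(x) any(x == 1))]),
           by = region, .SDcols = cols]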

tmp
#>    region   species
#>    <char>    <char>
#> 1:  north species_1
#> 2:  north species_2
#> 3:  north species_3
#> 4:  south species_3
#> 5:  south species_4
  
with(tmp, split(species, region))
#> $north
#> [1] "species_1" "species_2" "species_3"
#> 
#> $south
#> [1] "species_3" "species_4"
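
For comparison, the same idea (keep the names of columns that contain at least one 1, per region) can probably also be sketched in base R without any packages. This is an illustrative alternative rather than part of the answer above: `dummy_df` and `species_by_region` are names introduced here, and the sketch assumes that every species column name starts with `species` and that the data contain no `NA` values.

```r
# Same dummy data as in the question, as a plain data.frame
dummy_df <- data.frame(
  region    = c("north", "north", "south", "south"),
  site      = c("a", "b", "a", "b"),
  species_1 = c(1, 0, 0, 0),
  species_2 = c(0, 1, 0, 0),
  species_3 = c(0, 1, 1, 1),
  species_4 = c(0, 0, 1, 1)
)

# Assumption: species columns are exactly those whose names start with "species"
cols <- grep("^species", names(dummy_df), value = TRUE)

# Split the species columns by region; within each piece, a positive column
# sum means the species was recorded at one or more sites
species_by_region <- lapply(
  split(dummy_df[cols], dummy_df$region),
  function(d) names(which(colSums(d) > 0))
)

species_by_region
#> $north
#> [1] "species_1" "species_2" "species_3"
#> 
#> $south
#> [1] "species_3" "species_4"
```

Using `colSums()` relies on the 0/1 coding; if missing values are possible, passing `na.rm = TRUE` to `colSums()` is one way to handle them.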

Created on 2022-01-17 by the reprex package (v2.0.1)