I have a population with 6 categories (strata), and I want to take a 10% sample from each stratum. I do this with:
library(dplyr)
var = c(rep("A",10), rep("B",10), rep("C",3), rep("D",5), "E", "F")
value = rnorm(30)
dat = tibble(var, value)
pop = dat %>% group_by(var)
pop
# Sample 10% of each stratum; strata with fewer than 10 rows return no rows
singleallocperce = slice_sample(pop, prop = 0.1)
singleallocperce
with result:
# A tibble: 2 x 2
# Groups: var [2]
var value
<chr> <dbl>
1 A -1.54
2 B -1.12
But when a stratum is too small for 10% of it to amount to a full observation, I still want to take at least one observation from it. How can I do this in R using the dplyr package?
CodePudding user response:
One possible approach (note that the data below has 20 rows of A, to check that two are returned for that stratum).
library(tidyverse)
# Data (note 20 As)
var = c(rep("A",20),rep("B",10),rep("C",3),rep("D",5),"E","F")
value = rnorm(40)
dat = tibble(var, value)
# Possible approach
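dat %>%
  group_by(var) %>%
  # min: rows to keep per stratum (10%, but at least 1)
  # random: a random permutation of 1..n within each stratum
  mutate(min = if_else(n() * 0.1 >= 1, n() * 0.1, 1),
         random = sample(n())) %>%
  filter(random <= min) %>%
  select(var, value)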
#> # A tibble: 7 × 2
#> # Groups: var [6]
#> var value
#> <chr> <dbl>
#> 1 A 0.0105
#> 2 A 0.171
#> 3 B -1.89
#> 4 C 1.89
#> 5 D 0.612
#> 6 E 0.516
#> 7 F 0.185
Created on 2022-06-02 by the reprex package (v2.0.1)
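The same idea can be written more compactly by computing the per-stratum size directly. This is only a sketch, assuming "10%" means rounding down with a floor of one row; the names `sampled` and the fixed seed are my own additions for illustration, not part of the answer above.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

# Same shape of data as above (20 As, 10 Bs, 3 Cs, 5 Ds, 1 E, 1 F)
dat <- tibble(
  var   = c(rep("A", 20), rep("B", 10), rep("C", 3), rep("D", 5), "E", "F"),
  value = rnorm(40)
)

sampled <- dat %>%
  group_by(var) %>%
  # n() is the stratum size; keep floor(10%) rows, but never fewer than 1
  slice(sample(n(), max(1, floor(n() * 0.1)))) %>%
  ungroup()

# Rows kept per stratum
count(sampled, var)
```

Here `slice()` is evaluated once per group, so `n()` refers to the size of the current stratum and `sample()` picks that many row positions at random.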
CodePudding user response:
Here is a potential solution:
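sample_func <- function(data) {
  # Proportional sample; small strata may return no rows
  standard <- data %>%
    group_by(var) %>%
    slice_sample(prop = 0.1) %>%
    ungroup()
  # NULL is ignored by bind_rows() when no stratum was missed
  mins <- NULL
  if (!all(unique(data$var) %in% unique(standard$var))) {
    # Take the first row of every stratum missing from the sample
    mins <- data %>%
      filter(!var %in% standard$var) %>%
      group_by(var) %>%
      slice(1) %>%
      ungroup()
  }
  bind_rows(standard, mins)
}
sample_func(dat)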
Which gives:
var value
<chr> <dbl>
1 A 1.36
2 B -1.03
3 C -0.0450
4 D -0.380
5 E -0.0556
6 F 0.519
The assumption is that when proportional sampling returns no rows for a value of var, the minimum is one record from that stratum. Note that slice(1) takes the first row of the stratum rather than a random one.
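If the fallback row should also be random, one option is to swap `slice(1)` for `slice_sample(n = 1)`. The following is a sketch of that variant, not the answer's original function; the name `sample_with_floor`, its `prop` argument, and the fixed seed are hypothetical and only there for illustration.

```r
library(dplyr)

# Hypothetical variant: proportional sample plus one random row
# for every stratum the proportional sample missed
sample_with_floor <- function(data, prop = 0.1) {
  standard <- data %>%
    group_by(var) %>%
    slice_sample(prop = prop) %>%
    ungroup()

  # Strata absent from `standard`; zero rows if none were missed
  fallback <- data %>%
    filter(!var %in% standard$var) %>%
    group_by(var) %>%
    slice_sample(n = 1) %>%
    ungroup()

  bind_rows(standard, fallback)
}

set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
dat <- tibble(
  var   = c(rep("A", 20), rep("B", 10), rep("C", 3), rep("D", 5), "E", "F"),
  value = rnorm(40)
)
res <- sample_with_floor(dat)
res
```

Because the fallback is built with `filter()` rather than inside an `if`, it is simply an empty tibble when every stratum is already represented, so `bind_rows()` needs no special case.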
CodePudding user response:
data.table
library(data.table)
setDT(dat) # make the tibble a data.table
dat[, .SD[sample((1:.N), fifelse(.N >= 10, .N %/% 10, 1))], var]
results
var value
1: A -0.040487
2: A 0.543354
3: B -1.100892
4: C 0.998006
5: D 0.496898
6: E 0.819967
7: F 0.629236
data
# Data (note 20 As)
var = c(rep("A",20),rep("B",10),rep("C",3),rep("D",5),"E","F")
value = rnorm(40)
dat = tibble(var, value)
