I have some data that I want to split into 4 equal parts based on the group.
My dataframe looks like this:
| X | Group |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
| 7 | 2 |
| 8 | 2 |
| 9 | 3 |
| 10 | 3 |
| 11 | 3 |
| 12 | 3 |
| 13 | 3 |
| 14 | 3 |
| 15 | 3 |
| 16 | 3 |
Now I thought about adding a third column to mark which rows belong to which split, like this:
| X | Group | Split |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 1 | 3 |
| 3 | 1 | 2 |
| 4 | 1 | 4 |
| 5 | 1 | 4 |
| 6 | 1 | 2 |
| 7 | 2 | 3 |
| 8 | 2 | 1 |
| 9 | 3 | 1 |
| 10 | 3 | 2 |
| 11 | 3 | 3 |
| 12 | 3 | 4 |
| 13 | 3 | 1 |
| 14 | 3 | 2 |
| 15 | 3 | 3 |
| 16 | 3 | 4 |
I don't need to actually split the dataset, because the data are videos and I just have to mark who (which person) has to watch them.
I know how to generate random numbers, but I need them to be stratified by group.
I know how to get a stratified sample, but that's not what I want, because I want to distribute ALL data (videos in this case), just in a stratified fashion.
Can you help me achieve this?
Thank you!
edit: I changed the example to unequally sized groups.
CodePudding user response:
You can easily do this kind of stratified operation using dplyr::group_by():
library(tidyverse)

df <- data.frame(
  X = 1:12,
  Group = c(rep(1, 4), rep(2, 4), rep(3, 4))
)

df %>%
  group_by(Group) %>%
  # shuffle the row positions within each group, then map them onto splits 1 to 4
  mutate(Split = sample(seq_along(X), size = n(), replace = FALSE) %% 4 + 1) %>%
  ungroup()
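
One caveat for the unequally sized groups from the edited question: assuming the intended expression is `%% 4 + 1`, positions 1..n in a group always map onto the splits in the same cycle (2, 3, 4, 1, ...), so a group whose size is not a multiple of 4 hands its extra videos to the same splits every time, and the totals per person can drift apart. Below is a rough base R sketch of one possible variant, not the only way to do it: shuffle the rows within each group, then deal the splits out cyclically across the whole table. The names `n_splits`, `shuffled`, `offset`, `result` and `counts` are made up for this illustration, and `set.seed()` is only there so the example is reproducible.

```r
set.seed(1)  # only to make the illustration reproducible

n_splits <- 4
df <- data.frame(
  X = 1:16,
  Group = c(rep(1, 6), rep(2, 2), rep(3, 8))
)

# Order by group, breaking ties with a random key,
# so rows end up shuffled within each group
shuffled <- df[order(df$Group, runif(nrow(df))), ]

# Deal splits out cyclically over the whole ordered table,
# starting at a randomly chosen viewer
offset <- sample(0:(n_splits - 1), 1)
shuffled$Split <- (seq_len(nrow(shuffled)) - 1 + offset) %% n_splits + 1

# Restore the original row order
result <- shuffled[order(shuffled$X), ]

# Videos per group (rows) and viewer (columns)
counts <- table(result$Group, result$Split)
print(counts)
```

Because the rows of each group sit next to each other when the splits are dealt out, the counts within a group should differ by at most one between viewers, and the same holds for the overall totals per viewer.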
