How can I split a large dataset and remove the variable that it was split by [R]

Time:01-30

I'd like to split my dataset by the variable group and then remove that variable from each resulting data frame. Right now I'm using a for loop, but I'm looking for a base R approach that avoids the loop and doesn't require loading dplyr or a similar package.

n <- 10
x <- runif(n)*10
y <- runif(n)*10
group <- rep(1:2, each=5)

my_data <- as.data.frame(cbind(group, x, y))
subset_data <- split(my_data, my_data$group, drop=TRUE)


drop_column <- "group"
for (i in 1:length(unique(group))){
  subset_data[[i]] <- subset_data[[i]][,!(names(subset_data[[i]]) %in% drop_column)]
}

Thank you.

CodePudding user response:

A base R option using subset inside lapply. You can use split and remove the grouping variable all in one step.

lapply(split(my_data, my_data$group, drop=TRUE), subset, select = -group)

Output

$`1`
         x         y
1 3.421037 0.2846179
2 9.219159 5.0449367
3 4.157628 1.3970608
4 3.412703 2.2196774
5 9.948763 6.5528746

$`2`
           x         y
6  0.3746215 3.4387533
7  3.0722134 0.5371084
8  3.0580508 0.4649525
9  3.6308661 6.5796197
10 6.4435513 3.0641620

CodePudding user response:

You can use group_split from dplyr and set the .keep argument to FALSE:

library(dplyr)
subset_data <- my_data |>
  group_split(group, .keep = FALSE)

<list_of<
  tbl_df<
    x: double
    y: double
  >
>[2]>
[[1]]
# A tibble: 5 x 2
      x     y
  <dbl> <dbl>
1  9.43  1.84
2  2.34  9.41
3  6.96  7.56
4  7.91  5.11
5  1.52  3.38

[[2]]
# A tibble: 5 x 2
       x     y
   <dbl> <dbl>
1 2.71   6.14 
2 0.959  8.13 
3 0.0337 0.315
4 1.26   8.30 
5 4.73   0.122

CodePudding user response:

The idea is borrowed from the question "Delete a column in a data frame within a list":

n <- 10
x <- runif(n)*10
y <- runif(n)*10
group <- rep(1:2, each=5)

my_data <- as.data.frame(cbind(group, x, y))
subset_data <- split(my_data, my_data$group, drop=TRUE)

lapply(subset_data, function(x) x[!(names(x) %in% "group")])

Output

$`1`
         x        y
1 3.323947 3.337749
2 6.508705 4.763512
3 2.580168 8.921983
4 4.785452 8.643395
5 7.663107 3.899895

$`2`
           x        y
6  0.8424691 7.773207
7  8.7532133 9.606180
8  3.3907294 4.346595
9  8.3944035 7.125147
10 3.4668349 3.999944
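
If the name of the splitting column is stored in a string (like drop_column in the question), the base R answers above can be wrapped in a small helper. This is only a sketch: split_drop is a hypothetical name, not a base R function, and the seed is arbitrary and only there to make the example repeatable. It uses single-bracket column selection, which keeps each piece a data frame even if only one column is left.

```r
# Hypothetical helper (not part of base R): split a data frame by one column
# and drop that column from every piece.
split_drop <- function(data, by) {
  pieces <- split(data, data[[by]], drop = TRUE)
  # d[cols] selects columns list-style, so the result is always a data frame
  lapply(pieces, function(d) d[setdiff(names(d), by)])
}

set.seed(1)  # arbitrary seed, assumed here only for repeatability
my_data <- data.frame(group = rep(1:2, each = 5),
                      x = runif(10) * 10,
                      y = runif(10) * 10)

subset_data <- split_drop(my_data, "group")
names(subset_data)       # "1" "2"
names(subset_data[[1]])  # "x" "y"
```

Passing the column name as a string avoids the non-standard evaluation that subset(select = -group) relies on, which may be convenient when the column to split by is not known until run time.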