How can I create severals DF from factor label?-CodePudding

I'm pretty new to using R and I'm facing a problem. I have a general DF from which I want to create several DFs, one for each category of a factor type variable, also applying some condition. Do you know how I can do this?

The idea is this:

Gen_DF with, for example, a variable called "income" with labels "High", "Medium", "Low".

I want to get 3 more DF, one called High, other Medium and other Low, but adding a condition because the data has a lot of outliers for each category that I want to exclude (i.e., 95% percent of data for each category is ok). I know how do it one by one, but the problem is that my real DF doesn't have 3 categories but 21 and it's quite boring doing it one by one.

High <- Gen_DF %>% subset(income=="High") %>% subset(wages < quantile(wages, 0.95))

I need something like this but... automated.

Thanks!

CodePudding user response：

I base R we can use split to create a list of data.frames. Below we supply iris$Species as factor to split() to get a list of three data.frames for each factor level of Species. We can then filter each data.frame with lapply and subset. Finally, we can bind the list back together to one big data.frame with do.call and rbind if needed.

iris |> 
  split(iris$Species) |>
  lapply(FUN = \(x) subset(x, Sepal.Length < 5)) |> 
  do.call(what = "rbind") # this line is optional in case we want the result in one df

#>            Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> setosa.2            4.9         3.0          1.4         0.2     setosa
#> setosa.3            4.7         3.2          1.3         0.2     setosa
#> setosa.4            4.6         3.1          1.5         0.2     setosa
#> setosa.7            4.6         3.4          1.4         0.3     setosa
#> setosa.9            4.4         2.9          1.4         0.2     setosa
#> setosa.10           4.9         3.1          1.5         0.1     setosa
#> setosa.12           4.8         3.4          1.6         0.2     setosa
#> setosa.13           4.8         3.0          1.4         0.1     setosa
#> setosa.14           4.3         3.0          1.1         0.1     setosa
#> setosa.23           4.6         3.6          1.0         0.2     setosa
#> setosa.25           4.8         3.4          1.9         0.2     setosa
#> setosa.30           4.7         3.2          1.6         0.2     setosa
#> setosa.31           4.8         3.1          1.6         0.2     setosa
#> setosa.35           4.9         3.1          1.5         0.2     setosa
#> setosa.38           4.9         3.6          1.4         0.1     setosa
#> setosa.39           4.4         3.0          1.3         0.2     setosa
#> setosa.42           4.5         2.3          1.3         0.3     setosa
#> setosa.43           4.4         3.2          1.3         0.2     setosa
#> setosa.46           4.8         3.0          1.4         0.3     setosa
#> setosa.48           4.6         3.2          1.4         0.2     setosa
#> versicolor          4.9         2.4          3.3         1.0 versicolor
#> virginica           4.9         2.5          4.5         1.7  virginica

^{Created on 2022-01-28 by the reprex package (v2.0.1)}

The above can also be done in {dplyr}.

To get one big data.frame we use bind_rows:

library(dplyr)

iris %>% 
  group_by(Species) %>% 
  filter(Sepal.Length < 5) %>% 
  bind_rows

#> # A tibble: 22 x 5
#> # Groups:   Species [3]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          4.9         3            1.4         0.2 setosa 
#>  2          4.7         3.2          1.3         0.2 setosa 
#>  3          4.6         3.1          1.5         0.2 setosa 
#>  4          4.6         3.4          1.4         0.3 setosa 
#>  5          4.4         2.9          1.4         0.2 setosa 
#>  6          4.9         3.1          1.5         0.1 setosa 
#>  7          4.8         3.4          1.6         0.2 setosa 
#>  8          4.8         3            1.4         0.1 setosa 
#>  9          4.3         3            1.1         0.1 setosa 
#> 10          4.6         3.6          1           0.2 setosa 
#> # ... with 12 more rows

^{Created on 2022-01-28 by the reprex package (v2.0.1)}

To get a list of data.frames we use group_map:

library(dplyr)

iris %>% 
  group_by(Species) %>% 
  group_map(~ filter(.x, Sepal.Length < 5))

#> [[1]]
#> # A tibble: 20 x 4
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width
#>           <dbl>       <dbl>        <dbl>       <dbl>
#>  1          4.9         3            1.4         0.2
#>  2          4.7         3.2          1.3         0.2
#>  3          4.6         3.1          1.5         0.2
#>  4          4.6         3.4          1.4         0.3
#>  5          4.4         2.9          1.4         0.2
#>  6          4.9         3.1          1.5         0.1
#>  7          4.8         3.4          1.6         0.2
#>  8          4.8         3            1.4         0.1
#>  9          4.3         3            1.1         0.1
#> 10          4.6         3.6          1           0.2
#> 11          4.8         3.4          1.9         0.2
#> 12          4.7         3.2          1.6         0.2
#> 13          4.8         3.1          1.6         0.2
#> 14          4.9         3.1          1.5         0.2
#> 15          4.9         3.6          1.4         0.1
#> 16          4.4         3            1.3         0.2
#> 17          4.5         2.3          1.3         0.3
#> 18          4.4         3.2          1.3         0.2
#> 19          4.8         3            1.4         0.3
#> 20          4.6         3.2          1.4         0.2
#> 
#> [[2]]
#> # A tibble: 1 x 4
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#>          <dbl>       <dbl>        <dbl>       <dbl>
#> 1          4.9         2.4          3.3           1
#> 
#> [[3]]
#> # A tibble: 1 x 4
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#>          <dbl>       <dbl>        <dbl>       <dbl>
#> 1          4.9         2.5          4.5         1.7

^{Created on 2022-01-28 by the reprex package (v2.0.1)}

CodePudding user response：

As the question is tagged with automation, my suggestion is to keep the data in a common data structure instead of creating individual (and individually named) data objects. A common data structure will allow for automation of subsequent processing steps.

The split-apply-combine paradigm is often used in data analysis, see here or here, e.g.

It can be implemented in several ways. My preferred option is to keep all data in one data.frame as long as possible and to use grouping to process individual subsets rather than splitting and to process list elements.

So, OP's example to filter subsets by an individually computed threshold (95% quantile) can be implemented, for instance, using dplyr or data.table:

library(dplyr)
iris %>% 
  group_by(Species) %>% 
  filter(Sepal.Length < quantile(Sepal.Length, 0.95))

# A tibble: 139 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ... with 129 more rows

library(data.table)  
as.data.table(iris)[, .SD[Sepal.Length < quantile(Sepal.Length, 0.95)], by = Species]

       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
        <fctr>        <num>       <num>        <num>       <num>
  1:    setosa          5.1         3.5          1.4         0.2
  2:    setosa          4.9         3.0          1.4         0.2
  3:    setosa          4.7         3.2          1.3         0.2
  4:    setosa          4.6         3.1          1.5         0.2
  5:    setosa          5.0         3.6          1.4         0.2
 ---                                                            
135: virginica          6.7         3.0          5.2         2.3
136: virginica          6.3         2.5          5.0         1.9
137: virginica          6.5         3.0          5.2         2.0
138: virginica          6.2         3.4          5.4         2.3
139: virginica          5.9         3.0          5.1         1.8

Note that the result still is a data.frame (or tibble or data.table, resp.) but now with 139 rows instead of 150. There is no call to bind_rows() required.

In order to verify which rows have been filtered out, we can show the 95% quantile for each group

as.data.table(iris)[, quantile(Sepal.Length, 0.95), by = Species]

      Species    V1
       <fctr> <num>
1:     setosa 5.610
2: versicolor 6.755
3:  virginica 7.700

as well as the outliers

as.data.table(iris)[, .SD[Sepal.Length >= quantile(Sepal.Length, 0.95)], by = Species]

       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
        <fctr>        <num>       <num>        <num>       <num>
 1:     setosa          5.8         4.0          1.2         0.2
 2:     setosa          5.7         4.4          1.5         0.4
 3:     setosa          5.7         3.8          1.7         0.3
 4: versicolor          7.0         3.2          4.7         1.4
 5: versicolor          6.9         3.1          4.9         1.5
 6: versicolor          6.8         2.8          4.8         1.4
 7:  virginica          7.7         3.8          6.7         2.2
 8:  virginica          7.7         2.6          6.9         2.3
 9:  virginica          7.7         2.8          6.7         2.0
10:  virginica          7.9         3.8          6.4         2.0
11:  virginica          7.7         3.0          6.1         2.3