How can I best use dplyr to subset data and create relative frequency tables?

Time:02-03

I'm using the iris data set to learn how to use dplyr, and am trying to create a relative frequency table that looks like this:

Petal.Width   .1   .2   .3   .4   .5   .6    1  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8
Species
setosa      0.10 0.58 0.14 0.14 0.02 0.02    0    0    0    0    0    0    0    0    0
versicolor     0    0    0    0    0    0 0.14 0.06 0.10 0.26 0.14 0.02 0.20 0.04 0.06

I'm struggling to group the observations by species, and then produce relative frequencies on a species by species basis.

I'm guessing it'll have to be something using group_by, mutate, and count, but the closest thing I could find online was this:

my_data %>% 
    group_by(Petal.Width,Species) %>% 
    summarise(n = n()) %>%
    ungroup %>% 
    mutate(total = sum(n), rel.freq = n / total)

This is still not quite what I'm looking for, because rel.freq divides each count by the total number of observations rather than by the number of observations for that species.

Any help is greatly appreciated!

CodePudding user response:

You could do this in dplyr, but it's a one-liner in base R:

t(apply(table(iris$Species, iris$Petal.Width), 1, function(x) x/sum(x)))
#>             
#>              0.1  0.2  0.3  0.4  0.5  0.6    1  1.1 1.2  1.3  1.4  1.5  1.6
#>   setosa     0.1 0.58 0.14 0.14 0.02 0.02 0.00 0.00 0.0 0.00 0.00 0.00 0.00
#>   versicolor 0.0 0.00 0.00 0.00 0.00 0.00 0.14 0.06 0.1 0.26 0.14 0.20 0.06
#>   virginica  0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.02 0.04 0.02
#>             
#>               1.7  1.8 1.9    2  2.1  2.2  2.3  2.4  2.5
#>   setosa     0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00
#>   versicolor 0.02 0.02 0.0 0.00 0.00 0.00 0.00 0.00 0.00
#>   virginica  0.02 0.22 0.1 0.12 0.12 0.06 0.16 0.06 0.06

Created on 2022-02-02 by the reprex package (v2.0.1)
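
As a possible simplification, and assuming all that is needed is row-wise proportions, base R's prop.table() with margin = 1 should give the same numbers without the apply()/t() round trip. A minimal sketch (the names tab and rel_freq are only illustrative):

```r
# Cross-tabulate counts: one row per species, one column per Petal.Width value
tab <- table(iris$Species, iris$Petal.Width)

# margin = 1 divides every cell by its row total, so each species row sums to 1
rel_freq <- prop.table(tab, margin = 1)

round(rel_freq, 2)
```

Using margin = 2 instead would normalise within each Petal.Width column, which is not what the question asks for.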

CodePudding user response:

Something like this?

I'm not sure about the "wide" format, though; I'd be inclined to keep the result in long format (that is, omit the pivot_wider step).

library(dplyr)
library(tidyr)

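iris %>% 
  count(Species, Petal.Width) %>%   # n = observations per Species/Petal.Width pair
  group_by(Species) %>% 
  mutate(p = n / sum(n)) %>%        # sum(n) is computed within each species
  ungroup() %>% 
  select(-n) %>% 
  pivot_wider(names_from = "Petal.Width", values_from = "p")  # one column per Petal.Width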

Result:

  Species    `0.1` `0.2` `0.3` `0.4` `0.5` `0.6`   `1` `1.1` `1.2` `1.3` `1.4` `1.5` `1.6` `1.7` `1.8` `1.9`   `2` `2.1` `2.2` `2.3` `2.4` `2.5`
  <fct>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa       0.1  0.58  0.14  0.14  0.02  0.02 NA    NA     NA   NA    NA    NA    NA    NA    NA     NA   NA    NA    NA    NA    NA    NA   
2 versicolor  NA   NA    NA    NA    NA    NA     0.14  0.06   0.1  0.26  0.14  0.2   0.06  0.02  0.02  NA   NA    NA    NA    NA    NA    NA   
3 virginica   NA   NA    NA    NA    NA    NA    NA    NA     NA   NA     0.02  0.04  0.02  0.02  0.22   0.1  0.12  0.12  0.06  0.16  0.06  0.06
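
If zeros are wanted instead of NA, as in the table in the question, one option is the values_fill argument of pivot_wider. This is a sketch that assumes tidyr 1.0.0 or later; the name rel_freq_wide is only illustrative:

```r
library(dplyr)
library(tidyr)

rel_freq_wide <- iris %>%
  count(Species, Petal.Width) %>%   # counts per Species/Petal.Width pair
  group_by(Species) %>%
  mutate(p = n / sum(n)) %>%        # proportion within each species
  ungroup() %>%
  select(-n) %>%
  # values_fill = 0 puts 0 in cells for combinations that never occur
  pivot_wider(names_from = Petal.Width, values_from = p, values_fill = 0)

rel_freq_wide
```

Each row of the result should then sum to 1 across the Petal.Width columns, which is a quick way to check that the proportions were computed per species.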