I have a dataset that is not normally distributed and contains a lot of 0 values. I now want to calculate the tertiles for each column.
df <- tribble(
~shop, ~products, ~sales,
'A', 300, 100,
'B', 0, 0,
'C', 10, 2000,
'D', 0, 0,
'E', 0, 0,
'F', 0, 0,
'G', 20, 10,
'H', 0, 0,
'J', 700, 50,
'K', 0, 0,
)
Thanks to @AlexB 's answer to that question, I've tried calculating the tertiles with the following code:
df %>%
arrange(products) %>%
mutate(tertiles = ntile(products, 3)) %>%
mutate(tertiles = if_else(tertiles == 1, 'Low', if_else(tertiles == 2, 'Medium', 'High')))
However, even though the values are 0, the output is turning to 'high'. How can I calculate it more accurately?
CodePudding user response:
I think what you're looking for could be achieved by using cut instead of ntile. Use the breaks argument in cut to define the limits for the three labels, and the labels argument to specify the labels themselves.
df %>%
arrange(products) %>%
mutate(tertile = cut(products,
breaks = c(-1, 1, 100, Inf),
labels = c("low", "medium", "high")))
#> # A tibble: 10 x 4
#> shop products sales tertile
#> <chr> <dbl> <dbl> <fct>
#> 1 B 0 0 low
#> 2 D 0 0 low
#> 3 E 0 0 low
#> 4 F 0 0 low
#> 5 H 0 0 low
#> 6 K 0 0 low
#> 7 C 10 2000 medium
#> 8 G 20 10 medium
#> 9 A 300 100 high
#> 10 J 700 50 high
Addendum
To apply the same method to each column, we can do:
f <- function(x) cut(x, c(-1, 1, 100, Inf), c("low", "medium", "high"))
df %>%
arrange(products) %>%
mutate(across(c("products", "sales"), .fns = f, .names = "{.col}_tertile"))
#> # A tibble: 10 x 5
#> shop products sales products_tertile sales_tertile
#> <chr> <dbl> <dbl> <fct> <fct>
#> 1 B 0 0 low low
#> 2 D 0 0 low low
#> 3 E 0 0 low low
#> 4 F 0 0 low low
#> 5 H 0 0 low low
#> 6 K 0 0 low low
#> 7 C 10 2000 medium high
#> 8 G 20 10 medium medium
#> 9 A 300 100 high medium
#> 10 J 700 50 high medium
Created on 2022-01-23 by the reprex package (v2.0.1)
CodePudding user response:
Update:
Somehow it is obvious that I was the first with the correct solution. But Allan Cameron finished it perfectly. So this is okay as I learned many things from Allan Cameron:
To give my final solution:
df %>%
mutate(across(c(products, sales), ~cut(., breaks = 3, labels = c("low", "medium", "high")), .names = "tertile_{.col}"))
shop products sales tertile_products tertile_sales
<chr> <dbl> <dbl> <fct> <fct>
1 A 300 100 medium low
2 B 0 0 low low
3 C 10 2000 low high
4 D 0 0 low low
5 E 0 0 low low
6 F 0 0 low low
7 G 20 10 low low
8 H 0 0 low low
9 J 700 50 high low
10 K 0 0 low low
First answer:
For column products:
df %>%
arrange(products) %>%
mutate(tertiles = cut(products, breaks = 3, labels = c(1:3))) %>%
mutate(tertiles = case_when(tertiles==1 ~ "Low",
tertiles==2 ~ "Medium",
tertiles==3 ~ "High",
TRUE ~NA_character_))
shop products sales tertiles
<chr> <dbl> <dbl> <chr>
1 B 0 0 Low
2 D 0 0 Low
3 E 0 0 Low
4 F 0 0 Low
5 H 0 0 Low
6 K 0 0 Low
7 C 10 2000 Low
8 G 20 10 Low
9 A 300 100 Medium
10 J 700 50 High
