I have a large dataframe in which some columns contain long strings of comma-separated numerical data of unequal lengths, here the columns A, B, and C:
df <- data.frame(
id = 1:3,
A = c("200, 100, 80, 100","120, 210, 220", "170, 200"),
B = c("0.1, 0.2, 0.3","0.2, 0.3, 1.0, 0.4, 0.9", "0.55, 0.77, 0.99, 0.35"),
C = c("700.1, 701.0, 699.2", "702.5, 702.9", "705.4, 705.4, 706.0")
)
I need to calculate for this numerical data in A, B, and C the percentage change. I take it that to facilitate this, I need to separate each number into its own row using separate_rows. But how can I do this step for all three columns A, B, and C in one go?
All I can do is do it column by column - first for A, then for B, and finally for C:
library(tidyverse)
df %>%
# Step 1 - column `A`:
separate_rows(A, sep = ",", convert = TRUE) %>%
mutate(A_0 = lag((lead(A)-A)/A*100)) %>%
group_by(id) %>%
summarise(across(c(B,C), first),
A = paste0(A, collapse = ", "),
A_0 = paste0(A_0, collapse = ", ")
) %>%
ungroup() %>%
# Step 2 - column `B`:
separate_rows(B, sep = ",", convert = TRUE) %>%
mutate(B_0 = lag((lead(B)-B)/B*100)) %>%
group_by(id) %>%
summarise(across(c(A,A_0,C), first),
B = paste0(B, collapse = ", "),
B_0 = paste0(B_0, collapse = ", ")
) %>%
ungroup() %>%
# Step 3 - column `C`:
separate_rows(C, sep = ",", convert = TRUE) %>%
mutate(C_0 = lag((lead(C)-C)/C*100)) %>%
group_by(id) %>%
summarise(across(c(A,A_0,B,B_0), first),
C = paste0(C, collapse = ", "),
C_0 = paste0(C_0, collapse = ", ")
)
# A tibble: 3 × 7
id A A_0 B B_0 C C_0
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 200, 100, 80, 100 NA, -50, -20, 25 0.1, 0.… NA, 100, 50 700.1… NA, 0.12…
2 2 120, 210, 220 20, 75, 4.76190476190476 0.2, 0.… -33.33333333… 702.5… 0.471967…
3 3 170, 200 -22.7272727272727, 17.6470588235294 0.55, 0… -38.88888888… 705.4… 0.355669…
Is there a better way?
CodePudding user response:
We may loop across the columns, split at the , followed by one or more spaces (\\s ), loop over the list with map, convert to numeric, get the lag of the difference between lead and the current value proportion, pasted (toString) return as a character vector (_chr), and if needed order the columns within select
library(dplyr)
library(purrr)
df %>%
mutate(across(A:C, ~ {
map_chr(strsplit(.x, ",\\s "), ~ {
tmp <- as.numeric(.x)
toString(lag((lead(tmp)- tmp)/tmp *100))})
}, .names = "{.col}_0")) %>%
select(id, gtools::mixedsort(names(.)[-1]))
-output
id A A_0 B B_0 C
1 1 200, 100, 80, 100 NA, -50, -20, 25 0.1, 0.2, 0.3 NA, 100, 50 700.1, 701.0, 699.2
2 2 120, 210, 220 NA, 75, 4.76190476190476 0.2, 0.3, 1.0, 0.4, 0.9 NA, 50, 233.333333333333, -60, 125 702.5, 702.9
3 3 170, 200 NA, 17.6470588235294 0.55, 0.77, 0.99, 0.35 NA, 40, 28.5714285714286, -64.6464646464647 705.4, 705.4, 706.0
C_0
1 NA, 0.128553063848018, -0.256776034236798
2 NA, 0.0569395017793562
3 NA, 0, 0.0850581230507546
CodePudding user response:
Something like this can get you started
library(tidyverse)
#> Warning: package 'tidyr' was built under R version 4.1.3
#> Warning: package 'readr' was built under R version 4.1.3
#> Warning: package 'dplyr' was built under R version 4.1.3
df_example <- data.frame(
id = 1:3,
A = c("200, 100, 80, 100","120, 210, 220", "170, 200"),
B = c("0.1, 0.2, 0.3","0.2, 0.3, 1.0, 0.4, 0.9", "0.55, 0.77, 0.99, 0.35"),
C = c("700.1, 701.0, 699.2", "702.5, 702.9", "705.4, 705.4, 706.0")
)
df_example |>
as_tibble() |>
mutate(across(-id,
.names = "{.col}_0",
\(x) x |>
str_split(',') |>
map(as.numeric) |>
map(\(x) lag((lead(x)-x)/x*100)) |>
map_chr(toString))
)
#> # A tibble: 3 x 7
#> id A B C A_0 B_0 C_0
#> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 200, 100, 80, 100 0.1, 0.2, 0.3 700.1, 701.~ NA, ~ NA, ~ NA, ~
#> 2 2 120, 210, 220 0.2, 0.3, 1.0, 0.4, 0.9 702.5, 702.9 NA, ~ NA, ~ NA, ~
#> 3 3 170, 200 0.55, 0.77, 0.99, 0.35 705.4, 705.~ NA, ~ NA, ~ NA, ~
Created on 2022-03-28 by the reprex package (v2.0.1)
