Given the following data frame:
df <-
data.frame(one_letter = rep("a", 5),
other_letters = letters[2:6])
df
#> one_letter other_letters
#> 1 a b
#> 2 a c
#> 3 a d
#> 4 a e
#> 5 a f
I want to combine both columns to one, to get:
#> all_letters_combined
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
#> 6 f
Although I could utilize dplyr&tidyr and do the following:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
# yes, it gets the job done
df %>%
pivot_longer(everything()) %>%
select(value) %>%
unique()
#> # A tibble: 6 x 1
#> value
#> <chr>
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
#> 6 f
I'm nevertheless looking for a faster/more direct way to do it. This is because speed becomes an issue when our df is a tibble with list-columns that contain dataframes. Here's an example, although still pretty minimal:
library(nycflights13)
library(babynames)
library(tictoc)
bigger_tib <-
tibble(one_df = rep(list(babynames), 10),
other_dfs = list(starwars, flights, mtcars, trees, women, PlantGrowth, ToothGrowth, co2, Titanic, USArrests))
tic()
bigger_tib %>%
pivot_longer(everything()) %>%
select(value) %>%
unique()
#> # A tibble: 11 x 1
#> value
#> <list>
#> 1 <tibble [1,924,665 x 5]>
#> 2 <tibble [87 x 14]>
#> 3 <tibble [336,776 x 19]>
#> 4 <df [32 x 11]>
#> 5 <df [31 x 3]>
#> 6 <df [15 x 2]>
#> 7 <df [30 x 2]>
#> 8 <df [60 x 3]>
#> 9 <ts [468]>
#> 10 <table [4 x 2 x 2 x 2]>
#> 11 <df [50 x 4]>
toc()
#> 0.97 sec elapsed
I know the example isn't great because it doesn't demonstrate problematic run time, but in my real data this procedure gets pretty slow and I want to speed it up.
CodePudding user response:
The bottleneck is unique, which becomes extremely costly when applied to
a list of dataframes. distinct would be faster. On the other hand, if you already know that the dataframes are unique before pivoting them, giving each of them a unique id to preserve this relationship would be an even more ideal approach. That said, consider the following benchmark.
library(dplyr)
library(tidyr)
f1 <- . %>% pivot_longer(everything()) %>% select(value) %>% unique()
f2 <- . %>% pivot_longer(everything()) %>% select(value) %>% distinct()
f3 <- . %>%
rename(one_df = one_df, other_df = other_dfs) %>%
mutate(one_id = 0L, other_id = row_number()) %>%
pivot_longer(starts_with(c("one", "other")), c(NA, ".value"), names_sep = "_") %>%
distinct(id, .keep_all = TRUE)
microbenchmark::microbenchmark(f1(bigger_tib), f2(bigger_tib), f3(bigger_tib), times = 10L)
Output
> f3(bigger_tib)
# A tibble: 11 x 2
df id
<list> <int>
1 <tibble [1,924,665 x 5]> 0
2 <tibble [87 x 14]> 1
3 <df [50 x 2]> 2
4 <df [32 x 11]> 3
5 <df [31 x 3]> 4
6 <df [15 x 2]> 5
7 <df [30 x 2]> 6
8 <df [60 x 3]> 7
9 <ts [468]> 8
10 <table [4 x 2 x 2 x 2]> 9
11 <df [50 x 4]> 10
Benchmark
Unit: milliseconds
expr min lq mean median uq max neval
f1(bigger_tib) 619.5852 623.8327 638.0796 634.4866 644.9060 687.6760 10
f2(bigger_tib) 230.6140 231.6163 234.4957 234.1330 237.1576 238.6012 10
f3(bigger_tib) 4.0693 5.2220 5.5078 5.2996 5.4089 8.6592 10
One special note on that pivot_longer line: it means that we use the characters after "_" as names_to, discard the characters before "_". All values stack in the same column if having the same characters after "_".
CodePudding user response:
If the problem as stated is merging the unique value in first column with the second column. If the first column is just a repeated value and the second column contains all unique values then a simple solution is:
data.frame(all_letters_combined=c(df[1,1], df[,2]))
If you need to remove duplicates (duplicates in column 2 or column 1 is duplicated in column 2) from the resulting column. Based on ekoam's observation that dplyr::distinct() is faster than unique()
Then here an option:
distinct(data.frame(all_letters_combined=c(df[1,1], df[,2])))
Of course if there are more columns and the different possibilities of values than a more complex solution would be required.
