How to speed up combining columns when one column is just a repetition of the same value?-CodePudding

Given the following data frame:

df <-
  data.frame(one_letter = rep("a", 5),
             other_letters = letters[2:6])

df
#>   one_letter other_letters
#> 1          a             b
#> 2          a             c
#> 3          a             d
#> 4          a             e
#> 5          a             f

I want to combine both columns to one, to get:

#>   all_letters_combined
#> 1                    a
#> 2                    b
#> 3                    c
#> 4                    d
#> 5                    e
#> 6                    f

Although I could utilize dplyr&tidyr and do the following:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# yes, it gets the job done
df %>%
  pivot_longer(everything()) %>%
  select(value) %>%
  unique()
#> # A tibble: 6 x 1
#>   value
#>   <chr>
#> 1 a    
#> 2 b    
#> 3 c    
#> 4 d    
#> 5 e    
#> 6 f

I'm nevertheless looking for a faster/more direct way to do it. This is because speed becomes an issue when our df is a tibble with list-columns that contain dataframes. Here's an example, although still pretty minimal:

library(nycflights13)
library(babynames)
library(tictoc)


bigger_tib <- 
  tibble(one_df = rep(list(babynames), 10),
         other_dfs = list(starwars, flights, mtcars, trees, women, PlantGrowth, ToothGrowth, co2, Titanic, USArrests))

tic()
bigger_tib %>%
  pivot_longer(everything()) %>%
  select(value) %>%
  unique()
#> # A tibble: 11 x 1
#>    value                   
#>    <list>                  
#>  1 <tibble [1,924,665 x 5]>
#>  2 <tibble [87 x 14]>      
#>  3 <tibble [336,776 x 19]> 
#>  4 <df [32 x 11]>          
#>  5 <df [31 x 3]>           
#>  6 <df [15 x 2]>           
#>  7 <df [30 x 2]>           
#>  8 <df [60 x 3]>           
#>  9 <ts [468]>              
#> 10 <table [4 x 2 x 2 x 2]> 
#> 11 <df [50 x 4]>
toc()
#> 0.97 sec elapsed

I know the example isn't great because it doesn't demonstrate problematic run time, but in my real data this procedure gets pretty slow and I want to speed it up.

CodePudding user response：

The bottleneck is unique, which becomes extremely costly when applied to a list of dataframes. distinct would be faster. On the other hand, if you already know that the dataframes are unique before pivoting them, giving each of them a unique id to preserve this relationship would be an even more ideal approach. That said, consider the following benchmark.

library(dplyr)
library(tidyr)

f1 <- . %>% pivot_longer(everything()) %>% select(value) %>% unique()
f2 <- . %>% pivot_longer(everything()) %>% select(value) %>% distinct()
f3 <- . %>% 
  rename(one_df = one_df, other_df = other_dfs) %>% 
  mutate(one_id = 0L, other_id = row_number()) %>% 
  pivot_longer(starts_with(c("one", "other")), c(NA, ".value"), names_sep = "_") %>% 
  distinct(id, .keep_all = TRUE)

microbenchmark::microbenchmark(f1(bigger_tib), f2(bigger_tib), f3(bigger_tib), times = 10L)

Output

> f3(bigger_tib)
# A tibble: 11 x 2
   df                          id
   <list>                   <int>
 1 <tibble [1,924,665 x 5]>     0
 2 <tibble [87 x 14]>           1
 3 <df [50 x 2]>                2
 4 <df [32 x 11]>               3
 5 <df [31 x 3]>                4
 6 <df [15 x 2]>                5
 7 <df [30 x 2]>                6
 8 <df [60 x 3]>                7
 9 <ts [468]>                   8
10 <table [4 x 2 x 2 x 2]>      9
11 <df [50 x 4]>               10

Benchmark

Unit: milliseconds
           expr      min       lq     mean   median       uq      max neval
 f1(bigger_tib) 619.5852 623.8327 638.0796 634.4866 644.9060 687.6760    10
 f2(bigger_tib) 230.6140 231.6163 234.4957 234.1330 237.1576 238.6012    10
 f3(bigger_tib)   4.0693   5.2220   5.5078   5.2996   5.4089   8.6592    10

One special note on that pivot_longer line: it means that we use the characters after "_" as names_to, discard the characters before "_". All values stack in the same column if having the same characters after "_".

CodePudding user response：

If the problem as stated is merging the unique value in first column with the second column. If the first column is just a repeated value and the second column contains all unique values then a simple solution is:

data.frame(all_letters_combined=c(df[1,1], df[,2]))

If you need to remove duplicates (duplicates in column 2 or column 1 is duplicated in column 2) from the resulting column. Based on ekoam's observation that dplyr::distinct() is faster than unique() Then here an option:

distinct(data.frame(all_letters_combined=c(df[1,1], df[,2])))

Of course if there are more columns and the different possibilities of values than a more complex solution would be required.