Home > Mobile >  Sort While Preserving Locations of Groups
Sort While Preserving Locations of Groups

Time:01-19

Suppose I have a tibble like tb_1 here

# A tibble: 7 x 2
  Grp     Srt
  <chr> <int>
1 A        10
2 B         4
3 B         7
4 A         5
5 A         1
6 A         3
7 B         2

which I have reproduced below:

tb_1 <- structure(
  list(
    Grp = c("A", "B", "B", "A", "A", "A", "B"),
    Srt = c(10L, 4L, 7L, 5L, 1L, 3L, 2L)
  ),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -7L)
)

I would like a performant function arrange_groups() in the style of dplyr; which will sort (by given variables) the observations within each existing group, while preserving the locations where that group is distributed.

library(dplyr)


tb_2 <- tb_1 %>%

  # Group by 'Grp'.
  group_by(Grp) %>%

  # Sort by 'Srt' WITHIN each group.
  arrange_groups(Srt)

In the resulting tb_2, the 4 observations from the "A" group should remain distributed among the 1st, 4th, 5th, and 6th rows; after they have been sorted by Srt among themselves. Likewise, the 3 observations from the "B" group should remain distributed among the 2nd, 3rd, and 7th rows.

# A tibble: 7 x 2
# Groups:   Grp [2]
  Grp     Srt
  <chr> <int>
1 A         1
2 B         2
3 B         4
4 A         3
5 A         5
6 A        10
7 B         7

I have reproduced tb_2 below:

tb_2 <- structure(
  list(
    Grp = c("A", "B", "B", "A", "A", "A", "B"),
    Srt = c(1L, 2L, 4L, 3L, 5L, 10L, 7L)
  ),
  class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -7L),
  groups = structure(
    list(
      Grp = c("A", "B"),
      .rows = structure(
        list(
          c(1L, 4L, 5L, 6L),
          c(2L, 3L, 7L)
        ),
        ptype = integer(0),
        class = c("vctrs_list_of", "vctrs_vctr", "list")
      )
    ),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA, -2L),
    .drop = TRUE
  )
)

Update

While I was able to answer my own question, I am leaving the floor open for other solutions. I am curious to see what alternatives exist, especially those that are more performant, more creative, or work with different ecosystems like data.table.

Toward Optimization

Further solutions should ideally

  1. avoid recalculating order(Srt_1, Srt_2, ...) for every column in df;
  2. be no slower than existing suggestions in data.table.

CodePudding user response:

Solutions

Within the tidyverse, that goal can be accomplished by either a simple workflow or (among others) the following two functions.

Workflow

You could simply ignore arrange_groups() and instead implement a dplyr workflow with mutate(), since operations (like order()) will be applied within groups anyway.

library(dplyr)

tb_1 %>%
    group_by(Grp) %>%

    # Arbitrary sorting variables go HERE:
    mutate(across(everything(), ~.[order(Srt)]))
    #                                    ^^^

Reordering Function

This arrange_groups_1() function sorts first by existing groups, and then by the variables given in .... With the data thus sorted within its groups, arrange_groups_1() then maps those groups back to their original locations.

arrange_groups_1 <- function(.data, ...) {
  # Arrange into group "regions", and sort within each region; then...
  dplyr::arrange(.data = .data, ... = ..., .by_group = TRUE)[
    # ...map the results back to the original structure.
    order(order(dplyr::group_indices(.data = .data))),
  ]
}

It is compatible with dplyr:

library(dplyr)

tb_1 %>%
    group_by(Grp) %>%
    arrange_groups_1(Srt)

Mutating Function

Less clever but more straightforward than arrange_groups_1(), the arrange_groups_2() solution simply implements the workflow in functional form.

arrange_groups_2 <- function(.data, ...) {
  # Capture the symbols for the sorting variables.
  dots <- dplyr::enquos(...)
  
  dplyr::mutate(
    .data = .data,
    dplyr::across(
      # Sort across the entire dataset.
      .cols = dplyr::everything(),

      # Sort each group "in place"; by variables captured from the proper scope.
      .fns = ~.[order(!!!dots)]
    )
  )
}

It too is compatible with dplyr:

library(dplyr)

tb_1 %>%
    group_by(Grp) %>%
    arrange_groups_2(Srt)

Result

Given a tb_1 like yours, all of these solutions will yield the desired result:

# A tibble: 7 x 2
# Groups:   Grp [2]
  Grp     Srt
  <chr> <int>
1 A         1
2 B         2
3 B         4
4 A         3
5 A         5
6 A        10
7 B         7

Performance

On large datasets, the disparity in performance might become significant. Given a df with 1 million observations and several variables for grouping (Grp_*) and sorting (Srt_*)

set.seed(0)

df <- data.frame(
    Record_ID = 1:1000000,

    Grp_1 = sample(x = letters[ 1:6 ] , size = 1000000, replace = TRUE ),
    Grp_2 = sample(x = letters[ 7:12] , size = 1000000, replace = TRUE ),
    Grp_3 = sample(x = letters[13:18] , size = 1000000, replace = TRUE ),
    Grp_4 = sample(x = letters[19:26] , size = 1000000, replace = TRUE ),
    
    Srt_1 = sample(x =       1:1000000, size = 1000000, replace = FALSE),
    Srt_2 = sample(x = 1000001:2000000, size = 1000000, replace = FALSE),
    Srt_3 = sample(x = 2000001:3000000, size = 1000000, replace = FALSE),
    Srt_4 = sample(x = 3000001:4000000, size = 1000000, replace = FALSE)
)

here are calculated the relative performances of each solution:

library(dplyr)
library(microbenchmark)

performances <- list(
  one_var = microbenchmark(
    arrange_groups_1 = df %>%
      group_by(Grp_1) %>%
      arrange_groups_1(Srt_1), 
    arrange_groups_2 = df %>%
      group_by(Grp_1) %>%
      arrange_groups_2(Srt_1), 
    workflow = df %>%
      group_by(Grp_1) %>%
      mutate(across(everything(), ~.[order(Srt_1)])),
    times = 50
  ),
  two_vars = microbenchmark(
    arrange_groups_1 = df %>%
      group_by(Grp_1, Grp_2) %>%
      arrange_groups_1(Srt_1, Srt_2),
    arrange_groups_2 = df %>%
      group_by(Grp_1, Grp_2) %>%
      arrange_groups_2(Srt_1, Srt_2),
    workflow = df %>%
      group_by(Grp_1, Grp_2) %>%
      mutate(across(everything(), ~.[order(Srt_1, Srt_2)])),
    times = 50
  ),
  three_vars = microbenchmark(
    arrange_groups_1 = df %>%
      group_by(Grp_1, Grp_2, Grp_3) %>%
      arrange_groups_1(Srt_1, Srt_2, Srt_3),
    arrange_groups_2 = df %>%
      group_by(Grp_1, Grp_2, Grp_3) %>%
      arrange_groups_2(Srt_1, Srt_2, Srt_3),
    workflow = df %>%
      group_by(Grp_1, Grp_2, Grp_3) %>%
      mutate(across(everything(), ~.[order(Srt_1, Srt_2, Srt_3)])),
    times = 50
  ),
  four_vars = microbenchmark(
    arrange_groups_1 = df %>%
      group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
      arrange_groups_1(Srt_1, Srt_2, Srt_3, Srt_4),
    arrange_groups_2 = df %>%
      group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
      arrange_groups_2(Srt_1, Srt_2, Srt_3, Srt_4),
    workflow = df %>%
      group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
      mutate(across(everything(), ~.[order(Srt_1, Srt_2, Srt_3, Srt_4)])),
    times = 50
  )
)

Clearly arrange_groups_1() is outclassed. I suspect arrange_groups_2() can hold its own against the workflow, and remain within sight of the latter while offering more ergonomic usage. However, this suspicion should be verified on other (and better) machines for larger sets of grouping and sorting variables.

#> performances

$one_var
Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
 arrange_groups_1 2066.4674 2155.8859 2231.3547 2199.7442 2283.5782 2565.0542    50
 arrange_groups_2  352.3775  385.1829  435.2595  444.8746  464.1493  607.0927    50
         workflow  337.2756  391.0174  428.9049  435.8385  454.7347  546.4498    50

$two_vars
Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
 arrange_groups_1 3580.5395 3688.1506 3842.2048 3799.5430 3979.9716 4317.7100    50
 arrange_groups_2  230.1166  239.9141  265.0786  249.3640  287.1006  359.1822    50
         workflow  221.6627  234.2732  256.6200  243.3707  281.2269  365.9102    50

$three_vars
Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
 arrange_groups_1 5113.6341 5340.5483 5441.3399 5443.5068 5535.0578 5946.6958    50
 arrange_groups_2  261.9329  274.1785  295.6854  282.4638  323.5710  412.0139    50
         workflow  224.8709  236.9958  263.2440  252.6042  292.7043  339.6351    50

$four_vars
Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
 arrange_groups_1 6810.3864 7035.7077 7237.6941 7156.7051 7314.4667 8051.8558    50
 arrange_groups_2  581.9000  603.7822  640.8977  626.4116  672.6488  859.8239    50
         workflow  349.7786  361.6454  391.7517  375.1532  429.3643  485.9227    50

Update

Hybrid Function

Inspired by @akrun's answer, here is a function that integrates the power of data.table...

arrange_groups_3 <- function(.data, ...) {
  # Name the variables for grouping, and their complement in '.data'.
  group_vars <- dplyr::group_vars(.data)
  other_vars <- setdiff(names(.data), group_vars)

  # For proper scoping, generate here the expression for sorting.
  sort_expr <- substitute(order(...))
  
  dplyr::as_tibble(data.table::as.data.table(.data)[,
    (other_vars) := lapply(
      # Sort each column, using an index...
      .SD, \(x, i) x[i],

      # ...which we need calculate only once.
      i = eval(sort_expr)
    ),
    group_vars
  ])
}

...with the ergonomics of dplyr.

library(dplyr)

tb_1 %>%
    group_by(Grp) %>%
    arrange_groups_3(Srt)

However, my implementation drops the original grouping in .data, so it's still a work in progress.

Fast Mutate

This rather fast implementation was inspired by @Henrik's suggestion to use dtplyr, a data.table backend for dplyr.

arrange_groups_4 <- function(.data, ...) {
  # Capture the symbols for the sorting and grouping variables.
  sort_syms <- dplyr::enquos(...)
  group_syms <- dplyr::groups(.data)
  
  .data |>

    # Use a "data.table" backend.
    dtplyr::lazy_dt() |>

    # Preserve the grouping.
    dplyr::group_by(!!!group_syms) |>

    # Perform the sorting.
    dplyr::mutate(
      dplyr::across(
        # Sort across the entire dataset.
        .cols = dplyr::everything(),
        
        # Sort each group "in place": subscript using the index...
        .fns = `[`,
        
        # ...generated when ordering by the sorting variables.
        i = order(!!!sort_syms)
      )
    )
}

Although I have yet to test it for more than 4 grouping and sorting variables, it seems to complete in linear time:

$one_var
Unit: milliseconds
             expr    min      lq     mean  median      uq     max neval
 arrange_groups_4 30.738 31.8028 46.81692 37.6586 59.8274 95.4703    50

$two_vars
Unit: milliseconds
             expr     min      lq     mean  median      uq    max neval
 arrange_groups_4 41.4364 41.9118 52.91332 46.4306 66.1674 80.171    50

$three_vars
Unit: milliseconds
             expr     min      lq     mean  median     uq      max neval
 arrange_groups_4 47.8605 48.6225 62.06675 51.9562 71.487 237.0102    50

$four_vars
Unit: milliseconds
             expr    min      lq     mean   median      uq      max neval
 arrange_groups_4 67.306 69.1426 78.68869 73.81695 88.7874 108.2624    50

CodePudding user response:

The question asked about dplyr. Here, is an attempt with data.table as this also involves efficiency. Benchmarks with OP's big dataset 'df' are below

library(data.table)
system.time({

df %>%
      group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
      mutate(across(everything(), ~.[order(Srt_1, Srt_2, Srt_3, Srt_4)]))
})
# user  system elapsed 
#  0.552   0.013   0.564 


system.time({

grpnms <- grep("Grp", names(df), value = TRUE)
othernms <- setdiff(names(df), grpnms)
setDT(df)[, (othernms) := lapply(.SD, \(x) 
         x[order(Srt_1, Srt_2, Srt_3, Srt_4)]), grpnms]


})
#  user  system elapsed 
#  0.348   0.012   0.360 

CodePudding user response:

Here's another dplyr solution relying on a join preserving the row order. (The id column can of course be dropped as a last step, and the temporary objects aren't necessary to create separately, but the method is nice and clear with this presentation.)

group_order = tb_1 %>%
  select(Grp) %>%
  group_by(Grp) %>%
  mutate(id = row_number())

row_order = tb_1 %>%
  arrange(Srt) %>%
  group_by(Grp) %>%
  mutate(id = row_number())

result = group_order %>% left_join(row_order, by = c("Grp", "id"))
result
# # A tibble: 7 × 3
# # Groups:   Grp [2]
#   Grp      id   Srt
#   <chr> <int> <int>
# 1 A         1     1
# 2 B         1     2
# 3 B         2     4
# 4 A         2     3
# 5 A         3     5
# 6 A         4    10
# 7 B         3     7

Benchmarking, this is better than arrange_groups_1 but otherwise not great:

four_vars = microbenchmark(
     arrange_groups_2 = df %>%
      group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
      arrange_groups_2(Srt_1, Srt_2, Srt_3, Srt_4),
    workflow = df %>%
      group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
      mutate(across(everything(), ~.[order(Srt_1, Srt_2, Srt_3, Srt_4)])),
    join = {
      df %>%
        group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
        mutate(id = row_number()) %>%
        left_join(
          df %>%
            arrange(Srt_1, Srt_2, Srt_3, Srt_4) %>%
            group_by(Grp_1, Grp_2, Grp_3, Grp_4) %>%
            mutate(id = row_number()),
          by = c("Grp_1", "Grp_2", "Grp_3", "Grp_4", "id")
        )
    },
    times = 10
  )
four_vars
# Unit: milliseconds
#              expr      min       lq     mean   median       uq      max neval
#  arrange_groups_2 356.7114 366.2305 393.7209 377.6245 389.1009 537.6800    10
#          workflow 242.6982 245.5079 252.8441 252.3410 257.7656 267.5277    10
#              join 366.6957 400.1438 438.5274 443.0696 477.5481 505.2293    10
  •  Tags:  
  • Related