I have a vector with values (val) and a vector indicating group membership (group):

vec   <- 1:9
group <- rep(1:3, c(2,4,3))

Let's say we have K groups and a total of N values, hence both vectors have length N. The goal is to efficiently construct a sparse 'block-diagonal' matrix where the first column holds the values of group 1, the second column holds the values of group 2, and so on. However, the values should not 'overlap' in the sense that there should be only one value per row, see the solution below. I need to do this thousands of times with K and N very large. Hence, the following, loop-based solution is not efficient enough:

K     <- length(unique(group))
N     <- length(group)
M     <- matrix(0, N, K)

for(k in 1:K){
  
 M[group == k, k] <- vec[group == k]
        
}

Matrix::Matrix(M, sparse = T)

9 x 3 sparse Matrix of class "dgCMatrix"
           
 [1,] 1 . .
 [2,] 2 . .
 [3,] . 3 .
 [4,] . 4 .
 [5,] . 5 .
 [6,] . 6 .
 [7,] . . 7
 [8,] . . 8
 [9,] . . 9

For memory reasons, it would be ideal to directly construct a sparse matrix without the intermediate step based on the dense N times K matrix.

EDIT

For the small example given above, it turns out that the loop based solution is quite efficient:

Unit: microseconds
     expr     min       lq     mean   median       uq      max neval cld
      ben 734.280 771.7000 826.8372 787.5230 805.2710 3185.158   100   b
      CJR 711.187 745.1855 813.9948 766.9960 781.7495 4832.476   100   b
 original 199.714 221.9520 235.4320 227.9395 236.7065  379.757   100  a

However, when turning to high-dimensional examples (N = 10,000 and K = 1,000), the solution by CJR is the winner speed-wise:

Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval cld
      ben 128.529311 133.308972 140.032070 135.921289 139.272589 289.668852   100  b 
      CJR   1.841474   2.055513   2.261732   2.201557   2.395925   6.330544   100 a  
 original  93.387806 118.348398 171.380301 125.884493 244.421699 365.871433   100   c

CodePudding user response：

Matrix::.bdiag() will let you construct a block-diagonal (sparse) matrix directly from a list of matrices:

mm <- lapply(split(vec, group), matrix)
Matrix::.bdiag(mm)

.bdiag(mm) is approximately equivalent to do.call(Matrix::bdiag, mm): ?bdiag says

The value of ‘bdiag()’ inherits from class ‘CsparseMatrix’, whereas ‘.bdiag()’ returns a ‘TsparseMatrix’.

(the former is a sorted compressed column-oriented form, the latter is in triplet form: ?"TsparseMatrix-class" says 'once [a triplet-oriented matrix] is created, however, the matrix is generally coerced to a ‘CsparseMatrix’ for further operations.')

?bdiag also has a Note:

This function has been written and is efficient for the case of relatively few block matrices which are typically sparse themselves.

So, this solution will definitely be better than what you've got, but further improvements might be possible.

CodePudding user response：

vec   <- 1:9
group <- rep(1:3, c(2,4,3))

I would suggest building the row & column indices you need directly and then giving them to the sparse constructor.

i <- unlist(split(vec, group), use.names = F)
j <- vapply(split(vec, group), length, numeric(1))
Matrix::sparseMatrix(i=i,
                     j=rep(1:length(j), j),
                     x=vec[i])

9 x 3 sparse Matrix of class "dgCMatrix"
           
 [1,] 1 . .
 [2,] 2 . .
 [3,] . 3 .
 [4,] . 4 .
 [5,] . 5 .
 [6,] . 6 .
 [7,] . . 7
 [8,] . . 8
 [9,] . . 9

This works when groups are not monotonic:

vec   <- 1:9
group <- c(5:1, 2:5)

9 x 5 sparse Matrix of class "dgCMatrix"
               
 [1,] . . . . 1
 [2,] . . . 2 .
 [3,] . . 3 . .
 [4,] . 4 . . .
 [5,] 5 . . . .
 [6,] . 6 . . .
 [7,] . . 7 . .
 [8,] . . . 8 .
 [9,] . . . . 9

But when groups are monotonic, it can be optimized using rle (as noted in comments):

vec   <- 1:9
group <- rep(1:3, c(2,4,3))

j <- rle(group)$length
Matrix::sparseMatrix(i=1:length(group),
                     j=rep(1:length(j), j),
                     x=vec)

9 x 3 sparse Matrix of class "dgCMatrix"
           
 [1,] 1 . .
 [2,] 2 . .
 [3,] . 3 .
 [4,] . 4 .
 [5,] . 5 .
 [6,] . 6 .
 [7,] . . 7
 [8,] . . 8
 [9,] . . 9

CodePudding user response：

You can try the code below

> Matrix(`[<-`(M, cbind(seq_along(group), group), vec))
9 x 3 sparse Matrix of class "dgCMatrix"

 [1,] 1 . .
 [2,] 2 . .
 [3,] . 3 .
 [4,] . 4 .
 [5,] . 5 .
 [6,] . 6 .
 [7,] . . 7
 [8,] . . 8
 [9,] . . 9

Benchmark

microbenchmark(
  ben = {
    mm <- lapply(split(vec, group), matrix)
    Matrix::.bdiag(mm)
  },
  CJR = {
    i <- unlist(split(vec, group), use.names = F)
    j <- vapply(split(vec, group), length, numeric(1))
    Matrix::sparseMatrix(
      i = i,
      j = rep(1:length(j), j),
      x = vec[i]
    )
  },
  TIC = {
    Matrix(`[<-`(M, cbind(seq_along(group), group), vec))
  },
  check = "equivalent"
)

shows

Unit: microseconds
 expr   min     lq    mean median    uq    max neval
  ben 564.0 599.55 662.640 654.25 686.9 1213.5   100
  CJR 523.1 564.70 643.524 619.65 675.0 1222.6   100
  TIC 165.5 191.90 217.537 208.00 234.7  520.1   100