I have a vector with values (val) and a vector indicating group membership (group):
vec <- 1:9
group <- rep(1:3, c(2,4,3))
Let's say we have K groups and a total of N values, hence both vectors have length N. The goal is to efficiently construct a sparse 'block-diagonal' matrix where the first column holds the values of group 1, the second column holds the values of group 2, and so on. However, the values should not 'overlap' in the sense that there should be only one value per row, see the solution below. I need to do this thousands of times with K and N very large. Hence, the following, loop-based solution is not efficient enough:
K <- length(unique(group))
N <- length(group)
M <- matrix(0, N, K)
for(k in 1:K){
M[group == k, k] <- vec[group == k]
}
Matrix::Matrix(M, sparse = T)
9 x 3 sparse Matrix of class "dgCMatrix"
[1,] 1 . .
[2,] 2 . .
[3,] . 3 .
[4,] . 4 .
[5,] . 5 .
[6,] . 6 .
[7,] . . 7
[8,] . . 8
[9,] . . 9
For memory reasons, it would be ideal to directly construct a sparse matrix without the intermediate step based on the dense N times K matrix.
EDIT
For the small example given above, it turns out that the loop based solution is quite efficient:
Unit: microseconds
expr min lq mean median uq max neval cld
ben 734.280 771.7000 826.8372 787.5230 805.2710 3185.158 100 b
CJR 711.187 745.1855 813.9948 766.9960 781.7495 4832.476 100 b
original 199.714 221.9520 235.4320 227.9395 236.7065 379.757 100 a
However, when turning to high-dimensional examples (N = 10,000 and K = 1,000), the solution by CJR is the winner speed-wise:
Unit: milliseconds
expr min lq mean median uq max neval cld
ben 128.529311 133.308972 140.032070 135.921289 139.272589 289.668852 100 b
CJR 1.841474 2.055513 2.261732 2.201557 2.395925 6.330544 100 a
original 93.387806 118.348398 171.380301 125.884493 244.421699 365.871433 100 c
CodePudding user response:
Matrix::.bdiag() will let you construct a block-diagonal (sparse) matrix directly from a list of matrices:
mm <- lapply(split(vec, group), matrix)
Matrix::.bdiag(mm)
.bdiag(mm) is approximately equivalent to do.call(Matrix::bdiag, mm): ?bdiag says
The value of ‘bdiag()’ inherits from class ‘CsparseMatrix’, whereas ‘.bdiag()’ returns a ‘TsparseMatrix’.
(the former is a sorted compressed column-oriented form, the latter is in triplet form: ?"TsparseMatrix-class" says 'once [a triplet-oriented matrix] is created, however, the matrix is generally coerced to a ‘CsparseMatrix’ for further operations.')
?bdiag also has a Note:
This function has been written and is efficient for the case of relatively few block matrices which are typically sparse themselves.
So, this solution will definitely be better than what you've got, but further improvements might be possible.
CodePudding user response:
vec <- 1:9
group <- rep(1:3, c(2,4,3))
I would suggest building the row & column indices you need directly and then giving them to the sparse constructor.
i <- unlist(split(vec, group), use.names = F)
j <- vapply(split(vec, group), length, numeric(1))
Matrix::sparseMatrix(i=i,
j=rep(1:length(j), j),
x=vec[i])
9 x 3 sparse Matrix of class "dgCMatrix"
[1,] 1 . .
[2,] 2 . .
[3,] . 3 .
[4,] . 4 .
[5,] . 5 .
[6,] . 6 .
[7,] . . 7
[8,] . . 8
[9,] . . 9
This works when groups are not monotonic:
vec <- 1:9
group <- c(5:1, 2:5)
9 x 5 sparse Matrix of class "dgCMatrix"
[1,] . . . . 1
[2,] . . . 2 .
[3,] . . 3 . .
[4,] . 4 . . .
[5,] 5 . . . .
[6,] . 6 . . .
[7,] . . 7 . .
[8,] . . . 8 .
[9,] . . . . 9
But when groups are monotonic, it can be optimized using rle (as noted in comments):
vec <- 1:9
group <- rep(1:3, c(2,4,3))
j <- rle(group)$length
Matrix::sparseMatrix(i=1:length(group),
j=rep(1:length(j), j),
x=vec)
9 x 3 sparse Matrix of class "dgCMatrix"
[1,] 1 . .
[2,] 2 . .
[3,] . 3 .
[4,] . 4 .
[5,] . 5 .
[6,] . 6 .
[7,] . . 7
[8,] . . 8
[9,] . . 9
CodePudding user response:
You can try the code below
> Matrix(`[<-`(M, cbind(seq_along(group), group), vec))
9 x 3 sparse Matrix of class "dgCMatrix"
[1,] 1 . .
[2,] 2 . .
[3,] . 3 .
[4,] . 4 .
[5,] . 5 .
[6,] . 6 .
[7,] . . 7
[8,] . . 8
[9,] . . 9
Benchmark
microbenchmark(
ben = {
mm <- lapply(split(vec, group), matrix)
Matrix::.bdiag(mm)
},
CJR = {
i <- unlist(split(vec, group), use.names = F)
j <- vapply(split(vec, group), length, numeric(1))
Matrix::sparseMatrix(
i = i,
j = rep(1:length(j), j),
x = vec[i]
)
},
TIC = {
Matrix(`[<-`(M, cbind(seq_along(group), group), vec))
},
check = "equivalent"
)
shows
Unit: microseconds
expr min lq mean median uq max neval
ben 564.0 599.55 662.640 654.25 686.9 1213.5 100
CJR 523.1 564.70 643.524 619.65 675.0 1222.6 100
TIC 165.5 191.90 217.537 208.00 234.7 520.1 100
