I have the following code:
import numpy as np
def suborder(x, y):
pos = np.in1d(x, y, assume_unique=True)
return x[pos]
x and y are 1d numpy integer arrays, and the elements of y are a subset of those in x, and neither array has repeats. The result is the elements of y, in the order they appear in x. The code gives the result I want. But the intermediate array pos is the same size as x and in many use cases y is much, much smaller than x. Is there a way I can more directly get the result without allocating the intermediate array pos so as to save some memory?
x is not sorted. In my case its elements are the ids of objects and are the value 0->len(x) but in an unspecified order, and it's sorted in order of a score assigned to each object. The purpose of suborder is to order subsets with that same score order.
x is around 10million elements; and I have many different values for y, some approaching the size of x, all the way down to just a handful of elements.
Edit: I get x from doing an argsort on a set of scores for objects. I had imagined that it would be better to sort once for all scores, and then use that ordering to impose an order on the subsets. It may actually be better to take scores[y], then argsort that and take the elements of y in that order (for each y).
CodePudding user response:
in1d starts with:
if len(ar2) < 10 * len(ar1) ** 0.145 or contains_object:
...
mask = np.zeros(len(ar1), dtype=bool)
for a in ar2:
mask |= (ar1 == a)
return mask
In other words it does an equality test for each element of y. If your size difference isn't that large, then it uses a different method, one based on concatenating the arrays and doing a argsort.
I can imagine doing using np.flatnonzero(ar1==a) to get the equivalent indices, and concatenating them. But that will preserve the y order.
CodePudding user response:
Solution 1
Since items are in range(0, len(x)) and are all unique (ie. permutation), then you can preallocate only one buffer of size len(x) (len(x)*4 bytes in RAM). The strategy is to first build reverse indices once just after sorting x:
idx = np.array(len(x), dtype=np.int32) # Can be reused after each sort of `x`
idx[x] = np.arange(len(x), dtype=np.int32) # Can be filled chunk-by-chunk in a loop if memory matters
Then, you need to filter the y array so all values are in range(0, len(x)). If this is already the case, then skip this step. The operation can be done using yFilt = y[np.logical_and(y >= 0, y < len(x))]. Since y can be quite big, you can do this operation chunk-by-chunk. A simpler, faster solution and more memory-efficient solution would be to filter y on the fly using Numba.
Then, you need to compute x[np.sorted(idx[yFilt])] to reorder items of y like in x. This can be done in-place using the following code:
# Should not allocate any temporary arrays
idx.take(yFilt, out=yFilt)
yFilt.sort()
x.take(yFilt, out=yFilt)
After that, yFilt is now ordered like the items in x. Note that you can mutate y so not to perform any temporary array allocations (although this means y is not used by something else in the code after this operation).
This reordering algorithm runs in O(Ny log Ny) time with Ny = len(y). The pre-computation runs in O(Nx) time with Nx = len(x). It requires 4 (Nx Ny) bytes space for the out-of-place implementation and 4 Nx bytes for the in-place version that perform no allocation to reorder y.
Solution 2
If the previous solution takes too much memory for you, this solution should be the good one despite being much more computationally intensive. It uses only O(8 Ny) bytes (O(4 Ny) for the in-place implementation) and runs in O(Nx log Ny) time. Note that the output array can be preallocated once (and only filled later) to avoid any issue with the GC/allocator.
The idea is to perform a binary search of each value of x in a sorted filtered version of y. Values are appended on the fly in an output array. This solution requires Numba or Cython to be fast (although a complex pure-Numpy implementation using chunks and np.searchsorted can be written).
import numba as nb
# `out` can be preallocated and passed in parameter to
# avoid allocations in hot loops
@nb.njit('int32[:](int32[:], int32[:])')
def orderLike(x, y):
sorted = np.sort(y) # Use y.sort() for an in-place implementation
out = np.empty(len(y), np.int32)
cur = 0
for v in x:
pos = np.searchsorted(sorted, v)
if pos < len(y) and sorted[pos] == v: # Found
out[cur] = v
cur = 1
return out[:cur]
