Faster method for searchsorted when both arrays are sorted-CodePudding

numpy.searchsorted(a, v, ...) basically finds the indices of the first elements in a sorted array a that are larger than or equal to the elements in v. I assume that this is faster than a method that does not take advantage of the fact that a is sorted.

Now what if v is also sorted? Certainly it should be possible to exploit this knowledge to make the algorithm faster? (Perhaps I'm wrong and the design of searchsorted makes this irrelevant; if so please excuse my question.)

So the question is: What would be the fastest way to do the same as searchsorted(a, v) but for both a and v sorted?

CodePudding user response：

Off the top of my head, I would assume that you could solve the problem thusly:

def searchsorted_sorted(x, a):
    y = np.empty_like(a)

    l = 0
    for (i, s) in enumerate(a):
        pos = l   np.searchsorted(x[l:], s)
        y[i] = pos
        l = y[i]

    return y

So, you could use the knowledge of v being ordered to restrict the search space for subsequent searches.

Since numpy likely uses binary search, the difference should be more pronounced when the values in v are cluttered towards the end of a. If you have a large number M of entries and only search for the indices of the last n, the second to n-th search restrict the length to be searched from M to M - n. Thus, rather than taking O(n*log(M)) you would only need O(n*log(n)), which in theory makes a difference. Caveats:

Maybe numpy already does something similar?
Since numpy is implemented in C, it is not a fair comparison with this function (partly) written in python.
This implementation is rather cache-unfriendly, further reducing its usefulness.

CodePudding user response：

When a and v have approximately the same size, the best algorithm consist in merging the two arrays (see: merge algorithm) (note you only need the indices and not the output merged array). This can be done in linear time (ie. O(n M)) instead of a quasi-linear O(n log M) time.

Otherwise the problem can be solved in O(n log log M) time instead of O(n log n) time. This is only interesting if a is huge since log M is already quite small and one should be careful about hidden constants in algorithmic complexities. The algorithm is not trivial to implement. Here is the idea:

Pick the middle value v[k] of v and split virtually v in two part (left and right ones)
Perform a binary search of v[k] in a so to get its location p
Now we can be sure that the items in the left part of v are located in a[:p 1] and the right part are in a[p:] (one side can be excluded in case of equality but let put this aside for sake of simplicity)
Recursively apply the operation by calling search(a[:p 1], v[:k]) and search(a[p:], v[k 1:])

Note that for the second algorithm, if v contains i multiple same values, then you can copy the location found i times and find a more strict lower/upper bound in a so to avoid degenerated case possibly resulting in a worse complexity.

Because of the CPython loop overhead and the one of Numpy function calls, this implementation needs to be implemented in native languages like C/C or in Cython or possibly using JITs like Numba.