Fastest way to find all indexes of matching values between two 1D arrays (with duplicates)-CodePudding

Question description

Lets say we have two simple arrays:

query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])

I want to find the indexes of all matching values between the query and match. So in this case the result would be:

value   query   match
100        0    1
100        0    3
100        5    1
100        5    3
4000       1    2

In reality these arrays will contain millions of items

"Stupid" loop solution

qs = []
query_locs = []
match_locs = []

for i in np.arange(query.size):
    q = query[i]
    # Get matching indexes in "match"
    match_loc = np.where(match == q)[0]
    n = match_loc.size
    # Update location arrays
    match_locs.extend(match_loc)
    query_locs.extend(np.repeat(i,n))
    # Store the matching value
    qs.extend(np.repeat(q,n))

result = np.vstack((qs, query_locs, match_locs)).T
print(result)
 [[ 100    0    1]
 [ 100    0    3]
 [4000    1    2]
 [ 100    5    1]
 [ 100    5    3]]

(Maybe numba could make this loop pretty fast however when I tried this I got some errors about the signatures, so not sure about that)

Numpy buildins

There are quite some buildin numpy function to solve this problem for unique values, like using searchsorted, intersect1d, however, as also described in the doc, they "Return the sorted, unique values" and thus do not take duplicates into account. Some examples on StackOverflow for this problem with unique values:

I could imagine there would be a faster way to do this with numpy instead of a loop, so curious to see an answer!

CodePudding user response：

You could hack around it with newaxis:

>>> comparison = np.equal(query[:, np.newaxis], match[np.newaxis, :])
array([[False,  True, False,  True, False, False, False],
       [False, False,  True, False, False, False, False],
       [False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False],
       [False,  True, False,  True, False, False, False]])

which essentially creates the cartesian product (query x matches) (note the memory cost) and then applies the binary function np.equal to cast each element in the product space to a bool efficiently. The output can be interpreted by reading it row-wise as: query element i is equal to match element j whenever comparison[i, j] is True. You can gather the indices of all True pairs with:

list(zip(*comparison.nonzero()))
[(0, 1), (0, 3), (1, 2), (5, 1), (5, 3)]

ps: If the arrays are too long to construct the product, iterating over them element-wise is your only option.

CodePudding user response：

You may transform 1d-arrays to dataframes and make a join, like this:

query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])
dfquery = pd.DataFrame(range(len(query)), index=query, columns=['query'])
dfmatch = pd.DataFrame(range(len(match)), index=match, columns=['match'])
dfquery.join(dfmatch, how='inner')

Result:

    query   match
100     0       1
100     0       3
100     5       1
100     5       3
4000    1       2