Question description
Lets say we have two simple arrays:
query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])
I want to find the indexes of all matching values between the query and match. So in this case the result would be:
value query match
100 0 1
100 0 3
100 5 1
100 5 3
4000 1 2
In reality these arrays will contain millions of items
"Stupid" loop solution
qs = []
query_locs = []
match_locs = []
for i in np.arange(query.size):
q = query[i]
# Get matching indexes in "match"
match_loc = np.where(match == q)[0]
n = match_loc.size
# Update location arrays
match_locs.extend(match_loc)
query_locs.extend(np.repeat(i,n))
# Store the matching value
qs.extend(np.repeat(q,n))
result = np.vstack((qs, query_locs, match_locs)).T
print(result)
[[ 100 0 1]
[ 100 0 3]
[4000 1 2]
[ 100 5 1]
[ 100 5 3]]
(Maybe numba could make this loop pretty fast however when I tried this I got some errors about the signatures, so not sure about that)
Numpy buildins
There are quite some buildin numpy function to solve this problem for unique values, like using searchsorted, intersect1d, however, as also described in the doc, they "Return the sorted, unique values" and thus do not take duplicates into account. Some examples on StackOverflow for this problem with unique values:
- NumPy: Comparing Elements in Two Arrays
- Efficient way to compute intersecting values between two numpy arrays
I could imagine there would be a faster way to do this with numpy instead of a loop, so curious to see an answer!
CodePudding user response:
You could hack around it with newaxis:
>>> comparison = np.equal(query[:, np.newaxis], match[np.newaxis, :])
array([[False, True, False, True, False, False, False],
[False, False, True, False, False, False, False],
[False, False, False, False, False, False, False],
[False, False, False, False, False, False, False],
[False, False, False, False, False, False, False],
[False, True, False, True, False, False, False]])
which essentially creates the cartesian product (query x matches) (note the memory cost) and then applies the binary function np.equal to cast each element in the product space to a bool efficiently.
The output can be interpreted by reading it row-wise as: query element i is equal to match element j whenever comparison[i, j] is True.
You can gather the indices of all True pairs with:
list(zip(*comparison.nonzero()))
[(0, 1), (0, 3), (1, 2), (5, 1), (5, 3)]
ps: If the arrays are too long to construct the product, iterating over them element-wise is your only option.
CodePudding user response:
You may transform 1d-arrays to dataframes and make a join, like this:
query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])
dfquery = pd.DataFrame(range(len(query)), index=query, columns=['query'])
dfmatch = pd.DataFrame(range(len(match)), index=match, columns=['match'])
dfquery.join(dfmatch, how='inner')
Result:
query match
100 0 1
100 0 3
100 5 1
100 5 3
4000 1 2
