NumPy maxima of groups defined by a label array-CodePudding

I have two arrays, one is a list of values and one is a list of IDs corresponding to each value. Some IDs have multiple values. I want to create a new array that contains the maximum value recorded for each id, which will have a length equal to the number of unique ids.

Example using a for loop:

import numpy as np

values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])

uniq_ids = np.unique(ids)
maximums = np.ones_like(uniq_ids) * np.nan

for i, id in enumerate(uniq_ids):
    maximums[i] = np.max(values[np.where(ids == id)])

print(uniq_ids)
print(maximums)

[0 1 3 5 6]
[5. 3. 6. 8. 8.]

Is it possible to vectorize this so it runs fast? I'm imagining a one-liner that can create the "maximums" array using only NumPy functions, but I haven't been able to come up with anything that works.

CodePudding user response：

Use np.lexsort to sort both lists simultaneously:

idx = np.lexsort([values, ids])

The indices of the last unique ID are given by

last = np.r_[np.flatnonzero(np.diff(ids[idx])), len(ids) - 1]

You can use this to get the maxima of each group:

values[idx[last]]

This is the same as values[idx][last], but faster because you only need to extract len(last) elements this way, instead of rearranging the whole array and then extracting.

Keep in mind that np.unique is basically doing sort and flatnonzero when you do return_index=True.

Here's a collection of all the solutions so far, along with a benchmark:

def Shannon(values, ids):
    uniq_ids = np.unique(ids)
    maxima = np.ones_like(uniq_ids) * np.nan
    for i, id in enumerate(uniq_ids):
        maxima[i] = np.max(values[np.where(ids == id)])
    return maxima

def richardec(values, ids):
    return [a.max() for a in np.split(values, np.arange(1, ids.shape[0])[(np.diff(ids) != 0)])]

def MadPhysicist(values, ids):
    idx = np.lexsort([values, ids])
    return values[idx[np.r_[np.flatnonzero(np.diff(ids[idx])), len(ids) - 1]]]

def PeptideWitch(values, ids):
    return np.vectorize(lambda id: np.max(values[np.where(ids == id)]))(np.unique(ids))

values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])

%timeit Shannon(values, ids)
42.1 µs ± 425 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit richardec(values, ids)
27.7 µs ± 729 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit MadPhysicist(values, ids)
18.9 µs ± 90.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit PeptideWitch(values, ids)
54.9 µs ± 569 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

values = np.random.randint(10000, size=10000)
ids = np.random.randint(100, size=10000)

%timeit Shannon(values, ids)
1.76 ms ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit richardec(values, ids)
30.6 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit MadPhysicist(values, ids)
880 µs ± 5.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit PeptideWitch(values, ids)
1.75 ms ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So it seems that lexsorting is the fastest approach all around. For large arrays (which are the only ones that matter), it seems that finding the maxima individually is comparable to the question, while richardec's comprehension is much slower, despite the fact that it does not sort the input values according to the IDs.

CodePudding user response：

Here's a solution, which, although not 100% vectorized, (per my benchmarks) takes about half the time as your does (using your sample data). The performance improvement probably becomes more drastic with more data:

maximums = [a.max() for a in np.split(values, np.arange(1, ids.shape[0])[(np.diff(ids) != 0)])]

Output:

>>> maximums
[5, 3, 6, 8, 8]

CodePudding user response：

In trying to visualize the problem:

In [82]: [np.where(ids==id) for id in uniq_ids]
Out[82]: 
[(array([0]),),
 (array([1]),),
 (array([2, 3, 4, 5]),),
 (array([6]),),
 (array([7, 8, 9]),)]

unique can also return:

In [83]: np.unique(ids, return_inverse=True)
Out[83]: (array([0, 1, 3, 5, 6]), array([0, 1, 2, 2, 2, 2, 3, 4, 4, 4]))

Which is a variante on what richardec produced:

In [88]: [a for a in np.split(ids, np.arange(1, ids.shape[0])[(np.diff(ids) !=
    ...: 0)])]
Out[88]: [array([0]), array([1]), array([3, 3, 3, 3]), array([5]), array([6, 6, 6])]

That inverse is also produced by doing where on all == at once:

In [90]: ids[:,None] == uniq_ids
Out[90]: 
array([[ True, False, False, False, False],
       [False,  True, False, False, False],
       [False, False,  True, False, False],
       [False, False,  True, False, False],
       [False, False,  True, False, False],
       [False, False,  True, False, False],
       [False, False, False,  True, False],
       [False, False, False, False,  True],
       [False, False, False, False,  True],
       [False, False, False, False,  True]])
In [91]: np.nonzero(ids[:,None] == uniq_ids)
Out[91]: (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 2, 2, 2, 3, 4, 4, 4]))

I'm still thinking through this ...

CodePudding user response：

EDIT: I'll leave this up as an example of why we can't always use np.vectorize() to make everything magically faster:

One solution is to use numpy's vectorize function:

import numpy as np

values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])

def my_func(id):
    return np.max(values[np.where(ids==id)])

vector_func = np.vectorize(my_func)
maximums = vector_func(np.unique(ids))

which returns

array([5, 3, 6, 8, 8])

But as for speed, your version has about the same performance when we use

values = np.array([random.randint(1, 100) for i in range(1000000)])
ids = []
for i in range(100000):
    r = random.randint(1, 4)
    if r == 3:
        for x in range(3):
            ids.append(i)
    elif r == 2:
        for x in range(4):
            ids.append(i)
    else:
        ids.append(i)
ids = np.array(ids)

It's about 12 seconds per execution.

CodePudding user response：

With pandas:

import pandas as pd

df = pd.DataFrame({'ids': ids, 'vals': values})
df.groupby('ids')['vals'].max().to_numpy()