Count unique records in a numpy array

I have a numpy array containing the letters "a", "b" or "c":

import numpy as np

my_array = np.array(["a", "a", "c", "c", "a"]) # In this example "b" is not present

I want to build a function f that counts the occurrences of each letter in the array. For my example, f should return [3, 0, 2], meaning that "a" appears 3 times, "b" 0 times and "c" 2 times.

I'm looking for a solution (if possible) that uses numpy functions rather than explicit for loops over the array, perhaps some kind of group-by.

CodePudding user response：

Counter from the collections module of the standard library will do that for you.

import numpy as np 
my_array = np.array(["a", "a", "c", "c", "a"])
from collections import Counter
cnt = Counter(my_array)
cnt 
#  Counter({'a': 3, 'c': 2})

Note that it does not store counts for items which did not appear. If you ask for such an item, the counter returns 0.

>>> cnt['b']
0

If you wrap that in a function and already have a list of keys (not all of which may be present in your array data), a plain Counter will not contain entries for the missing keys. If you want those keys to be present with a count of 0, something like this:

import numpy as np
from collections import Counter
from typing import Dict, Any


def counter_function(data, keys) -> Dict[Any, int]:
    cnt = Counter(data)
    for key in keys:
        # Reading a missing key returns 0; assigning it back stores the key explicitly.
        cnt[key] = cnt[key]
    return cnt

my_array = np.array(["a", "a", "c", "c", "a"])
so_counter = counter_function(my_array, ["a", "b", "c"])
so_counter
# Counter({'a': 3, 'c': 2, 'b': 0})

will do it for you.
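If you need the counts as a plain list in a fixed letter order, as in the question's [3, 0, 2], a minimal sketch could look like the following. The helper name counts_in_order is made up for illustration, and it assumes the list of keys is known in advance.

```python
from collections import Counter

import numpy as np


def counts_in_order(data, keys):
    """Return the count of each key in `keys`, in that order (0 if absent)."""
    # tolist() converts numpy strings to plain Python str before counting.
    cnt = Counter(data.tolist())
    # Looking up a missing key on a Counter returns 0 without raising.
    return [cnt[key] for key in keys]


my_array = np.array(["a", "a", "c", "c", "a"])
result = counts_in_order(my_array, ["a", "b", "c"])
print(result)  # [3, 0, 2]
```

This still iterates in Python under the hood, so it is mainly convenient for short arrays.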

CodePudding user response：

You can also use np.unique with return_counts=True, and just convert it to a dict with dict zip:

dct = dict(zip(*np.unique(my_array, return_counts=True)))

Output:

>>> dct
{'a': 3, 'c': 2}

For smaller arrays, Lucas's answer is faster, but for large arrays, numpy is much more efficient.
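Like Counter, np.unique only reports letters that actually occur. A rough sketch of filling in the zeros for a known list of letters is shown here. The helper name counts_with_zeros is hypothetical, and the short loop runs over the keys, not over the array.

```python
import numpy as np


def counts_with_zeros(data, keys):
    """Count each key in `data` via np.unique, reporting 0 for absent keys."""
    values, counts = np.unique(data, return_counts=True)
    # Map each value that occurs to its count, using plain Python types.
    lookup = dict(zip(values.tolist(), counts.tolist()))
    # Keys that never occurred are missing from lookup, so default to 0.
    return [lookup.get(key, 0) for key in keys]


my_array = np.array(["a", "a", "c", "c", "a"])
print(counts_with_zeros(my_array, ["a", "b", "c"]))  # [3, 0, 2]
```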

CodePudding user response：

If my_array has a typical length of about 10 or more, it can be worthwhile to convert your array to the integers [0, 1, 2] and then apply bincount().

Here's an example with your my_array:

In [31]: my_array = np.array(["a", "a", "c", "c", "a"])

In [32]: b = my_array.view(np.int32) - ord('a')

In [33]: b
Out[33]: array([0, 0, 2, 2, 0], dtype=int32)

In [34]: np.bincount(b, minlength=3)
Out[34]: array([3, 0, 2])
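The view trick relies on the array having dtype '<U1' (one 4-byte Unicode code point per element, in native byte order) and on every letter being at or after 'a'. A hedged sketch of wrapping it in a function that checks the first assumption is shown below. The name letter_counts and its parameters are illustrative, not part of numpy.

```python
import numpy as np


def letter_counts(data, first="a", n_letters=3):
    """Count single-character strings with bincount; `first` maps to index 0."""
    data = np.asarray(data)
    # The int32 view is only meaningful for one-character native unicode strings.
    if data.dtype != np.dtype("U1"):
        raise TypeError(f"expected dtype U1, got {data.dtype}")
    # Reinterpret each character as its code point and shift so `first` is 0.
    codes = data.view(np.int32) - ord(first)
    # minlength pads the result so letters that never occur are reported as 0.
    return np.bincount(codes, minlength=n_letters)


my_array = np.array(["a", "a", "c", "c", "a"])
print(letter_counts(my_array))  # [3 0 2]
```

Characters before `first` would produce negative codes, which np.bincount rejects with a ValueError.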

Here's a timing comparison of that method and collections.Counter using an input with length 100:

In [34]: rng = np.random.default_rng()

In [35]: a = rng.choice(['a', 'a', 'b', 'c'], size=100)

In [36]: %timeit Counter(a)
32.1 µs ± 723 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [37]: %timeit b = a.view(np.int32) - ord('a'); np.bincount(b, minlength=3)
3.86 µs ± 50.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The approach with bincount() is much faster.

It is also faster than using np.unique() with the parameter return_counts=True:

In [41]: %timeit values, counts = np.unique(a, return_counts=True)
19.7 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)