I have a NumPy array containing the letters "a", "b" or "c":
import numpy as np
my_array = np.array(["a", "a", "c", "c", "a"]) # In this example "b" is not present
I want to build a function f that counts how many times each letter appears in the array. For my example, f should return [3, 0, 2], meaning that "a" appeared 3 times, "b" 0 times and "c" 2 times.
I'm looking for a solution (if possible) that uses NumPy functions rather than explicit for loops over the array. Maybe some kind of group by?
CodePudding user response:
Counter from the standard library's collections module will do that for you.
import numpy as np
my_array = np.array(["a", "a", "c", "c", "a"])
from collections import Counter
cnt = Counter(my_array)
cnt
# Counter({'a': 3, 'c': 2})
Note that it does not store counts for items which did not appear. If you look one of them up, the counter returns 0 without adding the key.
>>> cnt['b']
0
If you wrap that in a function and already have a list of keys (not all of which may be present in your array data), the Counter alone will not contain the missing keys with a count of 0. If you want both the keys and their zero counts to be populated, something like this:
import numpy as np
from collections import Counter
from typing import Dict, Any
def counter_function(data, keys) -> Dict[Any, int]:
    cnt = Counter(data)
    for key in keys:
        # Reading a missing key yields 0; assigning it back stores the key explicitly.
        cnt[key] = cnt[key]
    return cnt
my_array = np.array(["a", "a", "c", "c", "a"])
so_counter = counter_function(my_array, ["a", "b", "c"])
so_counter
# Counter({'a': 3, 'c': 2, 'b': 0})
will do it for you.
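If you need the plain list [3, 0, 2] from the question rather than a Counter, a minimal sketch could look like the following. The name f comes from the question; the letters parameter and its default are assumptions added for illustration.

```python
from collections import Counter

import numpy as np


def f(arr, letters=("a", "b", "c")):
    """Return the count of each letter, in the order given by `letters`."""
    # tolist() converts the NumPy strings to plain Python str objects.
    cnt = Counter(arr.tolist())
    # Looking up a missing letter returns 0, so absent letters are covered.
    return [cnt[letter] for letter in letters]


my_array = np.array(["a", "a", "c", "c", "a"])
result = f(my_array)
print(result)
```

This still iterates over the short list of letters, but not over the array itself in your own code.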
CodePudding user response:
You can also use np.unique with return_counts=True and convert the result to a dict with dict(zip(...)):
dct = dict(zip(*np.unique(my_array, return_counts=True)))
Output:
>>> dct
{'a': 3, 'c': 2}
For smaller arrays, Lucas's answer is faster, but for large arrays, numpy is much more efficient.
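Like Counter, np.unique only reports letters that actually occur, so "b" is missing from the dict above. A small sketch of one way to fill in the zeros is shown here; the function name count_with_unique and the letters parameter are hypothetical, not part of NumPy.

```python
import numpy as np


def count_with_unique(arr, letters=("a", "b", "c")):
    """Count each letter via np.unique, reporting 0 for absent letters."""
    values, counts = np.unique(arr, return_counts=True)
    # Map each letter that occurs to its count.
    lookup = dict(zip(values, counts))
    # Letters that never occurred fall back to 0.
    return [int(lookup.get(letter, 0)) for letter in letters]


my_array = np.array(["a", "a", "c", "c", "a"])
counts_list = count_with_unique(my_array)
print(counts_list)
```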
CodePudding user response:
If my_array has a typical length of about 10 or more, it can be worthwhile to convert your array to the integers [0, 1, 2] and then apply bincount().
Here's an example with your my_array:
In [31]: my_array = np.array(["a", "a", "c", "c", "a"])
In [32]: b = my_array.view(np.int32) - ord('a')
In [33]: b
Out[33]: array([0, 0, 2, 2, 0], dtype=int32)
In [34]: np.bincount(b, minlength=3)
Out[34]: array([3, 0, 2])
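The view trick relies on every element being a single character, so that each element occupies exactly 4 bytes (dtype U1) and can be reinterpreted as one int32 code point. A hedged sketch of a wrapper that checks this assumption first is shown below; the function name count_abc is hypothetical. Note that a letter before "a" would make bincount raise, and a letter after "c" would lengthen the result.

```python
import numpy as np


def count_abc(arr):
    """Count "a", "b" and "c" in an array of single-character strings."""
    arr = np.asarray(arr)
    # The reinterpretation below is only valid for one-character unicode strings.
    if arr.dtype != np.dtype("U1"):
        raise ValueError("expected an array of single-character strings (dtype U1)")
    # Reinterpret each character as its code point and shift so that "a" -> 0.
    codes = arr.view(np.int32) - ord("a")
    # minlength=3 guarantees slots for "a", "b" and "c" even if some are absent.
    return np.bincount(codes, minlength=3)


my_array = np.array(["a", "a", "c", "c", "a"])
abc_counts = count_abc(my_array)
print(abc_counts)
```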
Here's a timing comparison of that method and collections.Counter using an input with length 100:
In [34]: rng = np.random.default_rng()
In [35]: a = rng.choice(['a', 'a', 'b', 'c'], size=100)
In [36]: %timeit Counter(a)
32.1 µs ± 723 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [37]: %timeit b = a.view(np.int32) - ord('a'); np.bincount(b, minlength=3)
3.86 µs ± 50.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The approach with bincount() is much faster.
It is also faster than using np.unique() with the parameter return_counts=True:
In [41]: %timeit values, counts = np.unique(a, return_counts=True)
19.7 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
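To repeat this comparison outside IPython, something like the following sketch with the standard timeit module should work. The helper names and the number of repetitions are arbitrary choices for illustration, and the timings will differ from the figures above depending on your machine and NumPy version.

```python
import timeit
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
a = rng.choice(["a", "a", "b", "c"], size=100)


def with_counter():
    return Counter(a)


def with_unique():
    return np.unique(a, return_counts=True)


def with_bincount():
    return np.bincount(a.view(np.int32) - ord("a"), minlength=3)


n_calls = 1000  # arbitrary repetition count for this sketch
for fn in (with_counter, with_unique, with_bincount):
    seconds = timeit.timeit(fn, number=n_calls)
    print(f"{fn.__name__}: {seconds / n_calls * 1e6:.2f} us per call")
```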
