Interpretation of counts for `numpy.unique` when applied on a matrix

numpy.unique has an optional argument return_counts. From the docs:

return_counts : bool, optional
    If True, also return the number of times each unique item appears in ar.

New in version 1.9.0.

This is straightforward for a 1-D array. However, I'm trying to get the unique values and counts for each row of a matrix. Here is a sample matrix:

m_sample = np.array([
    [1, 2, 1],
    [2, 2, 2],
    [3, 3, 3],
    [1, 4, 5],
])

When I apply np.unique:

np.unique(m_sample, axis=1, return_counts=True)

(array([[1, 1, 2],
        [2, 2, 2],
        [3, 3, 3],
        [1, 5, 4]]),  array([1, 1, 1]))

I'm not really sure what the returned matrix here represents, much less the counts array. Is this perhaps a bug in numpy (or maybe a case the developer did not consider)? Am I misunderstanding how to use the parameters in this case?

CodePudding user response:

When you specify an axis, np.unique returns the unique subarrays indexed along that axis. To see this better, assume that one of the rows repeats:

m_sample = np.array([
    [1, 2, 1],
    [2, 2, 2],
    [3, 3, 3],
    [1, 4, 5],
    [1, 2, 1]
])

In that case, np.unique(m_sample, axis=0, return_counts=True) gives:

(array([[1, 2, 1],
        [1, 4, 5],
        [2, 2, 2],
        [3, 3, 3]]),
 array([2, 1, 1, 1]))

The first element of this tuple lists the unique rows of the array (in sorted order), and the second gives how many times each of those rows appears in the array. In this example, the row [1, 2, 1] appears twice.
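To relate this back to the call in the question: with axis=1 the subarrays being compared are columns rather than rows. A minimal sketch, assuming the original four-row m_sample from the question; the names cols, counts and via_rows are only illustrative:

```python
import numpy as np

m_sample = np.array([
    [1, 2, 1],
    [2, 2, 2],
    [3, 3, 3],
    [1, 4, 5],
])

# axis=1: each column is treated as one item.
cols, counts = np.unique(m_sample, axis=1, return_counts=True)

# The same unique columns can be obtained by transposing,
# taking the unique rows, and transposing back.
via_rows = np.unique(m_sample.T, axis=0).T

print(np.array_equal(cols, via_rows))  # True
print(counts)                          # [1 1 1]
```

All three columns are different, so every count is 1, and the columns come back in sorted order, which is why the returned matrix looks like a reshuffled copy of the input.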

To get unique values in each row you can try, for example, the following:

import numpy as np

m_sample = np.array([
    [1, 2, 1],
    [2, 2, 2],
    [3, 3, 3],
    [1, 4, 5]
])

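# Sort each row so that equal values end up next to each other.
s = np.sort(m_sample, axis=1)
# Mark the first occurrence of each value within its row.
mask = np.full(m_sample.shape, True)
mask[:, 1:] = s[:, :-1] != s[:, 1:]
# Split the selected values at the row boundaries; the trailing empty piece is dropped.
np.split(s[mask], np.cumsum(mask.sum(axis=1)))[:-1]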

It gives:

[array([1, 2]), array([2]), array([3]), array([1, 4, 5])]
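If the per-row counts are needed as well, as the question asks, one simple option is to call np.unique on each row separately. This is a sketch rather than a vectorized solution, and the name per_row is only illustrative:

```python
import numpy as np

m_sample = np.array([
    [1, 2, 1],
    [2, 2, 2],
    [3, 3, 3],
    [1, 4, 5],
])

# One (values, counts) pair per row. Rows can have different numbers of
# unique values, so the result is a list rather than a 2-D array.
per_row = [np.unique(row, return_counts=True) for row in m_sample]

for values, counts in per_row:
    print(values, counts)
# [1 2] [2 1]
# [2] [3]
# [3] [3]
# [1 4 5] [1 1 1]
```

The Python-level loop is usually fine for small matrices; for large ones, the sort-and-mask approach above avoids iterating over rows.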