C equivalent of numpy.unique on std::vector with return_index and return

numpy has an implementation of the unique algorithm that returns :

the sorted unique elements of a numpy array (i.e. with no duplicates)

In addition, numpy.unique() can also return :

the indices of the input array that give the unique values
the indices of the unique array that reconstruct the input array

The C standard library also implements a unique algorithm (here), that somehow eliminates consecutive duplicates. Combined with std::vector<>::erase and std::sort(), this can return the sorted unique elements of the vector (output 1 of numpy.unique()).

My question is : is there any algorithm in the stl or elsewhere that can also return outputs 2 and 3 of numpy.unique(). If not, is there a way to efficiently implement it ?

CodePudding user response：

A std::map combined with a std::vector gives you all the information you want.

#include <map>
#include <vector>

template <typename Container>
auto unique_map( const Container & xs )
-> std::map <typename Container::value_type, std::vector <std::size_t> >
{
  decltype(unique_map(xs)) S;

  std::size_t n = 0;
  for (const auto & x : xs)
  {
    S[ x ].push_back( n   );
  }

  return S;
}

Now you can convert any container to a sorted std::map where:

each key is a unique element from the original container xs
each value is a std::vector listing indices into xs

If you wish to reconstruct xs from the mapping, you can do so easily enough:

#include <numeric>

template <typename UniqueMap>
auto rebuild_from_unique_map( const UniqueMap & um )
-> std::vector <typename UniqueMap::key_type>  // <std::size_t>
{
  decltype(rebuild_from_unique_map(um)) xs;
  if (um.size())
  {
    xs.resize( std::accumulate( um.begin(), um.end(), (std::size_t)0, 
        []( auto n, auto p ) { return n   p.second.size(); } ) );
    for (const auto & m : um)
      for (const auto & n : m.second)
        xs[n] = m.first;                       // index  
  }
  return xs;
}

If you really, really want that list of indices instead (return_inverse) you can get it by changing the commented lines as indicated and adding a std::size_t index = 0; before the nested loop.

Here’s a toy to play with it. Just concatenate the three code snippets in this post and compile. You’ll need Boost for the irange.

#include <algorithm>
#include <boost/range/irange.hpp>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <string>

int main( int argc, char ** argv )
{
  std::vector <int> xs( argc-1, 0 );
  std::transform( argv 1, argv argc, xs.begin(), []( auto s ) { return std::stoi( s ); } );

  auto width = std::setw( 3   (int)std::log10( *std::max_element( xs.begin(), xs.end() ) ) );

  std::cout << xs.size() << " integers";
  std::cout << "\nvalue:  "; for (auto x : xs)                         std::cout << " " << width << x;
  std::cout << "\nindex: ["; for (auto n : boost::irange( xs.size() )) std::cout << " " << width << n;
  std::cout << " ]\n";

  auto ys = unique_map( xs );

  std::cout << "\n" << ys.size() << " unique integers";
  std::cout << "\nsorted:"; for (auto y : ys) std::cout << " " << width << y.first;

  std::cout << "\n\n";
  for (const auto & y : ys)
  {
    std::cout << width << y.first << " appears " << y.second.size() << " times at [";
    for (const auto & n : y.second) std::cout << " " << n;
    std::cout << " ]\n";
  }
  
  std::cout << "\nrebuilt:"; for (auto x : rebuild_from_unique_map( ys )) std::cout << " " << width << x;
  std::cout << "\n";
}

Compile with something like:

cl /EHsc /W4 /Ox /std:c 17 /I C:\boost\include a.cpp
clang -Wall -Wextra -pedantic-errors -O3 -std=c 17 a.cpp
etc

The toy is not tolerant of bad input: make sure that you give it at least one argument and that all arguments are integers. You can change it to work over any arbitrary std::string if you wish: just adjust the first three statements in main() to fill xs with strings and compute width correctly.

Here’s me playing with it on my Windows box:

a 2 -1 2 4 -1 2 3 7 -1 2 0 5

12 integers
value:     2  -1   2   4  -1   2   3   7  -1   2   0   5
index: [   0   1   2   3   4   5   6   7   8   9  10  11 ]

7 unique integers
sorted:  -1   0   2   3   4   5   7

 -1 appears 3 times at [ 1 4 8 ]
  0 appears 1 times at [ 10 ]
  2 appears 4 times at [ 0 2 5 9 ]
  3 appears 1 times at [ 6 ]
  4 appears 1 times at [ 3 ]
  5 appears 1 times at [ 11 ]
  7 appears 1 times at [ 7 ]

rebuilt:   2  -1   2   4  -1   2   3   7  -1   2   0   5

a 1 1 1

3 integers
value:     1   1   1
index: [   0   1   2 ]

1 unique integers
sorted:   1

  1 appears 3 times at [ 0 1 2 ]

rebuilt:   1   1   1

a 3 1 2

3 integers
value:     3   1   2
index: [   0   1   2 ]

3 unique integers
sorted:   1   2   3

  1 appears 1 times at [ 1 ]
  2 appears 1 times at [ 2 ]
  3 appears 1 times at [ 0 ]

rebuilt:   3   1   2

Good question!

CodePudding user response：

Here is a version that just uses one or two temporary vectors. Overall complexity is O(n log(n)) for n input values.

#include <algorithm>
// using std::stable_sort, std::unique, std::unique_copy
#include <iterator>
// using std::back_inserter
#include <vector>
// using std::vector


/** Helper to implement argsort-like behavior */
template<class T>
struct SortPair
{
  T value;
  std::size_t index;

  SortPair(const T& value, std::size_t index)
    : value(value), index(index)
  {}
  SortPair(T&& value, std::size_t index)
    : value(std::move(value)), index(index)
  {}
  bool operator<(const SortPair& o) const
  { return value < o.value; }

  bool operator<(const T& o) const
  { return value < o; }

  friend bool operator<(const T& left, const SortPair& right)
  { return left < right.value; }

  bool operator==(const SortPair& o) const
  { return value == o.value; }

  friend void swap(SortPair& left, SortPair& right)
  {
      using std::swap;
      swap(left.value, right.value);
      swap(left.index, right.index);
  }
};
/**
 * Implements numpy.unique
 *
 * \tparam T scalar value type
 * \tparam Iterator input iterator for type T
 * \param first, last range of values
 * \param index if not null, returns the first indices of each unique value in
 *    the input range
 * \param inverse if not null, returns a mapping to reconstruct the input range
 *    from the output array. first[i] == returnvalue[inverse[i]]
 * \param count if not null, returns the number of times each value appears
 * \return sorted unique values from the input range
 */
template<class T, class Iterator>
std::vector<T> unique(Iterator first, Iterator last,
                      std::vector<std::size_t>* index=nullptr,
                      std::vector<std::size_t>* inverse=nullptr,
                      std::vector<std::size_t>* count=nullptr)
{
  std::vector<T> uvals;
  if(! (index || inverse || count)) { // simple case
    uvals.assign(first, last);
    using t_iter = typename std::vector<T>::iterator;
    const t_iter begin = uvals.begin(), end = uvals.end();
    std::sort(begin, end);
    uvals.erase(std::unique(begin, end), end);
    return uvals;
  }
  using sort_pair = SortPair<T>;
  using sort_pair_iter = typename std::vector<sort_pair>::iterator;
  std::vector<sort_pair> sorted;
  for(std::size_t i = 0; first != last;   i,   first)
    sorted.emplace_back(*first, i);
  if(sorted.empty()) { // early out. Helps with inverse computation
    if(index)
      index->clear();
    if(inverse)
      inverse->clear();
    if(count)
      count->clear();
    return uvals;
  }
  const sort_pair_iter sorted_begin = sorted.begin(), sorted_end = sorted.end();
  // stable_sort to keep first unique index
  std::stable_sort(sorted_begin, sorted_end);
  /*
   * Compute the unique values. If we still need the sorted original values,
   * this must be a copy. Otherwise we just reuse the sorted vector.
   * Note that the equality operator in SortPair only uses the value, not index
   */
  std::vector<sort_pair> upairs_copy;
  const std::vector<sort_pair>* upairs;
  if(inverse || count) {
    std::unique_copy(sorted_begin, sorted_end, std::back_inserter(upairs_copy));
    upairs = &upairs_copy;
  }
  else {
    sorted.erase(std::unique(sorted_begin, sorted_end), sorted_end);
    upairs = &sorted;
  }
  uvals.reserve(upairs->size());
  for(const sort_pair& i: *upairs)
    uvals.push_back(i.value);
  if(index) {
    index->clear();
    index->reserve(upairs->size());
    for(const sort_pair& i: *upairs)
      index->push_back(i.index);
  }
  if(inverse) {
    inverse->assign(sorted.size(), 0);
    // Since inverse->resize assigns index 0, we can skip the 0-th outer loop
    sort_pair_iter sorted_i =
      std::upper_bound(sorted_begin, sorted_end, uvals.front());
    for(std::size_t i = 1; i < uvals.size();   i) {
      const T& uval = uvals[i];
      // we know there is at least one value
      do
        (*inverse)[sorted_i->index] = i;
      while(  sorted_i != sorted_end && sorted_i->value == uval);
      /*
       * We could skip the second comparison by using something like
       * std::equal_range to determine the number of entries. We could even
       * do something like an exponential search forward but I don't see how
       * this is worth the complexity
       */
    }
  }
  if(count) {
    count->clear();
    count->reserve(uvals.size());
    sort_pair_iter range_start = sorted_begin;
    for(const T& uval: uvals) {
      sort_pair_iter range_end =
        std::find_if(std::next(range_start), sorted_end,
                     [&uval](const sort_pair& i) -> bool {
                       return i.value != uval;
                     });
      count->push_back(std::distance(range_start, range_end));
      range_start = range_end;
    }
    /*
     * Two inefficiencies:
     * 1. If we want both inverse and count, we could fold both into
     *    the same loop
     * 2. Again, we could use equal_range or a custom version based on an
     *    exponential search to reduce the number of comparisons.
     *    The reason we don't use equal_range is because it has worse
     *    performance in the worst case (uvals.size() == sorted.size()).
     *    We could make a heuristic and dispatch between both implementations
     */
  }
  return uvals;
}

The trick is to implement something resembling np.argsort. Everything else follows naturally. The inverse mapping is a bit tricky but not too hard when you do a pen-and-paper version of it first.

We could also use unordered_map. That would reduce the complexity to something like O(n) O(m log(m)) for n entries with m unique values. But that involves a ton of temporary allocations if the number of unique values not much smaller than n.