numpy has an implementation of the unique algorithm that returns :
- the sorted unique elements of a numpy array (i.e. with no duplicates)
In addition, numpy.unique() can also return :
- the indices of the input array that give the unique values
- the indices of the unique array that reconstruct the input array
The C standard library also implements a unique algorithm (here), that somehow eliminates consecutive duplicates. Combined with std::vector<>::erase and std::sort(), this can return the sorted unique elements of the vector (output 1 of numpy.unique()).
My question is : is there any algorithm in the stl or elsewhere that can also return outputs 2 and 3 of numpy.unique(). If not, is there a way to efficiently implement it ?
CodePudding user response:
A std::map combined with a std::vector gives you all the information you want.
#include <map>
#include <vector>
template <typename Container>
auto unique_map( const Container & xs )
-> std::map <typename Container::value_type, std::vector <std::size_t> >
{
decltype(unique_map(xs)) S;
std::size_t n = 0;
for (const auto & x : xs)
{
S[ x ].push_back( n );
}
return S;
}
Now you can convert any container to a sorted std::map where:
- each key is a unique element from the original container
xs - each value is a
std::vectorlisting indices intoxs
If you wish to reconstruct xs from the mapping, you can do so easily enough:
#include <numeric>
template <typename UniqueMap>
auto rebuild_from_unique_map( const UniqueMap & um )
-> std::vector <typename UniqueMap::key_type> // <std::size_t>
{
decltype(rebuild_from_unique_map(um)) xs;
if (um.size())
{
xs.resize( std::accumulate( um.begin(), um.end(), (std::size_t)0,
[]( auto n, auto p ) { return n p.second.size(); } ) );
for (const auto & m : um)
for (const auto & n : m.second)
xs[n] = m.first; // index
}
return xs;
}
If you really, really want that list of indices instead (return_inverse) you can get it by changing the commented lines as indicated and adding a std::size_t index = 0; before the nested loop.
Here’s a toy to play with it. Just concatenate the three code snippets in this post and compile. You’ll need Boost for the irange.
#include <algorithm>
#include <boost/range/irange.hpp>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <string>
int main( int argc, char ** argv )
{
std::vector <int> xs( argc-1, 0 );
std::transform( argv 1, argv argc, xs.begin(), []( auto s ) { return std::stoi( s ); } );
auto width = std::setw( 3 (int)std::log10( *std::max_element( xs.begin(), xs.end() ) ) );
std::cout << xs.size() << " integers";
std::cout << "\nvalue: "; for (auto x : xs) std::cout << " " << width << x;
std::cout << "\nindex: ["; for (auto n : boost::irange( xs.size() )) std::cout << " " << width << n;
std::cout << " ]\n";
auto ys = unique_map( xs );
std::cout << "\n" << ys.size() << " unique integers";
std::cout << "\nsorted:"; for (auto y : ys) std::cout << " " << width << y.first;
std::cout << "\n\n";
for (const auto & y : ys)
{
std::cout << width << y.first << " appears " << y.second.size() << " times at [";
for (const auto & n : y.second) std::cout << " " << n;
std::cout << " ]\n";
}
std::cout << "\nrebuilt:"; for (auto x : rebuild_from_unique_map( ys )) std::cout << " " << width << x;
std::cout << "\n";
}
Compile with something like:
cl /EHsc /W4 /Ox /std:c 17 /I C:\boost\include a.cppclang -Wall -Wextra -pedantic-errors -O3 -std=c 17 a.cpp- etc
The toy is not tolerant of bad input: make sure that you give it at least one argument and that all arguments are integers. You can change it to work over any arbitrary std::string if you wish: just adjust the first three statements in main() to fill xs with strings and compute width correctly.
Here’s me playing with it on my Windows box:
a 2 -1 2 4 -1 2 3 7 -1 2 0 5
12 integers
value: 2 -1 2 4 -1 2 3 7 -1 2 0 5
index: [ 0 1 2 3 4 5 6 7 8 9 10 11 ]
7 unique integers
sorted: -1 0 2 3 4 5 7
-1 appears 3 times at [ 1 4 8 ]
0 appears 1 times at [ 10 ]
2 appears 4 times at [ 0 2 5 9 ]
3 appears 1 times at [ 6 ]
4 appears 1 times at [ 3 ]
5 appears 1 times at [ 11 ]
7 appears 1 times at [ 7 ]
rebuilt: 2 -1 2 4 -1 2 3 7 -1 2 0 5
a 1 1 1
3 integers
value: 1 1 1
index: [ 0 1 2 ]
1 unique integers
sorted: 1
1 appears 3 times at [ 0 1 2 ]
rebuilt: 1 1 1
a 3 1 2
3 integers
value: 3 1 2
index: [ 0 1 2 ]
3 unique integers
sorted: 1 2 3
1 appears 1 times at [ 1 ]
2 appears 1 times at [ 2 ]
3 appears 1 times at [ 0 ]
rebuilt: 3 1 2
Good question!
CodePudding user response:
Here is a version that just uses one or two temporary vectors. Overall complexity is O(n log(n)) for n input values.
#include <algorithm>
// using std::stable_sort, std::unique, std::unique_copy
#include <iterator>
// using std::back_inserter
#include <vector>
// using std::vector
/** Helper to implement argsort-like behavior */
template<class T>
struct SortPair
{
T value;
std::size_t index;
SortPair(const T& value, std::size_t index)
: value(value), index(index)
{}
SortPair(T&& value, std::size_t index)
: value(std::move(value)), index(index)
{}
bool operator<(const SortPair& o) const
{ return value < o.value; }
bool operator<(const T& o) const
{ return value < o; }
friend bool operator<(const T& left, const SortPair& right)
{ return left < right.value; }
bool operator==(const SortPair& o) const
{ return value == o.value; }
friend void swap(SortPair& left, SortPair& right)
{
using std::swap;
swap(left.value, right.value);
swap(left.index, right.index);
}
};
/**
* Implements numpy.unique
*
* \tparam T scalar value type
* \tparam Iterator input iterator for type T
* \param first, last range of values
* \param index if not null, returns the first indices of each unique value in
* the input range
* \param inverse if not null, returns a mapping to reconstruct the input range
* from the output array. first[i] == returnvalue[inverse[i]]
* \param count if not null, returns the number of times each value appears
* \return sorted unique values from the input range
*/
template<class T, class Iterator>
std::vector<T> unique(Iterator first, Iterator last,
std::vector<std::size_t>* index=nullptr,
std::vector<std::size_t>* inverse=nullptr,
std::vector<std::size_t>* count=nullptr)
{
std::vector<T> uvals;
if(! (index || inverse || count)) { // simple case
uvals.assign(first, last);
using t_iter = typename std::vector<T>::iterator;
const t_iter begin = uvals.begin(), end = uvals.end();
std::sort(begin, end);
uvals.erase(std::unique(begin, end), end);
return uvals;
}
using sort_pair = SortPair<T>;
using sort_pair_iter = typename std::vector<sort_pair>::iterator;
std::vector<sort_pair> sorted;
for(std::size_t i = 0; first != last; i, first)
sorted.emplace_back(*first, i);
if(sorted.empty()) { // early out. Helps with inverse computation
if(index)
index->clear();
if(inverse)
inverse->clear();
if(count)
count->clear();
return uvals;
}
const sort_pair_iter sorted_begin = sorted.begin(), sorted_end = sorted.end();
// stable_sort to keep first unique index
std::stable_sort(sorted_begin, sorted_end);
/*
* Compute the unique values. If we still need the sorted original values,
* this must be a copy. Otherwise we just reuse the sorted vector.
* Note that the equality operator in SortPair only uses the value, not index
*/
std::vector<sort_pair> upairs_copy;
const std::vector<sort_pair>* upairs;
if(inverse || count) {
std::unique_copy(sorted_begin, sorted_end, std::back_inserter(upairs_copy));
upairs = &upairs_copy;
}
else {
sorted.erase(std::unique(sorted_begin, sorted_end), sorted_end);
upairs = &sorted;
}
uvals.reserve(upairs->size());
for(const sort_pair& i: *upairs)
uvals.push_back(i.value);
if(index) {
index->clear();
index->reserve(upairs->size());
for(const sort_pair& i: *upairs)
index->push_back(i.index);
}
if(inverse) {
inverse->assign(sorted.size(), 0);
// Since inverse->resize assigns index 0, we can skip the 0-th outer loop
sort_pair_iter sorted_i =
std::upper_bound(sorted_begin, sorted_end, uvals.front());
for(std::size_t i = 1; i < uvals.size(); i) {
const T& uval = uvals[i];
// we know there is at least one value
do
(*inverse)[sorted_i->index] = i;
while( sorted_i != sorted_end && sorted_i->value == uval);
/*
* We could skip the second comparison by using something like
* std::equal_range to determine the number of entries. We could even
* do something like an exponential search forward but I don't see how
* this is worth the complexity
*/
}
}
if(count) {
count->clear();
count->reserve(uvals.size());
sort_pair_iter range_start = sorted_begin;
for(const T& uval: uvals) {
sort_pair_iter range_end =
std::find_if(std::next(range_start), sorted_end,
[&uval](const sort_pair& i) -> bool {
return i.value != uval;
});
count->push_back(std::distance(range_start, range_end));
range_start = range_end;
}
/*
* Two inefficiencies:
* 1. If we want both inverse and count, we could fold both into
* the same loop
* 2. Again, we could use equal_range or a custom version based on an
* exponential search to reduce the number of comparisons.
* The reason we don't use equal_range is because it has worse
* performance in the worst case (uvals.size() == sorted.size()).
* We could make a heuristic and dispatch between both implementations
*/
}
return uvals;
}
The trick is to implement something resembling np.argsort. Everything else follows naturally. The inverse mapping is a bit tricky but not too hard when you do a pen-and-paper version of it first.
We could also use unordered_map. That would reduce the complexity to something like O(n) O(m log(m)) for n entries with m unique values. But that involves a ton of temporary allocations if the number of unique values not much smaller than n.
