I am dealing with a dataset in which I want to remove duplicates. Duplicates are defined by having the same value for three fields stored as int64.
I am using C 17. I want my code to be as fast as possible (memory is less of a constraint). I do not care about ordering. I know nothing about the distribution of the int64 values.
My idea is to use an unordered_set with a hash of the three int64 as a key.
Here are my questions:
- Is the unordered_set the best option? How about a map?
- Which hash function should I use?
- Is it a good idea to put the three int64 into a string then hash that string?
Thanks for your help.
CodePudding user response:
I would use:
std::unordered_map<uint64_t, std::unordered_map<uint64_t, std::unordered_set<uint64_t>>>
- Is the unordered_set the best option? How about a map?
Anything unordered_ (I believe) will use hash tables, ordered - some kind of binary tree.
- Which hash function should I use?
Whatever std:: provides for uint64_t, unless you have a reason to believe that you can do better.
- Is it a good idea to put the three int64 into a string then hash that string?
What can you do with strings that you can't with integers? It most likely will be longer...
CodePudding user response:
Thanks, Vlad!
I also found an interesting implementation here: How do I combine hash values in C 0x?
inline void hash_combine(std::size_t& seed) { }
template <typename T, typename... Rest>
inline void hash_combine(std::size_t& seed, const T& v, Rest... rest) {
std::hash<T> hasher;
seed ^= hasher(v) 0x9e3779b9 (seed<<6) (seed>>2);
hash_combine(seed, rest...);
}
Usage:
std::size_t h=0;
hash_combine(h, obj1, obj2, obj3);
which I then plug into a flat std::unordered_set<std::size_t>.
This is roughly 2.5x faster than your proposal on my machine. On the other hand, your solution is simpler to read and does not require a hand-crafted hasher.
And then this solution is even faster:
bool insert_key_and_exists (const int64_t &a, const int64_t &b, std::unordered_set< std::pair<int64_t, int64_t> > *dict)
{
auto c = std::make_pair(a,b);
if (dict->contains(c))
return true;
dict->insert(c);
return false;
}
CodePudding user response:
Hash tables used in things like unordered_map are excellent for large objects. Especially if the key is smaller. But for this you would probably get the best speed using a vector. Sort it. Then run std::unique
#include <cstdint>
#include <vector>
#include <array>
#include <algorithm>
#include <iostream>
// sort and remove duplicates
void remove_dups(std::vector<std::array<int64_t, 3>>& v)
{
std::sort(v.begin(), v.end());
v.erase(std::unique(v.begin(), v.end()), v.end());
}
int main()
{
std::vector<std::array<int64_t, 3>> v{ {1,2,3}, {3,2,1}, {1,2,3}, {2,3,4} };
remove_dups(v);
for (const auto& i3 : v)
{
for (const auto& i : i3)
std::cout << i << " ";
std::cout << '\n';
}
}
