Removing duplicates with three int64 as a key in C-CodePudding

I am dealing with a dataset in which I want to remove duplicates. Duplicates are defined by having the same value for three fields stored as int64.

I am using C 17. I want my code to be as fast as possible (memory is less of a constraint). I do not care about ordering. I know nothing about the distribution of the int64 values.

My idea is to use an unordered_set with a hash of the three int64 as a key.

Here are my questions:

Is the unordered_set the best option? How about a map?
Which hash function should I use?
Is it a good idea to put the three int64 into a string then hash that string?

Thanks for your help.

CodePudding user response：

I would use:

std::unordered_map<uint64_t, std::unordered_map<uint64_t, std::unordered_set<uint64_t>>>

Is the unordered_set the best option? How about a map?

Anything unordered_ (I believe) will use hash tables, ordered - some kind of binary tree.

Which hash function should I use?

Whatever std:: provides for uint64_t, unless you have a reason to believe that you can do better.

Is it a good idea to put the three int64 into a string then hash that string?

What can you do with strings that you can't with integers? It most likely will be longer...

CodePudding user response：

Thanks, Vlad!

I also found an interesting implementation here: How do I combine hash values in C 0x?

inline void hash_combine(std::size_t& seed) { }

template <typename T, typename... Rest>
inline void hash_combine(std::size_t& seed, const T& v, Rest... rest) {
    std::hash<T> hasher;
    seed ^= hasher(v)   0x9e3779b9   (seed<<6)   (seed>>2);
    hash_combine(seed, rest...);
}

Usage:

std::size_t h=0;
hash_combine(h, obj1, obj2, obj3);

which I then plug into a flat std::unordered_set<std::size_t>.

This is roughly 2.5x faster than your proposal on my machine. On the other hand, your solution is simpler to read and does not require a hand-crafted hasher.

And then this solution is even faster:

bool insert_key_and_exists (const int64_t &a, const int64_t &b, std::unordered_set< std::pair<int64_t, int64_t> > *dict)
{
    auto c = std::make_pair(a,b);
    if (dict->contains(c))
        return true;
    dict->insert(c);
    return false;
}

CodePudding user response：

Hash tables used in things like unordered_map are excellent for large objects. Especially if the key is smaller. But for this you would probably get the best speed using a vector. Sort it. Then run std::unique

#include <cstdint>
#include <vector>
#include <array>
#include <algorithm>
#include <iostream>

// sort and remove duplicates
void remove_dups(std::vector<std::array<int64_t, 3>>& v)
{
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
}

int main()
{
    std::vector<std::array<int64_t, 3>> v{ {1,2,3}, {3,2,1}, {1,2,3}, {2,3,4} };
    remove_dups(v);
    for (const auto& i3 : v)
    {
        for (const auto& i : i3)
            std::cout << i << " ";
        std::cout << '\n';
    }
}