What is an efficient way to find the highest average similarity/distance between two collections?

The problem:

Let's say I have collection A and collection B not necessarily of equal size.

Then I want to find the set of highest scoring pairs (a, b) for each a in A and each b in B.

The main stipulation is that each a in A and each b in B can only be used once. So if score(a1, b1) == score(a1, b2) we can only keep one of the two scores.

Here's a concrete example with a made-up similarity matrix. Each row represents an element of collection A and each column an element of collection B, so M[i][j] = score(a_i, b_j).

new double[][]{{1, 4, 1, 1}, // 4 occurs twice in a column
               {3, 1, 2, 3}, // 3 occurs twice in a row
               {1, 4, 1, 1}};

We would first say that (0, 1) contains the highest score in the first row. So a_0 and b_1 are no longer available for any match-ups.

Next, we would say that (1, 0) or (1, 3) contains the highest score in the second row. Since either is fair game, we choose (1, 0). Now a_1 and b_0 are off-limits.

Finally, we see that the highest score in the third row is at (2, 1). But because b_1 in B is spoken for, we have to choose something else. We instead choose (2, 3).

So our highest scoring pairs without repetition are (a_0, b_1), (a_1, b_0), (a_2, b_3).

Here's what I've tried:

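import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealVector;
import org.apache.commons.math3.util.Pair;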

 public static double rankBySimilarity(Array2DRowRealMatrix simMatrix) {

        Set<Integer> rowIdxs =
            IntStream.range(0, simMatrix.getRowDimension()).boxed().collect(Collectors.toSet());
        Set<Integer> colIdxs =
            IntStream.range(0, simMatrix.getColumnDimension()).boxed().collect(Collectors.toSet());

        Set<Pair<Integer, Integer>> bestScoreIdxs = new HashSet<>();

        for (int row : rowIdxs) {
            RealVector rowVec = simMatrix.getRowVector(row);
            int col = rowVec.getMaxIndex();
            bestScoreIdxs.add(new Pair<>(row, col));
            rowIdxs.remove(row);
            colIdxs.remove(col);

            if (rowIdxs.isEmpty() || colIdxs.isEmpty()) {
                break;
            }
        }

        double score = 0;
        for (Pair<Integer, Integer> coord : bestScoreIdxs) {
            int x = coord.getFirst();
            int y = coord.getSecond();
            score += simMatrix.getEntry(x, y);
        }

        return score / bestScoreIdxs.size();

    }

However, this throws a ConcurrentModificationException because I'm iterating over and altering a collection at the same time. I have read up on the error and understand it. What I can't figure out is an efficient alternative.

Maybe going down the path of using a similarity matrix isn't a good idea? Any suggestions or hints are welcome.

Edit: I just replaced rowIdxs with rowIdxs.iterator() and stepped through it in my debugger. The above logic doesn't work even when it doesn't throw an exception.
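The row-by-row procedure from the worked example can be sketched without removing anything from a collection during iteration. The version below is a minimal sketch, not the asker's code: it assumes a plain double[][] instead of Array2DRowRealMatrix (something like simMatrix.getData() should give you one), and the names RowGreedySketch, pickPairs and averageScore are made up for illustration. It marks taken columns in a boolean[] and skips them when looking for the row maximum, which is the step that getMaxIndex() cannot do on its own.

```java
import java.util.ArrayList;
import java.util.List;

final class RowGreedySketch {

    /** Returns {row, col} pairs chosen row by row, skipping columns that are already taken. */
    static List<int[]> pickPairs(double[][] m) {
        List<int[]> pairs = new ArrayList<>();
        if (m.length == 0) {
            return pairs;
        }
        boolean[] colUsed = new boolean[m[0].length];
        for (int row = 0; row < m.length; row++) {
            int best = -1;
            // Look only at free columns; ties go to the lowest column index.
            for (int col = 0; col < colUsed.length; col++) {
                if (!colUsed[col] && (best < 0 || m[row][col] > m[row][best])) {
                    best = col;
                }
            }
            if (best < 0) {
                break; // every column is taken, so no more pairs are possible
            }
            colUsed[best] = true;
            pairs.add(new int[] {row, best});
        }
        return pairs;
    }

    /** Average score over the chosen pairs, or 0 if there are none. */
    static double averageScore(double[][] m) {
        List<int[]> pairs = pickPairs(m);
        double sum = 0;
        for (int[] p : pairs) {
            sum += m[p[0]][p[1]];
        }
        return pairs.isEmpty() ? 0 : sum / pairs.size();
    }
}
```

On the example matrix this picks (0, 1), (1, 0) and then (2, 2) rather than (2, 3), because ties are broken toward the lower column index. Both columns score 1 in that row, so the average is (4 + 3 + 1) / 3 either way.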

CodePudding user response：

The main issue was that even though I was tracking used elements/coordinates, I was still querying them. Here I decided to take a different approach to make that impossible:

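import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import com.google.common.collect.Sets; // assumed: Sets.cartesianProduct comes from Guava
import org.apache.commons.math3.linear.Array2DRowRealMatrix;

public static double rankBySimilarity(Array2DRowRealMatrix simMatrix) {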
        Set<Integer> rowIdxs =
            IntStream.range(0, simMatrix.getRowDimension()).boxed().collect(Collectors.toSet());
        Set<Integer> colIdxs =
            IntStream.range(0, simMatrix.getColumnDimension()).boxed().collect(Collectors.toSet());

        List<List<Integer>> coords = new ArrayList<>(Sets.cartesianProduct(rowIdxs, colIdxs));
        List<Integer> setA = new ArrayList<>();
        List<Integer> setB = new ArrayList<>();

        Map<List<Integer>, Double> scores = new HashMap<>();
        coords.forEach(c -> scores.put(c, simMatrix.getEntry(c.get(0), c.get(1))));
        coords.sort(Comparator.comparing(scores::get).reversed());

        double score = 0;
        int requiredMet = 0;
        int required = Math.min(rowIdxs.size(), colIdxs.size());
        for (List<Integer> coord : coords) {
            int x = coord.get(0);
            int y = coord.get(1);

            if (!setA.contains(x) && !setB.contains(y)) {
                setA.add(x);
                setB.add(y);
                score += scores.get(coord);
                requiredMet += 1;
            }
            if (requiredMet == required) {
                break;
            }
        }

        return required == 0 ? 0 : score / required;
    }

CodePudding user response：

It sounds like you're describing the classic Assignment Problem.

The problem instance has a number of agents and a number of tasks. Any agent can be assigned to perform any task, incurring some cost that may vary depending on the agent-task assignment. It is required to perform as many tasks as possible by assigning at most one agent to each task and at most one task to each agent, in such a way that the total cost of the assignment is minimized.

You have a bunch of agents (rows) that you want to assign to different tasks (columns), with a one-to-one relationship between the two. Instead of minimizing a cost you want to maximize your profit / score, which is the same problem with the scores negated.

One option for solving this is to use the Hungarian Algorithm.
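As a rough illustration, here is a minimal sketch of one common O(n^2 * m) formulation of the Hungarian algorithm (row and column potentials plus augmenting paths). It is not taken from any particular library. The class and method names (AssignmentSketch, maxScoreAssignment, totalScore) are made up for this example, it assumes a rectangular double[][] with finite scores, and it negates the scores internally so that minimizing cost maximizes the total score.

```java
import java.util.Arrays;

final class AssignmentSketch {

    /** Returns match[row] = column assigned to that row, or -1 if the row stays unmatched. */
    static int[] maxScoreAssignment(double[][] score) {
        int rows = score.length;
        int cols = rows == 0 ? 0 : score[0].length;
        int[] match = new int[rows];
        Arrays.fill(match, -1);
        if (rows == 0 || cols == 0) {
            return match;
        }
        if (rows > cols) {
            // The core loop needs rows <= cols, so solve the transposed problem and invert it.
            double[][] t = new double[cols][rows];
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    t[j][i] = score[i][j];
                }
            }
            int[] colToRow = maxScoreAssignment(t);
            for (int c = 0; c < cols; c++) {
                match[colToRow[c]] = c;
            }
            return match;
        }
        double[] u = new double[rows + 1]; // row potentials (1-based)
        double[] v = new double[cols + 1]; // column potentials (1-based)
        int[] p = new int[cols + 1];       // p[j] = row matched to column j, 0 = none
        int[] way = new int[cols + 1];     // previous column on the augmenting path
        for (int i = 1; i <= rows; i++) {
            p[0] = i;
            int j0 = 0;
            double[] minv = new double[cols + 1];
            Arrays.fill(minv, Double.POSITIVE_INFINITY);
            boolean[] used = new boolean[cols + 1];
            do {
                used[j0] = true;
                int i0 = p[j0];
                int j1 = 0;
                double delta = Double.POSITIVE_INFINITY;
                for (int j = 1; j <= cols; j++) {
                    if (used[j]) {
                        continue;
                    }
                    // Reduced cost, with cost = -score so that we maximize the score.
                    double cur = -score[i0 - 1][j - 1] - u[i0] - v[j];
                    if (cur < minv[j]) {
                        minv[j] = cur;
                        way[j] = j0;
                    }
                    if (minv[j] < delta) {
                        delta = minv[j];
                        j1 = j;
                    }
                }
                for (int j = 0; j <= cols; j++) {
                    if (used[j]) {
                        u[p[j]] += delta;
                        v[j] -= delta;
                    } else {
                        minv[j] -= delta;
                    }
                }
                j0 = j1;
            } while (p[j0] != 0);
            do { // flip the matching along the augmenting path
                int j1 = way[j0];
                p[j0] = p[j1];
                j0 = j1;
            } while (j0 != 0);
        }
        for (int j = 1; j <= cols; j++) {
            if (p[j] != 0) {
                match[p[j] - 1] = j - 1;
            }
        }
        return match;
    }

    /** Sum of the scores of all matched (row, column) pairs. */
    static double totalScore(double[][] score, int[] match) {
        double sum = 0;
        for (int r = 0; r < match.length; r++) {
            if (match[r] >= 0) {
                sum += score[r][match[r]];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] m = {{1, 4, 1, 1}, {3, 1, 2, 3}, {1, 4, 1, 1}};
        int[] match = maxScoreAssignment(m);
        System.out.println(Arrays.toString(match));
        System.out.println(totalScore(m, match) / Math.min(m.length, m[0].length));
    }
}
```

The difference from a greedy pass shows up on matrices such as {{5, 4}, {4, 1}}: taking the highest score first gives 5 and then 1 (total 6), whereas the optimal assignment takes 4 and 4 (total 8). On the example matrix from the question both approaches reach a total of 8, so the average is 8 / 3.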