Comparing multiple arrays to find their similarity in C-CodePudding

I am a beginner at C and I'm trying to figure out a solution to compare multiple arrays and find their similarity in percent, depending on which elements are the same.

For example, we have the following struct called Library, filled with Articles:

typedef struct
{
  int articleId;
  int count;
  int bibliography[3];
} Article;

typedef struct
{
  Article **articles;
  int count;
  int allocated;
} Library;

An example of what a filled Library would look like would be the following:

articleId   bibliography
000         [222,111]
111         [333,222]
222         [111]
333         [222,111]

My goal is to find the similarity between all the articles and their bibliography, regardless of the order they are in. In other words, which articles have cited the same articles?

What I mean by similarity: When comparing two arrays, we check the number of integers that have an equal integer in the other array. We then take this number and divide it by the length of the larger array and receive our similarity.

For example, 000 and 333 would be 100% similar, 222 and 333 would be 50% similar, 111 and 222 0% similar, and so on.

I am especially having trouble since the arrays all have different lengths. Does anyone have any ideas? My approach would include nested for loops, but I am not sure if that is the right direction.

My attempt to construct the algorithm in my code:

    double citationSimilarity (Library *lib)
{
    int currentCitation[3];
    int comparingCitation[3];
    int[] similarityResults;

    int similar = 0; //amount of similar citations
    double similarity = 0;
    int total = 0; //highest total amount of citations

    for(int i = 0; i < lib->count; i   ){
      for(int j = 1; j < lib->count; j   ){
              similar = 0;
      similarity = 0;
        for(int k = 0; k < lib->articles[i]->count; k   ){
          for(int l = 0; l < lib->articles[j]->count; l   ){
            currentCitation[k] = lib->articles[i]->bibliography[k];
            comparingCitation[l] = lib->articles[j]->bibliography[l];
            if(lib->articles[i]->articleId != lib->articles[j]->articleId){
            if(currentCitation[k] == comparingCitation[l]){
                similar  ;
                if(similar != 0){
                    similarity = (double)similar / (double)getLargerBibliography(lib->articles[j]->count, lib->articles[i]->count);
                }
              printf ("Great success! %d and %d have the similarity %lf \n",lib->articles[i]->articleId, lib->articles[j]->articleId, similarity);
            }
            }
          }
        }
      }
   }
}



int getLargerBibliography (int bib_a, int bib_b)
{

int max;
int min;


  if (bib_a >= bib_b)
  {
    max = bib_a;
    min = bib_b;
  }
  else
  {
    max = bib_b;
    min = bib_a;
  }

  return max;

}

CodePudding user response：

Comparing two arrays using a merge

Sort the elements of the first array (called array1 henceforth).
Sort the elements of the second array (called array2 henceforth).
Set i to 0.
Set j to 0.
Set count to 0.
While i is less than then number of elements in array1, and
while j is less than the number of elements in array2,
1. If array1[i] is less than array2[j],
  1. Increment i.
2. Else,
  1. If array1[i] is greater than array2[j],
    1. Increment j.
  2. Else,
    1. Increment count.
    2. Increment i.
    3. Increment j.
Find the largest of the length of array1 and the length of array2.
Divide count by the largest of the two array lengths.

This supports repeated elements in the arrays.

The complexity varies by the sorting algorithm used. With a generic sorting algorithm, comparing each pair takes O(1) additional memory, and O(N log N) time. But O(N) sorting can be achieved if the values are truly integers, at the cost of some memory. Either way, the sorting only needs to be performed once per array, not once each time it's compared, so the sorting cost is amortized.

Comparing two arrays using set difference

This is an O(N) memory and O(N) time solution that works even for non-integer keys. Each pass of the loop is more expensive, so it would probably take a large N to get any gain.

Create a set. (An associative array could be used.)
For each value in one of the arrays,
1. Add it to the set.
  (If using an associative array, use the value as the key, and use anything for the value.)
For each value in the other array,
1. If the value is in the previously-created set,
  1. Add one to the count.
Find the length of the first array.
Find the length of the second array.
Find the largest of the two array lengths.
Divide the count by the largest of the two array lengths.

Since it doesn't destroy the set, you can create for each array up front and reuse it when comparing different pairs of arrays!

This doesn't support repeated elements, but it can be adapted quite simply.