I am a beginner at C and I'm trying to figure out a solution to compare multiple arrays and find their similarity in percent, depending on which elements are the same.
For example, we have the following struct called Library, filled with Articles:
typedef struct
{
int articleId;
int count;
int bibliography[3];
} Article;
typedef struct
{
Article **articles;
int count;
int allocated;
} Library;
An example of what a filled Library would look like would be the following:
articleId bibliography
000 [222,111]
111 [333,222]
222 [111]
333 [222,111]
My goal is to find the similarity between all the articles and their bibliography, regardless of the order they are in. In other words, which articles have cited the same articles?
What I mean by similarity: When comparing two arrays, we check the number of integers that have an equal integer in the other array. We then take this number and divide it by the length of the larger array and receive our similarity.
For example, 000 and 333 would be 100% similar, 222 and 333 would be 50% similar, 111 and 222 0% similar, and so on.
I am especially having trouble since the arrays all have different lengths. Does anyone have any ideas? My approach would include nested for loops, but I am not sure if that is the right direction.
My attempt to construct the algorithm in my code:
double citationSimilarity (Library *lib)
{
int currentCitation[3];
int comparingCitation[3];
int[] similarityResults;
int similar = 0; //amount of similar citations
double similarity = 0;
int total = 0; //highest total amount of citations
for(int i = 0; i < lib->count; i ){
for(int j = 1; j < lib->count; j ){
similar = 0;
similarity = 0;
for(int k = 0; k < lib->articles[i]->count; k ){
for(int l = 0; l < lib->articles[j]->count; l ){
currentCitation[k] = lib->articles[i]->bibliography[k];
comparingCitation[l] = lib->articles[j]->bibliography[l];
if(lib->articles[i]->articleId != lib->articles[j]->articleId){
if(currentCitation[k] == comparingCitation[l]){
similar ;
if(similar != 0){
similarity = (double)similar / (double)getLargerBibliography(lib->articles[j]->count, lib->articles[i]->count);
}
printf ("Great success! %d and %d have the similarity %lf \n",lib->articles[i]->articleId, lib->articles[j]->articleId, similarity);
}
}
}
}
}
}
}
int getLargerBibliography (int bib_a, int bib_b)
{
int max;
int min;
if (bib_a >= bib_b)
{
max = bib_a;
min = bib_b;
}
else
{
max = bib_b;
min = bib_a;
}
return max;
}
CodePudding user response:
Comparing two arrays using a merge
- Sort the elements of the first array (called
array1henceforth). - Sort the elements of the second array (called
array2henceforth). - Set
ito0. - Set
jto0. - Set
countto0. - While
iis less than then number of elements inarray1, and
whilejis less than the number of elements inarray2,- If
array1[i]is less thanarray2[j],- Increment
i.
- Increment
- Else,
- If
array1[i]is greater thanarray2[j],- Increment
j.
- Increment
- Else,
- Increment
count. - Increment
i. - Increment
j.
- Increment
- If
- If
- Find the largest of the length of
array1and the length ofarray2. - Divide
countby the largest of the two array lengths.
This supports repeated elements in the arrays.
The complexity varies by the sorting algorithm used. With a generic sorting algorithm, comparing each pair takes O(1) additional memory, and O(N log N) time. But O(N) sorting can be achieved if the values are truly integers, at the cost of some memory. Either way, the sorting only needs to be performed once per array, not once each time it's compared, so the sorting cost is amortized.
Comparing two arrays using set difference
This is an O(N) memory and O(N) time solution that works even for non-integer keys. Each pass of the loop is more expensive, so it would probably take a large N to get any gain.
- Create a set. (An associative array could be used.)
- For each value in one of the arrays,
- Add it to the set.
(If using an associative array, use the value as the key, and use anything for the value.)
- Add it to the set.
- For each value in the other array,
- If the value is in the previously-created set,
- Add one to the count.
- If the value is in the previously-created set,
- Find the length of the first array.
- Find the length of the second array.
- Find the largest of the two array lengths.
- Divide the count by the largest of the two array lengths.
Since it doesn't destroy the set, you can create for each array up front and reuse it when comparing different pairs of arrays!
This doesn't support repeated elements, but it can be adapted quite simply.
