Count number of common columns between every row of a dataframe to create an all-vs-all matrix-CodePudding

Here is a sample of the data set I have:

rows represent different accessions and columns represent different genes.

    gene1   gene2   gene3
1 PRESENT    LOST PRESENT
2 PRESENT PRESENT    LOST
3    LOST PRESENT PRESENT

I would like to find the number of the common genes between accessions and create a dataframe accession vs accession as such:

> test_self
  1 2 3
1 2 1 1
2 1 2 1
3 1 1 2

I used R to create this dataframe with the following code

gene1  = c("PRESENT", "PRESENT", "LOST")
gene2 = c("LOST", "PRESENT", "PRESENT")
gene3 = c("PRESENT","LOST","PRESENT")
test_PAV <- data.frame (gene1,gene2,gene3)

test_self <- data.frame(matrix(ncol = nrow(test_PAV), nrow = nrow(test_PAV)))
colnames(test_self)<-row.names(test_PAV)
row.names(test_self)<-row.names(test_PAV)

for (rowInQuestion in 1:nrow(test_PAV)){
  for (rowBeingComparedTo in 1:nrow(test_PAV)){
    commonGeneCount <- 0
    for (col in 1:ncol(test_PAV)){
      if(test_PAV[rowInQuestion,col]=='PRESENT'&&test_PAV[rowBeingComparedTo,col]=='PRESENT'){
        commonGeneCount <- commonGeneCount   1
      }
      test_self[rowInQuestion,rowBeingComparedTo]<-commonGeneCount
    } 
  }
}

But I'm well aware that this is a very inefficient solution. First of all, since the upper and the lower triangles are symmetric they don't have to be calculated twice.

I found a similar solution here but there the commonality is across a row but in my case in order to call it a commonality, the same gene (column) has to be 'PRESENT' in both accessions.

Thank you.

CodePudding user response：

First, let's keep only the PRESENT genes and build a frozenset of present genes per row:

s = (
 df.where(df.eq('PRESENT'))
   .stack()
   .reset_index(level=1)
   .groupby(level=0)['level_1'].agg(frozenset)
)

# 1    (gene3, gene1)
# 2    (gene2, gene1)
# 3    (gene3, gene2)
# Name: level_1, dtype: object

Now we can get the itertools.product (or, to be faster the combinations) to build the expected dataframe from the length of the intersection of genes set:

from itertools import product

out = (
 pd.Series({(i,j): len(a.intersection(b))
           for (i,a),(j,b) in  product(zip(s.index, s), repeat=2)})
   .unstack()
)

#    1  2  3
# 1  2  1  1
# 2  1  2  1
# 3  1  1  2

combinations are faster as only 1 of A/B or B/A is computed, and also not A/A:

from itertools import combinations

out = (
 pd.Series({(i,j): len(a.intersection(b))
           for (i,a),(j,b) in combinations(zip(s.index, s), r=2)})
   .unstack()
)

#      2    3
# 1  1.0  1.0
# 2  NaN  1.0

CodePudding user response：

So I came up with another solution as well if anyone wants to use it;

Instead of using a dataframe with 'PRESENT' and 'LOST' I have used sed to transfer them into 1's and 0's.

sed -i '' 's/PRESENT/1/g' pav-matrix.binary.txt
sed -i '' 's/LOST/0/g' pav-matrix.binary.txt

Then used this binary matrix to compare every product of each row's and get the sum of the product array.

for (rowInQuestion in 1:nrow(pavMatrix_transposeBinary)){
  for (rowBeingComparedTo in 1:nrow(pavMatrix_transposeBinary)){
    pavSelfSimilarity[rowInQuestion,rowBeingComparedTo] <- sum(pavMatrix_transposeBinary[rowInQuestion,]*pavMatrix_transposeBinary[rowBeingComparedTo,])
  }
}