Here is a sample of the data set I have:
rows represent different accessions and columns represent different genes.
gene1 gene2 gene3
1 PRESENT LOST PRESENT
2 PRESENT PRESENT LOST
3 LOST PRESENT PRESENT
I would like to find the number of the common genes between accessions and create a dataframe accession vs accession as such:
> test_self
1 2 3
1 2 1 1
2 1 2 1
3 1 1 2
I used R to create this dataframe with the following code
gene1 = c("PRESENT", "PRESENT", "LOST")
gene2 = c("LOST", "PRESENT", "PRESENT")
gene3 = c("PRESENT","LOST","PRESENT")
test_PAV <- data.frame (gene1,gene2,gene3)
test_self <- data.frame(matrix(ncol = nrow(test_PAV), nrow = nrow(test_PAV)))
colnames(test_self)<-row.names(test_PAV)
row.names(test_self)<-row.names(test_PAV)
for (rowInQuestion in 1:nrow(test_PAV)){
for (rowBeingComparedTo in 1:nrow(test_PAV)){
commonGeneCount <- 0
for (col in 1:ncol(test_PAV)){
if(test_PAV[rowInQuestion,col]=='PRESENT'&&test_PAV[rowBeingComparedTo,col]=='PRESENT'){
commonGeneCount <- commonGeneCount 1
}
test_self[rowInQuestion,rowBeingComparedTo]<-commonGeneCount
}
}
}
But I'm well aware that this is a very inefficient solution. First of all, since the upper and the lower triangles are symmetric they don't have to be calculated twice.
I found a similar solution here but there the commonality is across a row but in my case in order to call it a commonality, the same gene (column) has to be 'PRESENT' in both accessions.
Thank you.
CodePudding user response:
First, let's keep only the PRESENT genes and build a frozenset of present genes per row:
s = (
df.where(df.eq('PRESENT'))
.stack()
.reset_index(level=1)
.groupby(level=0)['level_1'].agg(frozenset)
)
# 1 (gene3, gene1)
# 2 (gene2, gene1)
# 3 (gene3, gene2)
# Name: level_1, dtype: object
Now we can get the itertools.product (or, to be faster the combinations) to build the expected dataframe from the length of the intersection of genes set:
from itertools import product
out = (
pd.Series({(i,j): len(a.intersection(b))
for (i,a),(j,b) in product(zip(s.index, s), repeat=2)})
.unstack()
)
# 1 2 3
# 1 2 1 1
# 2 1 2 1
# 3 1 1 2
combinations are faster as only 1 of A/B or B/A is computed, and also not A/A:
from itertools import combinations
out = (
pd.Series({(i,j): len(a.intersection(b))
for (i,a),(j,b) in combinations(zip(s.index, s), r=2)})
.unstack()
)
# 2 3
# 1 1.0 1.0
# 2 NaN 1.0
CodePudding user response:
So I came up with another solution as well if anyone wants to use it;
Instead of using a dataframe with 'PRESENT' and 'LOST' I have used sed to transfer them into 1's and 0's.
sed -i '' 's/PRESENT/1/g' pav-matrix.binary.txt
sed -i '' 's/LOST/0/g' pav-matrix.binary.txt
Then used this binary matrix to compare every product of each row's and get the sum of the product array.
for (rowInQuestion in 1:nrow(pavMatrix_transposeBinary)){
for (rowBeingComparedTo in 1:nrow(pavMatrix_transposeBinary)){
pavSelfSimilarity[rowInQuestion,rowBeingComparedTo] <- sum(pavMatrix_transposeBinary[rowInQuestion,]*pavMatrix_transposeBinary[rowBeingComparedTo,])
}
}
