Optimized multiple list traversal in parallel function-CodePudding

I wrote a function to traverse multiple DNA sequence lists and graph their homology as a fraction of identical base content, so that one can easily see regions of high homology vs. regions of low homology across multiple species.

To do this, I pass a numpy array of m sequence lists from an alignment file (clustal in this case):

    [['-' '-' '-' ... 'C' 'C' 'T']
     ['-' '-' '-' ... '-' '-' '-']
     ['-' '-' '-' ... '-' '-' '-']
     ['G' 'C' 'T' ... '-' '-' '-']]

I then iterate over each list and track the the values of each column. I add the values of each column to a dictionary if they're A T C or G and ignore all other cases (N or -). Not caring about which base pair is most common, I then take the max value of the dictionary and store the max/m-species in a separate list, via the following function:

def createHists(SequenceArray):
    Conserved = []
    Ind = 0
    while Ind < len(SequenceArray[0]):
        NucCounts = {"A":0, "T":0, "C":0, "G":0}
        ColumnConservation = []
        for Seqs in SequenceArray:
            ColumnConservation.append(Seqs[Ind])
        for Nucs in ColumnConservation:
            if Nucs in NucCounts:
                NucCounts[Nucs]  = 1
        if NucCounts[max(NucCounts, key=NucCounts.get)] > 1:
            ConservedN = NucCounts[max(NucCounts, key=NucCounts.get)]/len(ColumnConservation)
            Conserved.append(ConservedN)
        else:
            ConservedN = 0
            Conserved.append(ConservedN)
        Ind  = 1
    return(Conserved)

And the output is simply:

[0.75, 0.5, 0.75, 0.5, 0.5, 0.5, 0.75, 0.75, 0.5, 0.75, 0.5, 0.5, 0.75, 0.75]

My question is, given that I am iterating over every row is there a way to make this faster? Not that it isn't fast already (current size of my alignment file is 45000bp) but I was wondering if there are built in libraries that could parallelize multiple list traversals more efficiently, like itertools for instance. The major caveat here is that the number of sequence lists is unknown, it could be 4 like in this example but it could be more.

CodePudding user response：

I think you might have overthought this problem a bit.

from collections import Counter
import numpy as np

VALID_BASES = ['A', 'T', 'G', 'C']

# Array where rows are samples and columns are split from a string
arr = np.array([['-', '-', '-', 'C', 'C', 'T'],
     ['-', '-', '-', '-', '-', '-'],
     ['-', '-', '-', '-', '-', '-'],
     ['G', 'C', 'T', '-', '-', '-']])

Use a python counter:

counts = [Counter(x) for x in arr.T]

Then take the max proportion of your valid character:

max_proportion = [max(count[v] / arr.shape[0] for v in VALID_BASES) for count in counts]

>>> max_proportion
[0.25, 0.25, 0.25, 0.25, 0.25, 0.25]