I have a case where I have several million puzzle solutions which fall under this format:
| 10 | 12 | 13 | 14 | |
|---|---|---|---|---|
| 55 | A | C | T | G |
| 66 | A | - | G | - |
| 77 | C | A | - | T |
| 88 | - | A | G | G |
Puzzles are taken from a far larger collection of DNA sequences, where the first column is sequence ids, and the column is the index in the full collection. Sequence ids and columns are simple lists. The solution itself is a list of strings like ['ACTG', 'A-G-', 'CA-T', '-AGG']
I need to go over every single solution, and collect results to see what ended up where. So after going through all solutions one position may look something like solutions[55][10] = {'A': 61, 'C': 0, 'G': 0, 'T': 0, '-': 13}.
Easiest case is to initialize all possible positions as {'A': 0, 'C': 0, 'G': 0, 'T': 0, '-': 0}, then go through every position in every solution, and add to the counter, however with over 10 million solutions it starts getting very slow. Can anyone suggest a more optimized way this might be accomplished?
This is what I have at the moment, solution contains a collection of identical solutions, so sol is simply the first instance, and we can add based on the number of identical solutions:
def add_to_solution_map(cols_inspected, solution, puzzle, solution_mapping):
sol = get_sol(puzzle, solution)
direction = 0 if puzzle.columns[0] < puzzle.columns[-1] else 100
for i in range(len(sol)):
seq_id = puzzle.IDs[i]
row = sol[i]
for column in cols_inspected:
nuc = row[puzzle.columns.index(column)]
solution_mapping[seq_id][column][nuc] = len(solution['playerSolutions'])
return
Here's a simplified reproducible example with a greater area of 10x10 with puzzles covering various regions
def add_to_solution_map(columns, sequences, solutions, solution_mapping):
sol = solutions[0]
for i in range(len(sol)):
seq_id = sequences[i]
row = sol[i]
for column in columns:
nuc = row[columns.index(column)]
solution_mapping[seq_id][column][nuc] = len(solutions)
return
solutions_1 = [['ACGT', 'A-G-', 'CA-T', '-AGG'], ['ACGT', 'A-G-', 'CA-T', '-AGG']]
solutions_2 = [['AAAA', '-TTT', 'CCGG', 'GG--'], ['AAAA', '-TTT', 'CCGG', 'GG--']]
solution_mapping = {}
for i in range(10):
solution_mapping[i] = {}
for j in range(10):
solution_mapping[i][j] = {'A': 0, 'C': 0, 'G': 0, 'T': 0, '-': 0}
sequence_ids = list(range(1,5))
column_ids = list(range(4,8))
add_to_solution_map(column_ids, sequence_ids, solutions_1, solution_mapping)
sequence_ids = list(range(4,8))
column_ids = list(range(0,4))
add_to_solution_map(column_ids, sequence_ids, solutions_2, solution_mapping)
for key in list(solution_mapping.keys()):
print(solution_mapping[key])
CodePudding user response:
If I understand the problem correctly, it should be possible to handle it using pandas.
Generate some sample data:
import pandas as pd
import numpy as np
rng = np.random.default_rng(0)
a = rng.choice(list('ACTG-'), (10, 5))
ids = pd.DataFrame(rng.integers(101, 104, 10), columns = ['ID'])
vals = pd.DataFrame(a, columns = [10, 20, 30, 40, 50])
df = pd.concat([ids, vals], axis=1)
print(df)
This gives:
ID 10 20 30 40 50
0 103 - G T C C
1 102 A A A A -
2 102 G - T G -
3 103 G G T T -
4 103 C - G A C
5 103 - T A G G
6 102 - A A - A
7 103 T A C T T
8 103 T A A A A
9 102 G T G C G
Calculate counts:
counts = (df.melt(['ID'])
.groupby(by=['ID', 'variable', 'value'])
.size()
.unstack(level=2)
.fillna(0)
.astype(int))
print(counts)
Output:
value - A C G T
ID variable
102 10 1 1 0 2 0
20 1 2 0 0 1
30 0 2 0 1 1
40 1 1 1 1 0
50 2 1 0 1 0
103 10 2 0 1 1 2
20 1 2 0 2 1
30 0 2 1 1 2
40 0 2 1 1 2
50 1 1 2 1 1
Then you can retrieve data from each row as a dictionary. For example:
counts.loc[103, 10].to_dict()
gives:
{'-': 2, 'A': 0, 'C': 1, 'G': 1, 'T': 2}
If you prefer the results in the form of nested dictionaries, you can add:
counts_d = (counts
.groupby(by='ID')
.apply(lambda g: g.droplevel(0).T.to_dict())
.to_dict())
Then counts_d[103][10] gives:
{'-': 2, 'A': 0, 'C': 1, 'G': 1, 'T': 2}
