I want to extract the subsequences indicated by the first and last locations in data frame 'B'. The algorithm that I came up with is:
- Identify the rows of B that fall in the locations of A
- Find the relative position of the locations (i.e. shift the locations to make them start from 0)
- Start a for loop using the relative position as a range to extract the subsequences.
The issue with the above algorithm is its runtime. I need an alternative approach that runs faster than the existing one.
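For concreteness, the loop-based algorithm described above could look roughly like the sketch below. It is only an illustration of the three steps, not the exact code in use: extract_loop is a hypothetical name, and the example frames A and B are repeated here so the snippet runs on its own. It also assumes every row of B lies inside exactly one range of A.

```python
import pandas as pd

A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG', 'ACCACACCCCAAATGTGT'],
                  'first': [1, 100], 'last': [12, 117]})
B = pd.DataFrame({'first': [3, 8, 105, 115], 'last': [5, 12, 111, 117]})


def extract_loop(A, B):
    """Hypothetical helper: slow, row-by-row version of the three steps."""
    sequences = []
    for b in B.itertuples(index=False):
        # Step 1: find the row of A whose range contains this row of B
        # (assumes such a row exists)
        hit = A[(A['first'] <= b.first) & (A['last'] >= b.last)].iloc[0]
        # Step 2: shift the locations so the matched sequence starts at 0
        start = b.first - hit['first']
        stop = b.last - hit['first']
        # Step 3: slice the string; + 1 because 'last' is inclusive
        sequences.append(hit['first.sequence'][start:stop + 1])
    return B.assign(sequences=sequences)


result = extract_loop(A, B)
print(result)
```

Each iteration filters all of A again, which is where most of the time goes once B gets large.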
Desired output:
first last sequences
3 5 ACA
8 12 CGGAG
105 111 ACCCCAA
115 117 TGT
Used data frames:
import pandas as pd
A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG', 'ACCACACCCCAAATGTGT'],
                  'first': [1, 100], 'last': [12, 117]})
B = pd.DataFrame({'first': [3, 8, 105, 115], 'last': [5, 12, 111, 117]})
CodePudding user response:
One solution could be as follows:
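# Pair each row of B with the row of A whose 'last' is the nearest value >= B's 'last'
out = pd.merge_asof(B, A, on='last', direction='forward',
                    suffixes=('', '_y'))
# Make 'first' and 'last' relative to the start of the matched sequence
out.loc[:, ['first', 'last']] = \
    out.loc[:, ['first', 'last']].sub(out.first_y, axis=0)
# Slice each sequence row by row; + 1 because 'last' is inclusive
out = out.assign(sequences=out.apply(lambda row:
    row['first.sequence'][row['first']:row['last'] + 1],
    axis=1)).drop(['first.sequence', 'first_y'], axis=1)
# Restore the original 'first' and 'last' values from B
out.update(B)
print(out)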
first last sequences
0 3 5 ACA
1 8 12 CGGAG
2 105 111 ACCCCAA
3 115 117 TGT
Explanation
- First, use pd.merge_asof to match each row of B with the row of A that covers it. The merge is done on 'last' with direction='forward', so each 'last' in B is paired with the nearest 'last' in A that is greater than or equal to it. As a result, the rows of B starting at 3 and 8 are paired with the sequence that starts at 1, and the rows starting at 105 and 115 with the sequence that starts at 100. Now we know which string (sequence) needs slicing, and we also know where that string starts, i.e. at position 1 or 100 instead of the usual 0.
- We use this start position to find out where each string slice should begin and end. So, we do out.loc[:, ['first', 'last']].sub(out.first_y, axis=0). E.g. we "reset" 3 to 2 (minus 1) and 105 to 5 (minus 100).
- Now we can use df.apply to get the string slice for each sequence, essentially looping over each row. (If all slices had started and ended at the same indices, we could have used Series.str.slice instead.)
- Finally, we assign the result to out (as column 'sequences'), drop the columns we no longer need, and use df.update to "reset" the columns 'first' and 'last' to their original values.
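The steps above can also be written out one at a time, which makes the intermediate values easier to inspect. The sketch below is a variation rather than the code from the answer: it swaps df.apply for a plain list comprehension over zip, and the names merged, start, stop and result are chosen here for illustration. It assumes B is sorted on 'last', keeps its default index, and that every row of B falls inside a range of A. Whether it is actually faster on real data is something to measure.

```python
import pandas as pd

A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG', 'ACCACACCCCAAATGTGT'],
                  'first': [1, 100], 'last': [12, 117]})
B = pd.DataFrame({'first': [3, 8, 105, 115], 'last': [5, 12, 111, 117]})

# Step 1: pair each row of B with the row of A whose 'last' is the
# nearest value greater than or equal to B's 'last'
merged = pd.merge_asof(B, A, on='last', direction='forward',
                       suffixes=('', '_y'))

# Step 2: positions relative to the start of the matched sequence
start = merged['first'] - merged['first_y']
stop = merged['last'] - merged['first_y'] + 1  # + 1: 'last' is inclusive

# Step 3: slice every string (list comprehension in place of df.apply)
sequences = [seq[i:j]
             for seq, i, j in zip(merged['first.sequence'], start, stop)]

# B itself is never modified, so no df.update step is needed
result = B.assign(sequences=sequences)
print(result)
```

Because the relative positions live in separate Series, the original 'first' and 'last' columns stay untouched and the final update step drops out.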
