I have this df1:
CHR SNP Pos Ref Min
1 rs3094315 113934 A G
1 rs12124819 126070 A G
1 rs28765502 135853 C T
1 rs9419478 158202 C T
1 rs4881551 159076 G A
and this df2:
CHR SNP A1 A2 MAF NCHROBS
1 rs3094315 G A 0.1402 214
1 rs12124819 G A 0.1887 212
1 rs28765502 C T 0.3113 212
1 rs7419119 G T 0.2243 214
1 rs950122 C G 0.1944 216
The first three rows have the same SNP names, so what I want to do is something like merge d2 with df1 based on coincidence in "SNP" and if they match evaluate if "A1" in df2 and "Ref" in df1 are the same, if not just Change the position letters and then subtract 1-MAF value as this at the end:
CHR SNP A1 A2 MAF NCHROBS
1 rs3094315 A G 0.8598 214
1 rs12124819 A G 0.8113 212
1 rs28765502 C T 0.3113 212
1 rs7419119 G T 0.2243 214
1 rs950122 C G 0.1944 216
What I tried is
import pandas as pd
import numpy as pd
df3=pd.merge(df2,df1, on=['SNP'])
df3['subtract']=np.where(df3['A1']!=df3['Ref']###print the subtract result 1-MAF= in 'subtract col')
But I don't want to lose those values that don't match in merge and print the subtract result of 1-MAF's value.
CodePudding user response:
You can first right merge on "SNP", then use np.where to evaluate the condition. Then fill NaN values with the corresponding values. Finally drop the columns with missing values and rearrange to fit the desired outcome:
merged_df = df1.merge(df2, on='SNP', how='right', suffixes=('_',''))
merged_df['MAF'] = np.where(merged_df['Ref'].eq(merged_df['A2']), 1-merged_df['MAF'], merged_df['MAF'])
merged_df['Ref'] = merged_df['Ref'].fillna(merged_df['A1'])
merged_df['Min'] = merged_df['Min'].fillna(merged_df['A2'])
merged_df = merged_df.dropna(axis=1).drop(columns=['A2','A1']).rename(columns={'Ref':'A1', 'Min':'A2'})[['CHR','SNP','A1','A2','MAF','NCHROBS']]
Output:
CHR SNP A1 A2 MAF NCHROBS
0 1 rs3094315 A G 0.8598 214
1 1 rs12124819 A G 0.8113 212
2 1 rs28765502 C T 0.3113 212
3 1 rs7419119 G T 0.2243 214
4 1 rs950122 C G 0.1944 216
