I have two dataframes, such as:
Tab1 :
ORFs_values Groups SP1 SP2 SP3
SP_ORF1 Group1 1 1 0
SP_ORF1 Group2 0 0 0
SP_ORF1 Group3 0 1 0
SP_ORF1 Group4 1 1 1
SP_ORF1 Group5 1 1 1
and another:
Tab2
Groups SP_names SP_names2
Group1 SP1 SP1_A
Group1 SP2 SP2_A
Group3 SP2 SP2_BL
Group4 SP1 SP1_BI
Group4 SP2 SP2_OP
Group4 SP3 SP3_I
Group5 SP3 SP3_0
Group5 SP3 SP3_P
Group5 SP3 SP3_K
I would like to replace every cell equal to 1 in Tab1 with the corresponding SP_names2 value from Tab2, matching on Groups and on SP_names (the Tab1 column name).
For instance, in the first row of Tab1, the SP_names2 value for SP1 in Group1 is SP1_A, so I replace the 1 with SP1_A, and so on.
When a Groups/SP_names pair is duplicated, as in the Group5 example, I need to join all of the duplicated SP_names2 values with a pipe.
I should then get the following output:
New_tab1:
ORFs_values Groups SP1 SP2 SP3
SP_ORF1 Group1 SP1_A SP2_A 0
SP_ORF1 Group2 0 0 0
SP_ORF1 Group3 0 SP2_BL 0
SP_ORF1 Group4 SP1_BI SP2_OP SP3_I
SP_ORF1 Group5 0 0 SP3_0|SP3_P|SP3_K
So far I have tried this:
df = df.set_index('Groups')
(df.mask(df.eq(1),
df2.set_index(['Groups','SP_names'])['SP_names2'].unstack())
.reset_index())
But it does not handle the case where several SP_names2 values share the same Groups/SP_names pair and should be joined with a pipe.
Here are the two dataframes in dict format, if that helps:
Tab1
{'ORFs_values': {0: 'SP_ORF1', 1: 'SP_ORF1', 2: 'SP_ORF1', 3: 'SP_ORF1'}, 'Groups': {0: 'Group1', 1: 'Group2', 2: 'Group3', 3: 'Group4'}, 'SP1': {0: 1, 1: 0, 2: 0, 3: 1}, 'SP2': {0: 1, 1: 0, 2: 1, 3: 1}, 'SP3': {0: 0, 1: 0, 2: 0, 3: 1}}
Tab2
{'Groups': {0: 'Group1', 1: 'Group1', 2: 'Group3', 3: 'Group4', 4: 'Group4', 5: 'Group4', 6: 'Group5', 7: 'Group5', 8: 'Group5'}, 'SP_names': {0: 'SP1', 1: 'SP2', 2: 'SP2', 3: 'SP1', 4: 'SP2', 5: 'SP3', 6: 'SP3', 7: 'SP3', 8: 'SP3'}, 'SP_names2': {0: 'SP1_A', 1: 'SP2_A', 2: 'SP2_BL', 3: 'SP1_BI', 4: 'SP2_OP', 5: 'SP3_I', 6: 'SP3_0 ', 7: 'SP3_P', 8: 'SP3_K '}}
CodePudding user response:
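Strip the stray whitespace from SP_names2 first, then use DataFrame.pivot_table with '|'.join as the aggregation function, so that duplicated Groups/SP_names pairs are concatenated: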
df2['SP_names2'] = df2['SP_names2'].str.strip()
df2 = df2.pivot_table(index='Groups',
                      columns='SP_names',
                      values='SP_names2',
                      aggfunc='|'.join,
                      fill_value=0)
print (df2)
SP_names SP1 SP2 SP3
Groups
Group1 SP1_A SP2_A 0
Group3 0 SP2_BL 0
Group4 SP1_BI SP2_OP SP3_I
Group5 0 0 SP3_0|SP3_P|SP3_K
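The same lookup table can probably also be built with groupby plus unstack, which makes the "collapse duplicates, then reshape" steps explicit. This is only a sketch of an equivalent route, not a replacement for the pivot_table call above. The name lookup is made up for illustration, and the data is retyped from the question's Tab2 dict:

```python
import pandas as pd

# Tab2 from the question, including the trailing spaces in the sample data.
df2 = pd.DataFrame({
    "Groups": ["Group1", "Group1", "Group3", "Group4", "Group4", "Group4",
               "Group5", "Group5", "Group5"],
    "SP_names": ["SP1", "SP2", "SP2", "SP1", "SP2", "SP3",
                 "SP3", "SP3", "SP3"],
    "SP_names2": ["SP1_A", "SP2_A", "SP2_BL", "SP1_BI", "SP2_OP", "SP3_I",
                  "SP3_0 ", "SP3_P", "SP3_K "],
})

# Remove the stray whitespace so the joined string is clean.
df2["SP_names2"] = df2["SP_names2"].str.strip()

# 1) Collapse duplicated (Groups, SP_names) pairs into one pipe-joined string.
# 2) Move SP_names into the columns, filling missing combinations with 0.
lookup = (
    df2.groupby(["Groups", "SP_names"])["SP_names2"]
       .agg("|".join)
       .unstack(fill_value=0)
)

print(lookup)
```

Aggregating before reshaping is what matters here: once each Groups/SP_names pair appears only once, the reshape no longer has to deal with duplicated index entries.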
If the cells equal to 1 in df1 coincide exactly with the entries present in df2 (every 1 has a match and every 0 has none), you can simply use DataFrame.join:
df = df1[['ORFs_values','Groups']].join(df2, on='Groups').fillna(0)
print (df)
ORFs_values Groups SP1 SP2 SP3
0 SP_ORF1 Group1 SP1_A SP2_A 0
1 SP_ORF1 Group2 0 0 0
2 SP_ORF1 Group3 0 SP2_BL 0
3 SP_ORF1 Group4 SP1_BI SP2_OP SP3_I
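Note that join never looks at the 0/1 values in df1. It just attaches the pivoted columns by group. A small sketch with deliberately mismatched, hypothetical data (Group1/SP2 is 0 in df1 although df2 has an entry for it) shows what that means. The names lookup and joined are made up for illustration:

```python
import pandas as pd

# Hypothetical data, not from the question: Group1/SP2 is 0 in df1
# even though df2 lists SP2_A for that pair.
df1 = pd.DataFrame({
    "ORFs_values": ["SP_ORF1", "SP_ORF1"],
    "Groups": ["Group1", "Group2"],
    "SP1": [1, 0],
    "SP2": [0, 0],
})
df2 = pd.DataFrame({
    "Groups": ["Group1", "Group1"],
    "SP_names": ["SP1", "SP2"],
    "SP_names2": ["SP1_A", "SP2_A"],
})

lookup = df2.pivot_table(index="Groups", columns="SP_names",
                         values="SP_names2", aggfunc="|".join, fill_value=0)

# join() copies the lookup columns for each group, whatever df1 contains.
joined = df1[["ORFs_values", "Groups"]].join(lookup, on="Groups").fillna(0)

print(joined)
```

Here Group1/SP2 ends up as SP2_A even though df1 had 0 there, so the join shortcut is only safe when the two tables agree on which cells are set.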
If you need to replace only the cells equal to 1 in df1 with the values from df2:
df = df1.set_index('Groups')
df = df.mask(df.eq(1), df2).reset_index().reindex(df1.columns, axis=1)
print (df)
ORFs_values Groups SP1 SP2 SP3
0 SP_ORF1 Group1 SP1_A SP2_A 0
1 SP_ORF1 Group2 0 0 0
2 SP_ORF1 Group3 0 SP2_BL 0
3 SP_ORF1 Group4 SP1_BI SP2_OP SP3_I
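The dict for Tab1 in the question stops at Group4, so the output above never exercises the pipe-joined case. Below is a self-contained sketch that assumes the five-row Tab1 shown at the top of the question (Group5 retyped by hand from that table) and runs the pivot and mask steps end to end. The names lookup, indexed and result are made up for illustration:

```python
import pandas as pd

# Tab1 as printed at the top of the question (five rows, including Group5).
df1 = pd.DataFrame({
    "ORFs_values": ["SP_ORF1"] * 5,
    "Groups": ["Group1", "Group2", "Group3", "Group4", "Group5"],
    "SP1": [1, 0, 0, 1, 1],
    "SP2": [1, 0, 1, 1, 1],
    "SP3": [0, 0, 0, 1, 1],
})

# Tab2 from the question's dict, trailing spaces included.
df2 = pd.DataFrame({
    "Groups": ["Group1", "Group1", "Group3", "Group4", "Group4", "Group4",
               "Group5", "Group5", "Group5"],
    "SP_names": ["SP1", "SP2", "SP2", "SP1", "SP2", "SP3",
                 "SP3", "SP3", "SP3"],
    "SP_names2": ["SP1_A", "SP2_A", "SP2_BL", "SP1_BI", "SP2_OP", "SP3_I",
                  "SP3_0 ", "SP3_P", "SP3_K "],
})
df2["SP_names2"] = df2["SP_names2"].str.strip()

# One row per group, one column per SP name, duplicates joined with "|".
lookup = df2.pivot_table(index="Groups", columns="SP_names",
                         values="SP_names2", aggfunc="|".join, fill_value=0)

# Replace only the cells equal to 1; mask() aligns lookup on Groups and
# on the SP1/SP2/SP3 column names, and leaves every other cell untouched.
indexed = df1.set_index("Groups")
result = (
    indexed.mask(indexed.eq(1), lookup)
           .reset_index()
           .reindex(df1.columns, axis=1)
)

print(result)
```

With this input, Group5 keeps 0 for SP1 and SP2, because Tab2 has no entry for them and the pivot filled those cells with 0, while SP3 becomes SP3_0|SP3_P|SP3_K, which matches the New_tab1 requested in the question.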
