getting final column after comparing 4 columns using np.where()-CodePudding

using np.where i am able to get 4 columns match1,match2,match3 and match4.Final column MATCHED has to be updated based on the values of match 1,match2,match3 and match4.if all 4 are yes then i have to update MATCHED as 'Yes' if any 3 out of 4 are yes then also 'Yes' is to be updated.Else No.


final_data = final_data.copy()
final_data['Match1'] = np.where(final_data['PROCESSOR_sub_column'] == final_data['PROCESSOR_Title'] , 'Yes', 'No')
final_data['Match2'] = np.where( final_data['RAM'] == final_data['RAM_Title'] , 'Yes', 'No')
final_data['Match3'] = np.where( final_data['Storage'] == final_data['STORAGE_Title'] , 'Yes', 'No')
final_data['Match4'] = np.where( final_data['Storage Type'] == final_data['STORAGE_TYPE_Title'] , 'Yes', 'No')

 if (final_data['Match1']&final_data['Match2']&final_data['Match3']&final_data['Match4'] == 'Yes'):

 final_data[Matched] = 'Yes'

attaching the data already generated as screenshot :data looks like this

i also tried to get the column Matched directly using np.where() , but i was not successful to check four conditions at the same time. also i don't know whst i should pass as X.y in np.where(,x,y) while using 4 conditions.

apologies if this question is repeat, i have tried my level best to read all previous posts about related topic.

CodePudding user response：

You can compare all 4 columns for Yes, then count Trues by sum per rows and compare if greater or equal by 3 by Series.ge:

mask = df[['Match1','Match2','Match3','Match4']].eq('Yes').sum(axis=1).ge(3)
final_data['Matched'] = np.where(mask, 'Yes', 'No')

Another solution without helper columns id compare filtered DataFrames:

np.random.seed(2022)

c=['PROCESSOR_sub_column','RAM','Storage','Storage Type','PROCESSOR_Title',
   'RAM_Title','STORAGE_Title','STORAGE_TYPE_Title']
final_data = pd.DataFrame(np.random.randint(2, size=(10,8)), columns=c)

df1 = final_data[['PROCESSOR_sub_column','RAM','Storage','Storage Type']]
df2 = final_data[['PROCESSOR_Title','RAM_Title','STORAGE_Title','STORAGE_TYPE_Title']]
mask = df1.eq(df2.to_numpy()).sum(axis=1).ge(3)

final_data['Matched'] = np.where(mask, 'Yes', 'No')
print (final_data)
   PROCESSOR_sub_column  RAM  Storage  Storage Type  PROCESSOR_Title  \
0                     1    0        1             0                1   
1                     0    0        0             0                1   
2                     1    1        0             0                0   
3                     0    0        0             1                0   
4                     0    1        0             0                1   
5                     1    1        1             1                0   
6                     0    1        0             1                1   
7                     0    1        1             0                0   
8                     1    1        1             1                1   
9                     1    1        0             0                0   

   RAM_Title  STORAGE_Title  STORAGE_TYPE_Title Matched  
0          1              0                   1      No  
1          1              1                   1      No  
2          1              0                   1      No  
3          0              0                   1     Yes  
4          1              1                   1      No  
5          0              0                   0      No  
6          0              0                   0      No  
7          1              0                   1      No  
8          1              0                   1     Yes  
9          0              1                   0      No