In excel sheet i want to compare the 2 columns. The column names like file_1 and file_2. by using these 2 cols want to create the another col like 'diff' by using excel formula [countifs]. In case , if file_1 and file_2 column records matched it should be 0 (Zero) in diff column otherwise it should be 1. But, i am not as excepted result. Could please help anyone.
here is my inputs :
file_1 file_2
G G
A B
C F
E H
A C
H E
Output Dataframe :
file_1 file_2 diff
G G 0
A B 1
C F 1
E H 0
A C 0
H E 0
Sample code :
df = pd.read_excel('file1.xlsx')
df1 = df[df['file_1'].isin(df['file_2'])]
df['diff'] = df1
print(df)
CodePudding user response:
This should work;
import pandas as pd
import numpy as np
df['diff'] = np.where(df['file_1'] == df['file_2'] , '0', '1')
CodePudding user response:
You can use isin and convert to boolean:
df['diff'] = (~df['file_2'].isin(df['file_1'])).astype(int)
output:
file_1 file_2 diff
0 G G 0
1 A B 1
2 C F 1
3 E H 0
4 A C 0
5 H E 0
CodePudding user response:
I think the statement:
df1 = df[df['file_1'].isin(df['file_2'])]
builds a dataframe (df1) where each row is a row from the original dataframe (df) when the value in column "file_1" (in the current row) is available in ANY row of column "file_2". It compares each value of column "file_1"with all values of column "file_2"
So, from your code, the df1 output is:
file_1 file_2
0 G G
2 C F
3 E H
5 H E
I would suggest to use a simple statement like the following:
df2 = np.where(df.file_1 == df.file_2, 0,1)
It compares value in column "file_1" with the value in column "file_2"for each row and same row.
The code looks like the following:
import pandas as pd
import numpy as np
df = pd.read_excel('file1.xlsx')
df2 = np.where(df.file_1 == df.file_2, 0,1)
df['diff'] = df2
print(df)
The output is:
file_1 file_2 diff
0 G G 0
1 A B 1
2 C F 1
3 E H 1
4 A C 1
5 H E 1
