Compare 2 columns in same excel sheet in pandas-CodePudding

In excel sheet i want to compare the 2 columns. The column names like file_1 and file_2. by using these 2 cols want to create the another col like 'diff' by using excel formula [countifs]. In case , if file_1 and file_2 column records matched it should be 0 (Zero) in diff column otherwise it should be 1. But, i am not as excepted result. Could please help anyone.

here is my inputs :

   file_1    file_2 
    G          G
    A          B
    C          F
    E          H
    A          C
    H          E

Output Dataframe :

    file_1       file_2       diff 
     G          G              0
     A          B              1
     C          F              1
     E          H              0
     A          C              0
     H          E              0

Sample code :

     df = pd.read_excel('file1.xlsx')
     df1 = df[df['file_1'].isin(df['file_2'])]
     df['diff'] = df1
     print(df)

CodePudding user response：

This should work;

import pandas as pd
import numpy as np
df['diff'] = np.where(df['file_1'] == df['file_2'] , '0', '1')

CodePudding user response：

You can use isin and convert to boolean:

df['diff'] = (~df['file_2'].isin(df['file_1'])).astype(int)

output:

  file_1 file_2  diff
0      G      G     0
1      A      B     1
2      C      F     1
3      E      H     0
4      A      C     0
5      H      E     0

CodePudding user response：

I think the statement:

df1 = df[df['file_1'].isin(df['file_2'])]

builds a dataframe (df1) where each row is a row from the original dataframe (df) when the value in column "file_1" (in the current row) is available in ANY row of column "file_2". It compares each value of column "file_1"with all values of column "file_2"

So, from your code, the df1 output is:

    file_1 file_2
0   G   G
2   C   F
3   E   H
5   H   E

I would suggest to use a simple statement like the following:

df2 = np.where(df.file_1 == df.file_2, 0,1)

It compares value in column "file_1" with the value in column "file_2"for each row and same row.

The code looks like the following:

import pandas as pd
import numpy as np
df = pd.read_excel('file1.xlsx')
df2 = np.where(df.file_1 == df.file_2, 0,1)
df['diff'] = df2
print(df)

The output is:

  file_1 file_2  diff
0      G      G     0
1      A      B     1
2      C      F     1
3      E      H     1
4      A      C     1
5      H      E     1